Prepare

class neer_match_utilities.prepare.Prepare(similarity_map, df_left=pandas.DataFrame(), df_right=pandas.DataFrame(), id_left='id', id_right='id')

A class for preparing and processing data based on similarity mappings.

The Prepare class inherits from SuperClass and provides functionality to clean, preprocess, and align two pandas DataFrames (df_left and df_right) based on a given similarity map. This is useful for data cleaning and ensuring data compatibility before comparison or matching operations.

Attributes:

similarity_map : dict

A dictionary defining column mappings between the left and right DataFrames.

df_left : pandas.DataFrame

The left DataFrame to be processed.

df_right : pandas.DataFrame

The right DataFrame to be processed.

id_left : str

Column name representing unique IDs in the left DataFrame.

id_right : str

Column name representing unique IDs in the right DataFrame.

format(fill_numeric_na=False, to_numeric=[], fill_string_na=False, capitalize=False)

Cleans, processes, and aligns the columns of two DataFrames (df_left and df_right).

This method applies transformations based on column mappings defined in similarity_map. It handles numeric and string conversions, fills missing values, and ensures consistent data types between the columns of the two DataFrames.

Parameters:
  • fill_numeric_na (bool, optional) – If True, fills missing numeric values with 0 before conversion to numeric dtype. Default is False.

  • to_numeric (list, optional) – A list of column names to be converted to numeric dtype. Default is an empty list.

  • fill_string_na (bool, optional) – If True, fills missing string values with empty strings. Default is False.

  • capitalize (bool, optional) – If True, capitalizes string values in non-numeric columns. Default is False.

Returns:

A tuple containing the processed left (df_left_processed) and right (df_right_processed) DataFrames.

Return type:

tuple[pandas.DataFrame, pandas.DataFrame]

Notes

  • Columns are processed and aligned according to the similarity_map:
    • If both columns are numeric, their types are aligned.

    • If types differ, columns are converted to strings while preserving NaN.

  • Supports flexible handling of missing values and type conversions.
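The alignment rule in the Notes can be illustrated with a small pandas sketch. This is not the library's implementation: the helper name align_pair, the choice of float64 as the shared numeric dtype, and the sample data are all assumptions made for illustration.

```python
import pandas as pd


def align_pair(left: pd.Series, right: pd.Series):
    """Hypothetical helper mirroring the alignment rule described in the Notes."""
    both_numeric = (
        pd.api.types.is_numeric_dtype(left)
        and pd.api.types.is_numeric_dtype(right)
    )
    if both_numeric:
        # Both numeric: cast to one shared dtype (float64 is an assumption here).
        return left.astype("float64"), right.astype("float64")

    # Otherwise: turn every non-missing value into a string, leave missing values alone.
    def to_str(series: pd.Series) -> pd.Series:
        return series.map(lambda v: v if pd.isna(v) else str(v))

    return to_str(left), to_str(right)


# Numeric pair: int64 and float64 end up with the same dtype.
num_left, num_right = align_pair(pd.Series([1, 2]), pd.Series([1.0, float("nan")]))
print(num_left.dtype, num_right.dtype)

# Mixed pair: the integer column becomes strings, the missing entry stays missing.
mix_left, mix_right = align_pair(pd.Series([1, 2, 3]), pd.Series(["1", None, "3"]))
print(mix_left.tolist())
print(mix_right.isna().tolist())
```

Keeping missing values as missing, rather than turning them into the literal string "nan", means a later similarity comparison can still tell an absent value apart from a real one.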

neer_match_utilities.prepare.similarity_map_to_dict(items)

Convert a list of similarity mappings into a dictionary representation.

The function accepts a list of tuples, where each tuple represents a mapping with the form (left, right, similarity). If the left and right column names are identical, the dictionary key is that column name; otherwise, the key is formed as left~right.

Returns:

A dictionary where keys are column names (or left~right for differing columns) and values are lists of similarity functions associated with those columns.

Return type:

dict
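
The key-building rule described above can be sketched in plain Python. The function below is a hypothetical re-implementation written only from this description, not the library's source, and the similarity names in the sample data are placeholder strings.

```python
def similarity_map_to_dict_sketch(items):
    """Hypothetical stand-in that follows the documented behaviour."""
    result = {}
    for left, right, similarity in items:
        # Identical column names use the name itself; otherwise join with "~".
        key = left if left == right else f"{left}~{right}"
        # Collect every similarity listed for the same column pair.
        result.setdefault(key, []).append(similarity)
    return result


# Placeholder mappings: (left column, right column, similarity name).
items = [
    ("name", "name", "sim_a"),
    ("name", "name", "sim_b"),
    ("city", "town", "sim_a"),
]
mapping = similarity_map_to_dict_sketch(items)
print(mapping)
```

Two tuples that refer to the same column pair are merged under one key, which is why the values are lists rather than single similarity functions.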