Prepare
- class neer_match_utilities.prepare.Prepare(similarity_map, df_left=pandas.DataFrame(), df_right=pandas.DataFrame(), id_left='id', id_right='id')
A class for preparing and processing data based on similarity mappings.
The Prepare class inherits from SuperClass and provides functionality to clean, preprocess, and align two pandas DataFrames (df_left and df_right) based on a given similarity map. This is useful for data cleaning and ensuring data compatibility before comparison or matching operations.
Attributes:
- similarity_map (dict)
A dictionary defining column mappings between the left and right DataFrames.
- df_left (pandas.DataFrame)
The left DataFrame to be processed.
- df_right (pandas.DataFrame)
The right DataFrame to be processed.
- id_left (str)
Column name representing unique IDs in the left DataFrame.
- id_right (str)
Column name representing unique IDs in the right DataFrame.
- format(fill_numeric_na=False, to_numeric=[], fill_string_na=False, capitalize=False)
Cleans, processes, and aligns the columns of two DataFrames (df_left and df_right).
This method applies transformations based on column mappings defined in similarity_map. It handles numeric and string conversions, fills missing values, and ensures consistent data types between the columns of the two DataFrames.
- Parameters:
fill_numeric_na (bool, optional) – If True, fills missing numeric values with 0 before conversion to numeric dtype. Default is False.
to_numeric (list, optional) – A list of column names to be converted to numeric dtype. Default is an empty list.
fill_string_na (bool, optional) – If True, fills missing string values with empty strings. Default is False.
capitalize (bool, optional) – If True, capitalizes string values in non-numeric columns. Default is False.
- Returns:
A tuple containing the processed left (df_left_processed) and right (df_right_processed) DataFrames.
- Return type:
tuple[pandas.DataFrame, pandas.DataFrame]
Notes
- Columns are processed and aligned according to the similarity_map:
If both columns are numeric, their types are aligned.
If types differ, columns are converted to strings while preserving NaN.
Missing-value handling and type conversion are optional and controlled by the fill_numeric_na, fill_string_na, and to_numeric parameters.
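The alignment rules in the notes above can be illustrated with a minimal pandas sketch. This is not the library's implementation: the helper name align_pair and the sample values are hypothetical, and the choice of common numeric dtype (via numpy.result_type) is an assumption made for illustration.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype


def align_pair(left: pd.Series, right: pd.Series):
    """Hypothetical helper: align the dtypes of two paired columns."""
    if is_numeric_dtype(left) and is_numeric_dtype(right):
        # Both numeric: cast both to a common numeric dtype (assumed rule).
        common = np.result_type(left.dtype, right.dtype)
        return left.astype(common), right.astype(common)
    if left.dtype != right.dtype:
        # Types differ: convert values to strings but keep missing values missing.
        def to_str(s):
            return s.map(lambda v: v if pd.isna(v) else str(v))
        return to_str(left), to_str(right)
    # Same non-numeric dtype: nothing to do.
    return left, right


# Numeric pair: int64 and float64 end up with the same dtype.
num_left, num_right = align_pair(pd.Series([1, 2, 3]),
                                 pd.Series([1.0, np.nan, 3.0]))

# Mixed pair: float column and string column both become strings,
# and the missing value in the float column stays missing.
mix_left, mix_right = align_pair(pd.Series([1900.0, np.nan]),
                                 pd.Series(["1900", "1950"]))
```

Note that in this sketch a float such as 1900.0 becomes the string "1900.0", so a column meant to be numeric is usually better listed in to_numeric than left to the string fallback.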
- neer_match_utilities.prepare.similarity_map_to_dict(items)
Convert a list of similarity mappings into a dictionary representation.
The function accepts a list of tuples, where each tuple represents a mapping with the form (left, right, similarity). If the left and right column names are identical, the dictionary key is that column name; otherwise, the key is formed as left~right.
- Returns:
A dictionary where keys are column names (or left~right for differing columns) and values are lists of similarity functions associated with those columns.
- Return type:
dict
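The key-building rule described above can be sketched in plain Python. This is an illustrative re-implementation under the stated rule, not the library's source: the function name similarity_map_to_dict_sketch is hypothetical, and the similarity names in the sample list are placeholder strings.

```python
def similarity_map_to_dict_sketch(items):
    """Illustrative sketch: group (left, right, similarity) tuples by column key."""
    result = {}
    for left, right, similarity in items:
        # Identical names use the name itself; otherwise join them as left~right.
        key = left if left == right else f"{left}~{right}"
        # Collect every similarity listed for the same column pair.
        result.setdefault(key, []).append(similarity)
    return result


# Placeholder similarity names, chosen only for the example.
items = [
    ("name", "name", "jaro_winkler"),
    ("name", "name", "levenshtein"),
    ("city", "town", "jaro"),
]
mapping = similarity_map_to_dict_sketch(items)
```

Here the two entries for name are merged under a single key, while the differing column names city and town produce the key city~town.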