Prepare
- class neer_match_utilities.prepare.Prepare(similarity_map, df_left, df_right, id_left, id_right, spacy_pipeline='', additional_stop_words=[])[source]
A class for preparing and processing data based on similarity mappings.
The Prepare class inherits from SuperClass and provides functionality to clean, preprocess, and align two pandas DataFrames (df_left and df_right) based on a given similarity map. This is useful for data cleaning and ensuring data compatibility before comparison or matching operations.
Attributes:
- similarity_map (dict)
A dictionary defining column mappings between the left and right DataFrames.
- df_left (pandas.DataFrame)
The left DataFrame to be processed.
- df_right (pandas.DataFrame)
The right DataFrame to be processed.
- id_left (str)
Column name representing unique IDs in the left DataFrame.
- id_right (str)
Column name representing unique IDs in the right DataFrame.
- spacy_pipeline (str)
Name of the spaCy model loaded for NLP tasks (e.g., “en_core_web_sm”). If empty, no spaCy pipeline is used. See https://spacy.io/models for available models.
- additional_stop_words (list of str)
Extra tokens to mark as stop words in the spaCy pipeline.
- __init__(similarity_map, df_left, df_right, id_left, id_right, spacy_pipeline='', additional_stop_words=[])[source]
- do_remove_stop_words(text)[source]
Removes stop words and non-alphabetic tokens from text.
- Parameters:
text (str) – The input text to process.
- Returns:
A space-separated string of unique lemmas after tokenization, lemmatization, and duplicate removal.
- Return type:
str
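As a rough illustration of the filtering and deduplication described above, the following sketch uses only the standard library. It is not the library's implementation: it skips lemmatization entirely (the real method relies on the configured spaCy pipeline), and the stop-word set, the lower-casing, the regex tokenizer, and the function name remove_stop_words_sketch are all assumptions made for this example.

```python
import re

# Hypothetical stop-word list for illustration only; spaCy ships its own.
STOP_WORDS = {"the", "a", "an", "of", "and", "in"}


def remove_stop_words_sketch(text: str) -> str:
    """Drop stop words and non-alphabetic tokens, keep first occurrences."""
    seen = set()
    kept = []
    # Assumption: lower-case the text and split it into word and
    # punctuation tokens with a simple regex (spaCy tokenizes differently).
    for token in re.findall(r"\w+|[^\w\s]", text.lower()):
        if not token.isalpha():   # drops numbers and punctuation
            continue
        if token in STOP_WORDS:   # drops stop words
            continue
        if token in seen:         # drops duplicates
            continue
        seen.add(token)
        kept.append(token)
    return " ".join(kept)


cleaned = remove_stop_words_sketch("The Bank of England, 1694 and the bank")
print(cleaned)
```

With a real spaCy pipeline, each surviving token would additionally be replaced by its lemma before duplicates are removed.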
- format(fill_numeric_na=False, to_numeric=[], fill_string_na=False, capitalize=False, lower_case=False, remove_stop_words=False)[source]
Cleans, processes, and aligns the columns of two DataFrames (df_left and df_right).
This method applies transformations based on column mappings defined in similarity_map. It handles numeric and string conversions, fills missing values, and ensures consistent data types between the columns of the two DataFrames.
- Parameters:
fill_numeric_na (bool, optional) – If True, fills missing numeric values with 0 before conversion to numeric dtype. Default is False.
to_numeric (list, optional) – A list of column names to be converted to numeric dtype. Default is an empty list.
fill_string_na (bool, optional) – If True, fills missing string values with empty strings. Default is False.
capitalize (bool, optional) – If True, capitalizes string values in non-numeric columns. Default is False.
lower_case (bool, optional) – If True, uses lower-case string values in non-numeric columns. Default is False.
remove_stop_words (bool, optional) – If True, applies stop-word removal and lemmatization to non-numeric columns using the do_remove_stop_words method. This requires a valid spaCy pipeline to be specified via spacy_pipeline when initializing the Prepare object. Default is False.
- Returns:
A tuple containing the processed left (df_left_processed) and right (df_right_processed) DataFrames.
- Return type:
tuple[pandas.DataFrame, pandas.DataFrame]
Notes
- Columns are processed and aligned according to the similarity_map:
  - If both columns are numeric, their types are aligned.
  - If types differ, columns are converted to strings while preserving NaN.
- Supports flexible handling of missing values and type conversions.
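The alignment rule in these notes can be sketched with plain pandas. This is only one plausible reading of the rule, not the code that format runs: the helper names align_pair and to_str_keep_nan are hypothetical, and the choice of np.result_type to pick a common numeric dtype is an assumption that covers NumPy-backed dtypes only.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype


def to_str_keep_nan(s: pd.Series) -> pd.Series:
    """Convert non-missing values to str; leave missing values untouched."""
    out = s.astype(object)
    mask = s.notna()
    out[mask] = s[mask].astype(str)
    return out


def align_pair(left: pd.Series, right: pd.Series):
    """Hypothetical helper: align the dtypes of two mapped columns."""
    if is_numeric_dtype(left) and is_numeric_dtype(right):
        # Both numeric: cast to a common dtype (assumed rule).
        common = np.result_type(left.dtype, right.dtype)
        return left.astype(common), right.astype(common)
    # Types differ: fall back to strings while preserving missing values.
    return to_str_keep_nan(left), to_str_keep_nan(right)


num_l, num_r = align_pair(pd.Series([1, 2]), pd.Series([1.5, np.nan]))
str_l, str_r = align_pair(pd.Series([1, 2, 3]), pd.Series(["1", None, "3"]))
```

In the second call the integer column becomes strings, while the missing entry on the right stays missing rather than turning into a literal "None" or "nan".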
- neer_match_utilities.prepare.similarity_map_to_dict(items)[source]
Convert a list of similarity mappings into a dictionary representation.
The function accepts a list of tuples, where each tuple represents a mapping with the form (left, right, similarity). If the left and right column names are identical, the dictionary key is that column name; otherwise, the key is formed as left~right.
- Returns:
A dictionary where keys are column names (or left~right for differing columns) and values are lists of similarity functions associated with those columns.
- Return type:
dict
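The key-building rule described above can be illustrated with a short re-implementation. This is a sketch written from the description, not the library's source; the function name similarity_map_to_dict_sketch and the similarity names in the sample list are placeholders.

```python
def similarity_map_to_dict_sketch(items):
    """Group (left, right, similarity) tuples by column key."""
    result = {}
    for left, right, similarity in items:
        # Identical names use the column name; otherwise join with "~".
        key = left if left == right else f"{left}~{right}"
        result.setdefault(key, []).append(similarity)
    return result


# Placeholder column and similarity names, for illustration only.
items = [
    ("name", "name", "jaro_winkler"),
    ("name", "name", "levenshtein"),
    ("city", "town", "jaro"),
]
mapping = similarity_map_to_dict_sketch(items)
print(mapping)
```

Here the two entries for name collapse into one key with a list of two similarities, and the differing pair city and town is stored under city~town.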
- neer_match_utilities.prepare.synth_mismatches(right, columns_fix, columns_change, str_metric, str_similarity_range, pct_diff_range, n_cols, n_mismatches=1, keep_missing=True, nan_share=0.0, empty_share=0.0, sample_share=1.0, id_right=None)[source]
Generate synthetic mismatches for a subset of rows in right.
The output contains all original rows plus additional synthetic rows created by modifying selected columns. Synthetic rows are deduplicated against the original data and against each other.
- Parameters:
right (pandas.DataFrame) – The DataFrame containing the original (true) observations.
columns_fix (list of str) – Column names whose values remain unchanged (copied from the original row).
columns_change (list of str) – Column names whose values are modified to create mismatches.
str_metric (str) – Name of the string-similarity metric (key in available_similarities()).
str_similarity_range (tuple of (float, float)) – Allowed range (min_str_sim, max_str_sim) for normalized string similarity of STRING columns in columns_change. Candidates must satisfy min_str_sim <= similarity(orig, candidate) <= max_str_sim.
pct_diff_range (tuple of (float, float)) – Allowed range (min_pct_diff, max_pct_diff) for percentage difference of NUMERIC columns in columns_change: abs(orig - candidate) / abs(orig). If orig == 0, any candidate != 0 is treated as percentage difference 1.0.
n_cols (int) – Number of columns from columns_change to modify per synthetic row. If n_cols < len(columns_change), pick that many at random. If n_cols > len(columns_change), all columns in columns_change are modified.
n_mismatches (int, default=1) – Number of synthetic mismatches to generate per selected original row.
keep_missing (bool, default=True) – If True, preserve NaN or empty-string values in columns_change of the original row (no change applied to those cells).
nan_share (float, default=0.0) – After deduplication, probability to inject NaN into each synthetic cell of columns_change.
empty_share (float, default=0.0) – After deduplication, probability to inject "" into each synthetic cell of columns_change. Applied after nan_share.
sample_share (float, default=1.0) – Proportion in [0, 1] of original rows in right to select at random for synthetic generation. For example, 0.5 selects floor(0.5 * n_rows).
id_right (str or None, default=None) – Name of a unique-ID column in right. If provided, synthetic rows receive new UUID4 IDs in this column. If None, no ID column is created or modified.
- Returns:
Expanded DataFrame with the original rows plus the synthetic mismatch rows.
- Return type:
pandas.DataFrame
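The numeric rules in the parameter list can be checked by hand with a few lines of Python. The sketch below restates the percentage-difference formula and the sample_share arithmetic; it is not taken from the library. The helper names are hypothetical, and two details are assumptions: the range bounds are treated as inclusive, and the case orig == 0 with candidate == 0 (which the description leaves open) returns 0.0.

```python
import math


def pct_diff(orig: float, candidate: float) -> float:
    """Percentage difference abs(orig - candidate) / abs(orig)."""
    if orig == 0:
        # Documented rule: any non-zero candidate counts as 1.0.
        # Assumption: an unchanged zero counts as 0.0.
        return 1.0 if candidate != 0 else 0.0
    return abs(orig - candidate) / abs(orig)


def in_pct_range(orig, candidate, pct_diff_range) -> bool:
    """Check a candidate against (min_pct_diff, max_pct_diff), inclusive."""
    min_pct_diff, max_pct_diff = pct_diff_range
    return min_pct_diff <= pct_diff(orig, candidate) <= max_pct_diff


def n_selected_rows(n_rows: int, sample_share: float) -> int:
    """Number of original rows selected: floor(sample_share * n_rows)."""
    return math.floor(sample_share * n_rows)


print(pct_diff(100, 120))                    # 20 / 100
print(in_pct_range(100, 150, (0.1, 0.3)))    # 0.5 lies outside the range
print(n_selected_rows(7, 0.5))               # floor(3.5)
```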
Notes
STRING columns in columns_change must meet the configured string-similarity range. If no candidate qualifies, the original string is perturbed until the similarity lies within the requested bounds.
NUMERIC columns in columns_change must meet the configured percentage difference range. If no candidate qualifies, the value is perturbed toward the boundary (e.g., orig * (1 ± min_pct_diff) or orig * (1 ± max_pct_diff)).
After generating synthetics, any synthetic row whose modified data portion exactly matches an original row (or another synthetic row) is dropped.
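The final deduplication step can be pictured with a small standard-library sketch. It represents rows as dictionaries instead of a DataFrame and is not the library's implementation. The function name drop_duplicate_synthetics is hypothetical, and reading "data portion" as every column except the ID column is an assumption.

```python
def drop_duplicate_synthetics(originals, synthetics, data_columns):
    """Keep synthetic rows whose data differs from all rows seen so far."""

    def data_key(row):
        # Compare only the data columns (assumed to exclude the ID column).
        return tuple(row[col] for col in data_columns)

    seen = {data_key(row) for row in originals}
    kept = []
    for row in synthetics:
        key = data_key(row)
        if key in seen:
            continue  # matches an original or an earlier synthetic row
        seen.add(key)
        kept.append(row)
    return kept


originals = [{"id": 1, "name": "anna", "age": 30}]
synthetics = [
    {"id": "s1", "name": "anna", "age": 30},  # same data as the original
    {"id": "s2", "name": "anne", "age": 33},
    {"id": "s3", "name": "anne", "age": 33},  # same data as s2
]
kept = drop_duplicate_synthetics(originals, synthetics, ["name", "age"])
print([row["id"] for row in kept])
```

Only s2 survives: s1 repeats the original row's data and s3 repeats s2, even though all three carry distinct IDs.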