Training

class neer_match_utilities.training.Training(similarity_map, df_left=pd.DataFrame(), df_right=pd.DataFrame(), id_left='id', id_right='id')

A class for managing and evaluating training processes, including reordering matches, evaluating performance metrics, and exporting models.

Inherits:

SuperClass : Base class providing shared attributes and methods.

evaluate_dataframe(evaluation_test, evaluation_train)

Combines and evaluates test and training performance metrics.

Parameters:
  • evaluation_test (dict) – Dictionary containing performance metrics for the test dataset.

  • evaluation_train (dict) – Dictionary containing performance metrics for the training dataset.

Returns:

A DataFrame with accuracy, precision, recall, F-score, and a timestamp for both test and training datasets.

Return type:

pd.DataFrame
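The combination step described above can be sketched as follows. This is a minimal illustration, not the library's implementation: the function name evaluate_dataframe_sketch, the metric keys (accuracy, precision, recall), the column names data, f1, and timestamp, and the choice to derive the F-score from precision and recall are all assumptions made for the example.

```python
from datetime import datetime

import pandas as pd


def evaluate_dataframe_sketch(evaluation_test, evaluation_train):
    """Combine two metric dictionaries into one DataFrame (hypothetical helper)."""
    # One row per dataset; the dictionary keys become columns.
    frame = pd.DataFrame([evaluation_test, evaluation_train])
    frame.insert(0, "data", ["test", "train"])

    # Assumption: the F-score is the harmonic mean of precision and recall.
    precision, recall = frame["precision"], frame["recall"]
    frame["f1"] = 2 * precision * recall / (precision + recall)

    # Record when the evaluation summary was produced.
    frame["timestamp"] = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    return frame


# Made-up metric values, for illustration only.
summary = evaluate_dataframe_sketch(
    {"accuracy": 0.90, "precision": 0.80, "recall": 0.50},
    {"accuracy": 0.95, "precision": 0.90, "recall": 0.70},
)
print(summary)
```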

matches_reorder(matches, matches_id_left, matches_id_right)

Reorders a matches DataFrame to include indices from the left and right DataFrames instead of their original IDs.

Parameters:
  • matches (pd.DataFrame) – DataFrame containing matching pairs.

  • matches_id_left (str) – Column name in the matches DataFrame corresponding to the left IDs.

  • matches_id_right (str) – Column name in the matches DataFrame corresponding to the right IDs.

Returns:

A DataFrame with columns left and right, representing the indices of matching pairs in the left and right DataFrames.

Return type:

pd.DataFrame
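One way to picture the ID-to-index translation is the sketch below. It assumes that "indices" means the positional row numbers of the left and right DataFrames and that IDs are unique. The helper matches_reorder_sketch and its extra arguments are hypothetical and exist only so the example is self-contained; the actual method reads the DataFrames and ID columns from the Training instance.

```python
import pandas as pd


def matches_reorder_sketch(matches, left, right,
                           matches_id_left, matches_id_right,
                           id_left="id", id_right="id"):
    """Translate ID pairs into row-position pairs (hypothetical helper)."""
    # Lookup tables: ID value -> positional index in each DataFrame.
    left_pos = pd.Series(range(len(left)), index=left[id_left])
    right_pos = pd.Series(range(len(right)), index=right[id_right])

    # Replace every ID in the matches table by its row position.
    return pd.DataFrame({
        "left": matches[matches_id_left].map(left_pos).to_numpy(),
        "right": matches[matches_id_right].map(right_pos).to_numpy(),
    })


# Made-up data, for illustration only.
left = pd.DataFrame({"id": [10, 20, 30], "name": ["a", "b", "c"]})
right = pd.DataFrame({"id": [7, 8, 9], "name": ["c", "x", "a"]})
matches = pd.DataFrame({"l": [30, 10], "r": [7, 9]})

reordered = matches_reorder_sketch(matches, left, right, "l", "r")
print(reordered)
```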

performance_statistics_export(model, model_name, target_directory, evaluation_train={}, evaluation_test={})

Exports the trained model, similarity map, and evaluation metrics to the specified directory.

Parameters:
  • model (Model object) – The trained model to export.

  • model_name (str) – Name of the model to use as the export directory name.

  • target_directory (Path) – The target directory where the model will be exported.

  • evaluation_train (dict, optional) – Performance metrics for the training dataset (default is {}).

  • evaluation_test (dict, optional) – Performance metrics for the test dataset (default is {}).

Returns:

None

Notes:

  • The method creates a subdirectory named after model_name inside target_directory.

  • If evaluation_train and evaluation_test are provided, their metrics are saved as a CSV file.

  • Similarity maps are serialized using dill and saved in the export directory.
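The directory layout implied by these notes can be sketched with the standard library alone. Everything named here is an assumption for illustration: the helper export_sketch, the file names performance.csv and similarity_map.pkl, and the placeholder similarity map. The sketch uses pickle as a stand-in for dill and leaves out saving the model itself, since that depends on the model framework.

```python
import csv
import pickle
import tempfile
from pathlib import Path


def export_sketch(similarity_map, model_name, target_directory,
                  evaluation_train=None, evaluation_test=None):
    """Write metrics and a similarity map to <target>/<model_name> (hypothetical)."""
    # Subdirectory named after the model, inside the target directory.
    export_dir = Path(target_directory) / model_name
    export_dir.mkdir(parents=True, exist_ok=True)

    # Metrics are written only when both dictionaries are non-empty.
    if evaluation_train and evaluation_test:
        fields = sorted(set(evaluation_train) | set(evaluation_test))
        with open(export_dir / "performance.csv", "w", newline="") as fh:
            writer = csv.DictWriter(fh, fieldnames=["dataset", *fields])
            writer.writeheader()
            writer.writerow({"dataset": "train", **evaluation_train})
            writer.writerow({"dataset": "test", **evaluation_test})

    # pickle stands in for dill here; both expose dump(obj, file).
    with open(export_dir / "similarity_map.pkl", "wb") as fh:
        pickle.dump(similarity_map, fh)
    return export_dir


with tempfile.TemporaryDirectory() as tmp:
    out = export_sketch(
        {"name": ["jaro_winkler"]},  # placeholder similarity map
        "demo_model",
        tmp,
        evaluation_train={"accuracy": 0.98},
        evaluation_test={"accuracy": 0.95},
    )
    exported = sorted(p.name for p in out.iterdir())
    print(exported)
```

The sketch defaults the metric arguments to None rather than {} to avoid sharing a mutable default between calls; the documented signature uses {}.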

neer_match_utilities.training.focal_loss(alpha=0.25, gamma=2.0)

Focal Loss function for binary classification tasks.

Focal Loss is designed to address class imbalance by assigning higher weights to the minority class and focusing the model’s learning on hard-to-classify examples. It reduces the loss contribution from well-classified examples, making it particularly effective for imbalanced datasets.
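For reference, the alpha-balanced focal loss for a single example, as defined by Lin et al. (2017), can be written as below. Here p is the predicted probability of the positive class and y is the true label. Whether this implementation averages or sums the per-example values is not stated in this documentation.

```latex
\mathrm{FL}(p_t) = -\alpha_t \,(1 - p_t)^{\gamma}\, \log(p_t),
\qquad
p_t =
\begin{cases}
  p     & \text{if } y = 1 \\
  1 - p & \text{otherwise}
\end{cases},
\qquad
\alpha_t =
\begin{cases}
  \alpha     & \text{if } y = 1 \\
  1 - \alpha & \text{otherwise}
\end{cases}
```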

Parameters:
  • alpha (float, optional, default=0.25) –

    Weighting factor for the positive class (minority class).

    • Must be in the range [0, 1].

    • A higher value increases the loss contribution from the positive class (underrepresented class) relative to the negative class (overrepresented class).

  • gamma (float, optional, default=2.0) –

    Focusing parameter that reduces the loss contribution from easy examples.

    • gamma = 0: No focusing, equivalent to Weighted Binary Cross-Entropy Loss.

    • gamma > 0: Focuses more on hard-to-classify examples.

    • Larger values emphasize harder examples more strongly.

Returns:

loss – A loss function that computes the focal loss given the true labels (y_true) and predicted probabilities (y_pred).

Return type:

callable

Raises:

ValueError – If alpha is not in the range [0, 1].

Notes

  • The positive class (minority or underrepresented class) is weighted by alpha.

  • The negative class (majority or overrepresented class) is automatically weighted by 1 - alpha.

  • Ensure alpha is set appropriately to reflect the level of imbalance in the dataset.
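The weighting and focusing behaviour described above can be checked numerically with a small pure-Python sketch. It is not the library's implementation: the name focal_loss_sketch, the eps clipping constant, and the choice to average over examples are assumptions made for the example.

```python
import math


def focal_loss_sketch(alpha=0.25, gamma=2.0, eps=1e-7):
    """Return a binary focal loss callable (illustrative, pure Python)."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in the range [0, 1]")

    def loss(y_true, y_pred):
        total = 0.0
        for y, p in zip(y_true, y_pred):
            # Clip the prediction to avoid log(0).
            p = min(max(p, eps), 1.0 - eps)
            # p_t: probability the model assigns to the true class.
            p_t = p if y == 1 else 1.0 - p
            # alpha for positives, 1 - alpha for negatives.
            alpha_t = alpha if y == 1 else 1.0 - alpha
            total += -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
        # Assumption: average over examples.
        return total / len(y_true)

    return loss


# One easy positive (p = 0.95) and one hard positive (p = 0.30).
easy_and_hard = ([1, 1], [0.95, 0.30])
plain = focal_loss_sketch(alpha=0.5, gamma=0.0)    # weighted cross-entropy
focused = focal_loss_sketch(alpha=0.5, gamma=2.0)  # down-weights easy examples
print(plain(*easy_and_hard), focused(*easy_and_hard))
```

With gamma = 0 the factor (1 - p_t)^gamma equals 1, so the sketch reduces to weighted binary cross-entropy; with gamma = 2 the easy example contributes far less than the hard one.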

References

Lin, T.-Y., Goyal, P., Girshick, R., He, K., & Dollár, P. (2017). Focal Loss for Dense Object Detection. In ICCV.

Explanation of Key Terms

  • Positive Class (Underrepresented):

    • Refers to the class with fewer examples in the dataset.

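    • Weighted by alpha. For highly imbalanced datasets, set alpha above 0.5 so that the positive class outweighs the negative class; with the default of 0.25, the positive class receives the smaller weight.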

  • Negative Class (Overrepresented):

    • Refers to the class with more examples in the dataset.

    • Its weight is automatically 1 - alpha.