Split
- exception neer_match_utilities.split.SplitError
Custom exception for errors in data splitting.
- neer_match_utilities.split.split_test_train(left, right, matches, test_ratio=0.3, validation_ratio=0.1)
Splits datasets into training, validation, and testing subsets.
This function ensures that only observations from left and right that are referenced in the matches DataFrame are included in the split process.
- Parameters:
left (pd.DataFrame) – The left dataset to split.
right (pd.DataFrame) – The right dataset to split.
matches (pd.DataFrame) – A DataFrame containing matching pairs between the left and right datasets. It must include columns ‘left’ and ‘right’, referencing indices in left and right.
test_ratio (float, optional) – The proportion of the data to be used for testing (default is 0.3).
validation_ratio (float, optional) – The proportion of the data to be used for validation (default is 0.1).
- Returns:
A tuple containing:
left_train : pd.DataFrame
right_train : pd.DataFrame
matches_train : pd.DataFrame
left_validation : pd.DataFrame
right_validation : pd.DataFrame
matches_validation : pd.DataFrame
left_test : pd.DataFrame
right_test : pd.DataFrame
matches_test : pd.DataFrame
- Return type:
tuple
- Raises:
SplitError – If the total counts of split subsets do not match the original dataset size.
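The following is a minimal sketch of the idea described above, not the library's implementation. It assumes that both ratios are fractions of the number of matched pairs, that the pairs are shuffled before splitting, and that the ‘left’ and ‘right’ columns of matches hold index labels of left and right. The function name sketch_split, the seed argument, and the dict return value are hypothetical and exist only for this illustration.

```python
import pandas as pd


def sketch_split(left, right, matches, test_ratio=0.3, validation_ratio=0.1, seed=0):
    """Hypothetical illustration of a match-driven split (not the library code)."""
    # Shuffle the matched pairs reproducibly.
    shuffled = matches.sample(frac=1.0, random_state=seed)

    # Assumption: both ratios are fractions of the number of matched pairs.
    n = len(shuffled)
    n_test = int(round(n * test_ratio))
    n_val = int(round(n * validation_ratio))

    parts = {
        "test": shuffled.iloc[:n_test],
        "validation": shuffled.iloc[n_test:n_test + n_val],
        "train": shuffled.iloc[n_test + n_val:],
    }

    out = {}
    for name, m in parts.items():
        # Keep only the records that this subset of matches references.
        out[name] = (left.loc[m["left"].unique()], right.loc[m["right"].unique()], m)
    return out


left = pd.DataFrame({"name": [f"L{i}" for i in range(12)]})
right = pd.DataFrame({"name": [f"R{i}" for i in range(12)]})
# Only the first 10 records on each side are matched; rows 10 and 11 are unreferenced.
matches = pd.DataFrame({"left": range(10), "right": range(10)})

splits = sketch_split(left, right, matches)
sizes = {name: len(m) for name, (_, _, m) in splits.items()}
```

In this sketch the subset sizes always add up to the number of matched pairs, which is the kind of consistency condition the Raises entry refers to. Unlike the sketch, split_test_train returns its nine DataFrames as a flat tuple, as listed under Returns.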