Panel
- class neer_match_utilities.panel.GenerateID(df_panel, panel_var, time_var, model, similarity_map=None, prediction_threshold=0.9, subgroups=None, relation='m:m')[source]
A class to generate and harmonize unique IDs across time periods for panel data.
- group_by_subgroups():
Group the panel data into subgroups.
- generate_suggestions(df_slice):
Generate ID suggestions for consecutive time periods.
- harmonize_ids(suggestions, periods, original_df):
Harmonize IDs across time periods.
- assign_ids(id_mapping):
Assign unique IDs to the harmonized IDs.
- execute():
Execute the full ID generation and harmonization process.
- __init__(df_panel, panel_var, time_var, model, similarity_map=None, prediction_threshold=0.9, subgroups=None, relation='m:m')[source]
Initialize the GenerateID class.
- Parameters:
df_panel (pd.DataFrame) – The panel dataset.
panel_var (str) – Name of the panel identifier variable to be created.
time_var (str) – The time period variable.
subgroups (list, optional) – List of subgroup variables for slicing. Defaults to None.
model (object) – A model object with a suggest method.
similarity_map (dict, optional) – A dictionary of similarity functions for columns. Defaults to None.
prediction_threshold (float, optional) – Threshold for predictions. Defaults to 0.9.
relation (str, optional) – Relationship between observations in cross-sectional data. Defaults to ‘m:m’.
- assign_ids(id_mapping)[source]
Assign unique IDs to the harmonized IDs.
- Parameters:
id_mapping (pd.DataFrame) – The harmonized ID mapping dataframe.
- Returns:
Dataframe with assigned unique IDs.
- Return type:
pd.DataFrame
- execute()[source]
Execute the full ID generation and harmonization process.
- Returns:
The final ID mapping.
- Return type:
pd.DataFrame
- generate_suggestions(df_slice)[source]
Generate ID suggestions for consecutive time periods.
- Parameters:
df_slice (pd.DataFrame) – A dataframe slice containing data to process.
- Returns:
A tuple containing:
  - pd.DataFrame: A concatenated dataframe of suggestions.
  - list of int: A list of periods.
- Return type:
tuple
- group_by_subgroups()[source]
Group the panel data into subgroups.
- Returns:
Grouped dataframe by subgroups.
- Return type:
pd.core.groupby.generic.DataFrameGroupBy
- harmonize_ids(suggestions, periods, original_df)[source]
Harmonize IDs across time periods.
- Parameters:
suggestions (pd.DataFrame) – The dataframe with suggestions.
periods (list of int) – List of periods.
original_df (pd.DataFrame) – The original dataframe.
- Returns:
Harmonized ID mapping.
- Return type:
pd.DataFrame
- relations_left_right(df, relation=None)[source]
Apply validation rules to enforce relationships between matched observations.
- Parameters:
df (pd.DataFrame) – DataFrame containing ‘left’, ‘right’, and ‘prediction’ columns.
relation (str, optional) – Validation mode for relationships. If None, defaults to self.relation. Options:
  - ‘m:m’ : Many-to-many, no duplicates removed.
  - ‘1:m’ : Unique ‘left’ values.
  - ‘m:1’ : Unique ‘right’ values.
  - ‘1:1’ : Unique ‘left’ and ‘right’ values.
- Returns:
A reduced DataFrame with relationships enforced based on relation.
- Return type:
pd.DataFrame
- Raises:
ValueError – If relation is not one of [‘m:m’, ‘1:m’, ‘m:1’, ‘1:1’].
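The four relation modes can be illustrated with a small standalone sketch. It is not the library's implementation: the function name `enforce_relation` is hypothetical, and the tie-breaking rule (keep the row with the highest ‘prediction’ when a value repeats) is an assumption, since the documentation only states which columns end up unique.

```python
import pandas as pd

VALID_RELATIONS = ("m:m", "1:m", "m:1", "1:1")


def enforce_relation(df: pd.DataFrame, relation: str = "m:m") -> pd.DataFrame:
    """Hypothetical helper: reduce df so that it satisfies `relation`."""
    if relation not in VALID_RELATIONS:
        raise ValueError(f"relation must be one of {VALID_RELATIONS}, got {relation!r}")
    if relation == "m:m":
        # Many-to-many: nothing is removed.
        return df.copy()

    # Assumption: when a value repeats, the highest-scoring row wins.
    ranked = df.sort_values("prediction", ascending=False, kind="stable")
    if relation in ("1:m", "1:1"):
        ranked = ranked.drop_duplicates(subset="left", keep="first")
    if relation in ("m:1", "1:1"):
        ranked = ranked.drop_duplicates(subset="right", keep="first")
    # Restore the original row order.
    return ranked.sort_index()


pairs = pd.DataFrame(
    {"left": [1, 1, 2], "right": [10, 11, 10], "prediction": [0.95, 0.91, 0.99]}
)
one_to_many = enforce_relation(pairs, "1:m")
one_to_one = enforce_relation(pairs, "1:1")
```

For ‘1:1’ this sketch deduplicates ‘left’ first and ‘right’ second, which is a simple greedy choice and can drop pairs that a global assignment would keep.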
- class neer_match_utilities.panel.SetupData(matches=None)[source]
A class for processing and preparing data with overlapping matches and panel relationships.
- matches
A list of tuples representing matches.
- Type:
list
- __init__(matches=None)[source]
Initialize the SetupData class.
- Parameters:
matches (list, optional) – A list of tuples representing matches. Defaults to None, in which case an empty list is used.
- adjust_overlap(dfm)[source]
Adjusts the overlap in the matches DataFrame by generating additional ordered pairs from connected components in the match graph.
This function takes a DataFrame containing match pairs in the ‘left’ and ‘right’ columns. It constructs a full connection graph from these matches, computes connected components using depth-first search (DFS), and then, for each connected component with more than one element, generates all ordered pairs (i.e. permutations) of distinct IDs. These new pairs are appended to the original DataFrame.
- Parameters:
dfm (pd.DataFrame) – A DataFrame with columns ‘left’ and ‘right’ representing the matching pairs, where each entry is a unique identifier.
- Returns:
A combined DataFrame that includes both the original match pairs and the newly generated pairs from connected components. Note that pairs of the form (A, A) are not generated, and for any distinct IDs A and B, both (A, B) and (B, A) may appear. Duplicates are not removed.
- Return type:
pd.DataFrame
Notes
Connected components are determined by treating the match pairs as edges in an undirected graph.
The function uses permutations of length 2 to generate ordered pairs, ensuring that only pairs of distinct IDs are created.
The new pairs are simply appended to the original DataFrame without dropping duplicates.
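The procedure described above (undirected graph, DFS components, length-2 permutations, append without deduplication) can be sketched as follows. This is a minimal standalone illustration, not the library's source; the function name `adjust_overlap_sketch` is hypothetical.

```python
from itertools import permutations

import pandas as pd


def adjust_overlap_sketch(dfm: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical re-creation of the steps described for adjust_overlap."""
    # Build an undirected adjacency map from the match pairs.
    graph = {}
    for left, right in zip(dfm["left"], dfm["right"]):
        graph.setdefault(left, set()).add(right)
        graph.setdefault(right, set()).add(left)

    seen = set()
    new_pairs = []
    for start in graph:
        if start in seen:
            continue
        # Iterative depth-first search collecting one connected component.
        stack, component = [start], []
        seen.add(start)
        while stack:
            node = stack.pop()
            component.append(node)
            for neighbour in graph[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        if len(component) > 1:
            # Ordered pairs of distinct IDs; (A, A) is never produced.
            new_pairs.extend(permutations(sorted(component), 2))

    if not new_pairs:
        return dfm.copy()
    extra = pd.DataFrame(new_pairs, columns=["left", "right"])
    # Append without dropping duplicates, as the notes describe.
    return pd.concat([dfm, extra], ignore_index=True)


matches = pd.DataFrame({"left": [1, 2, 5], "right": [2, 3, 6]})
expanded = adjust_overlap_sketch(matches)
```

In this example the component {1, 2, 3} contributes six ordered pairs and {5, 6} contributes two, so pairs such as (1, 3) and (3, 1) appear even though they were never matched directly.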
- create_connected_groups(df_dict, matches)[source]
Create a list of lists where sublists contain connected values as one group.
- Parameters:
df_dict (dict) – A dictionary where keys are integers and values are lists of integers.
matches (list of tuple of int) – A list of tuples representing connections between values.
- Returns:
A list of lists with connected values grouped together.
- Return type:
list of list of int
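One possible reading of this method is sketched below with a union-find structure. The interpretation is an assumption: each list in `df_dict` is treated as an initial group, and every tuple in `matches` links two values, merging their groups. The function name `connected_groups_sketch` is hypothetical, and the library may order or build its groups differently.

```python
def connected_groups_sketch(df_dict, matches):
    """Hypothetical grouping: merge value lists that are linked by matches."""
    parent = {}

    def find(x):
        # Locate the representative of x, compressing the path on the way.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Assumption: values that share a list already belong together.
    for values in df_dict.values():
        for value in values:
            find(value)
        for a, b in zip(values, values[1:]):
            union(a, b)

    # Each match connects two values, and therefore their groups.
    for a, b in matches:
        union(a, b)

    groups = {}
    for value in list(parent):
        groups.setdefault(find(value), []).append(value)
    return [sorted(group) for group in groups.values()]


groups = connected_groups_sketch({0: [1, 2], 1: [3], 2: [4, 5]}, [(2, 3)])
```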
- data_preparation_panel(df_panel, unique_id, panel_id=None)[source]
Prepare data by handling overlaps, panel combinations, and duplicates.
This function converts the unique identifier column to a numeric type (if possible), then prepares match pairs by adjusting overlaps and dropping duplicate pairs. It then extracts the subset of the panel data corresponding to the matched IDs, and finally sorts the left and right DataFrames according to the unique identifier.
- Parameters:
df_panel (pd.DataFrame) – Panel DataFrame containing IDs and panel information.
unique_id (str) – Column name of unique identifiers in df_panel.
panel_id (str, optional) – Column name of panel identifiers in df_panel.
- Returns:
A tuple of three DataFrames: left, right, and the final matches.
- Return type:
tuple
- static drop_repetitions(df)[source]
Remove duplicate pairs in the DataFrame irrespective of the order of elements.
This function treats each pair in the ‘left’ and ‘right’ columns as unordered. It creates a temporary column ‘sorted_pair’ that contains a sorted tuple of the ‘left’ and ‘right’ values for each row. For example, the pairs (A, B) and (B, A) will both be transformed into (A, B), and only one instance will be retained. The function then drops duplicate rows based on this sorted pair and removes the temporary column before returning the result.
- Parameters:
df (pd.DataFrame) – A DataFrame containing at least the columns ‘left’ and ‘right’, which represent paired elements.
- Returns:
A DataFrame in which duplicate pairs (ignoring order) have been removed.
- Return type:
pd.DataFrame
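The sorted-pair technique described above can be sketched in a few lines. This is an illustration rather than the library's code; the function name `drop_repetitions_sketch` is hypothetical, and keeping the first occurrence of each pair is an assumption (it is the pandas `drop_duplicates` default).

```python
import pandas as pd


def drop_repetitions_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical re-creation: drop pairs that repeat in either order."""
    out = df.copy()
    # Temporary key: (A, B) and (B, A) both become the same sorted tuple.
    out["sorted_pair"] = [
        tuple(sorted(pair)) for pair in zip(out["left"], out["right"])
    ]
    # Keep the first row per unordered pair, then remove the helper column.
    return out.drop_duplicates(subset="sorted_pair").drop(columns="sorted_pair")


pairs = pd.DataFrame({"left": [1, 2, 3], "right": [2, 1, 4]})
deduplicated = drop_repetitions_sketch(pairs)
```

Here the row (2, 1) is dropped because it repeats (1, 2) in reverse order, while (3, 4) is kept.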
- panel_preparation(dfm, df_panel, unique_id, panel_id)[source]
Generate combinations of IDs for each panel and append them to the DataFrame.
- Parameters:
dfm (pd.DataFrame) – DataFrame to append combinations to.
df_panel (pd.DataFrame) – Panel DataFrame containing IDs and panel information.
unique_id (str) – Column name of unique identifiers in df_panel.
panel_id (str) – Column name of panel identifiers in df_panel.
- Returns:
Updated DataFrame with appended combinations.
- Return type:
pd.DataFrame
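As a rough illustration of generating per-panel ID combinations, the sketch below groups by the panel column and appends every unordered pair of IDs within a panel. The function name `panel_pairs_sketch` is hypothetical, and two details are assumptions: that the pairs are written to ‘left’ and ‘right’ columns, and that unordered combinations (rather than ordered pairs) are meant.

```python
from itertools import combinations

import pandas as pd


def panel_pairs_sketch(dfm, df_panel, unique_id, panel_id):
    """Hypothetical helper: append all within-panel ID pairs to dfm."""
    pairs = []
    for _, ids in df_panel.groupby(panel_id)[unique_id]:
        # Panels with a single ID yield no combinations.
        pairs.extend(combinations(sorted(ids.unique()), 2))
    if not pairs:
        return dfm.copy()
    extra = pd.DataFrame(pairs, columns=["left", "right"])
    return pd.concat([dfm, extra], ignore_index=True)


panel = pd.DataFrame(
    {"uid": [1, 2, 3, 4, 5], "firm": ["A", "A", "A", "B", "C"]}
)
existing = pd.DataFrame({"left": [1], "right": [2]})
combined = panel_pairs_sketch(existing, panel, unique_id="uid", panel_id="firm")
```

Because nothing is deduplicated in this sketch, the pair (1, 2) appears twice in `combined`: once from `existing` and once from panel "A".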