remove_duplicates.remove_duplicates
remove_duplicates.remove_duplicates(responses, id_col, datetime_col)
Remove duplicate responses from a DataFrame containing survey data.
Parameters
| Name | Type | Description | Default |
|---|---|---|---|
| responses | pd.DataFrame | Pandas DataFrame to identify duplicate responses in. | required |
| id_col | str | Name of the column with the unique identifiers. | required |
| datetime_col | str | Name of the column containing the datetime when the survey was completed. | required |
Returns
| Type | Description |
|---|---|
| pd.DataFrame | Cleaned, shuffled survey data containing only the most recent entry from each individual. |
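The "most recent entry from each individual" rule can be illustrated with plain pandas. The helper below, `keep_latest_per_id`, is a hypothetical name used only for this sketch; it is not the library's implementation, and it covers only the deduplication step (it does not shuffle the rows or validate its inputs).

```python
import pandas as pd


def keep_latest_per_id(responses: pd.DataFrame, id_col: str, datetime_col: str) -> pd.DataFrame:
    """Hypothetical helper: keep each respondent's most recent row."""
    # Sort chronologically so the newest submission per identifier comes last.
    ordered = responses.sort_values(datetime_col)
    # Drop earlier submissions, keeping only the last row for each identifier.
    return ordered.drop_duplicates(subset=id_col, keep="last")


df = pd.DataFrame({
    "respondent_id": [1, 2, 1, 3],
    "completed_at": pd.to_datetime([
        "2024-01-01 10:00", "2024-01-01 11:00",
        "2024-01-01 12:00", "2024-01-01 13:00",
    ]),
    "answer": ["Yes", "No", "Maybe", "Yes"],
})
latest = keep_latest_per_id(df, "respondent_id", "completed_at")
```

Here respondent 1 submitted twice, so only the 12:00 row survives for that identifier; the other respondents keep their single row.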
Raises
| Type | Description |
|---|---|
| TypeError | If responses is not a pandas DataFrame, or if id_col or datetime_col is not a string. |
| KeyError | If id_col or datetime_col does not exist in the DataFrame columns. |
| ValueError | If id_col contains null values. |
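One way the documented error conditions could be checked is sketched below. `validate_inputs` is a hypothetical function written for illustration; the order of the checks and the message wording are assumptions, not taken from the library.

```python
import pandas as pd


def validate_inputs(responses, id_col, datetime_col):
    """Hypothetical validator mirroring the documented exceptions."""
    # TypeError: wrong argument types.
    if not isinstance(responses, pd.DataFrame):
        raise TypeError("responses must be a pandas DataFrame")
    for name, value in (("id_col", id_col), ("datetime_col", datetime_col)):
        if not isinstance(value, str):
            raise TypeError(f"{name} must be a string")
    # KeyError: a named column is missing from the DataFrame.
    for column in (id_col, datetime_col):
        if column not in responses.columns:
            raise KeyError(f"{column!r} is not a column of responses")
    # ValueError: identifiers must not contain nulls.
    if responses[id_col].isna().any():
        raise ValueError(f"{id_col!r} contains null values")


bad = pd.DataFrame({
    "respondent_id": [1, None],
    "completed_at": pd.to_datetime(["2024-01-01", "2024-01-02"]),
})
try:
    validate_inputs(bad, "respondent_id", "completed_at")
    error_name = None
except ValueError as err:
    # A null identifier triggers the ValueError branch.
    error_name = type(err).__name__
```

Callers that load survey exports of uncertain quality may want to catch these three exception types separately, since each points to a different kind of problem (wrong argument, wrong column name, or dirty data).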
Examples
>>> import pandas as pd
>>> df = pd.DataFrame({
...     'respondent_id': [1, 2, 1, 3],
...     'completed_at': ['2024-01-01 10:00', '2024-01-01 11:00',
...                      '2024-01-01 12:00', '2024-01-01 13:00'],
...     'answer': ['Yes', 'No', 'Maybe', 'Yes']
... })
>>> df['completed_at'] = pd.to_datetime(df['completed_at'])
>>> remove_duplicates(df, 'respondent_id', 'completed_at')
   respondent_id        completed_at answer
1              2 2024-01-01 11:00:00     No
2              1 2024-01-01 12:00:00  Maybe
3              3 2024-01-01 13:00:00    Yes