remove_duplicates.remove_duplicates

remove_duplicates.remove_duplicates(responses, id_col, datetime_col)

Remove duplicate responses from a DataFrame containing survey data.

Parameters

Name Type Description Default
responses pd.DataFrame DataFrame of survey responses to deduplicate. required
id_col str Name of the column that identifies each respondent. required
datetime_col str Name of the column containing the date and time at which each response was completed. required

Returns

Name Type Description
pd.DataFrame Cleaned, shuffled survey data containing only the most recent entry from each individual.

Raises

Name Type Description
TypeError If responses is not a pandas DataFrame, or if id_col or datetime_col is not a string.
KeyError If id_col or datetime_col does not exist in the DataFrame columns.
ValueError If id_col contains null values.

Examples

>>> import pandas as pd
>>> df = pd.DataFrame({
...     'respondent_id': [1, 2, 1, 3],
...     'completed_at': ['2024-01-01 10:00', '2024-01-01 11:00',
...                      '2024-01-01 12:00', '2024-01-01 13:00'],
...     'answer': ['Yes', 'No', 'Maybe', 'Yes']
... })
>>> df['completed_at'] = pd.to_datetime(df['completed_at'])
>>> remove_duplicates(df, 'respondent_id', 'completed_at')
   respondent_id        completed_at answer
1              2 2024-01-01 11:00:00     No
2              1 2024-01-01 12:00:00  Maybe
3              3 2024-01-01 13:00:00    Yes
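
The behaviour documented above can be approximated with plain pandas. The following is a minimal sketch, not the library's actual implementation: the name remove_duplicates_sketch and the error messages are hypothetical, and it assumes "most recent" means the row with the largest value in datetime_col for each identifier. The shuffling mentioned under Returns is left out so that the result stays deterministic.

```python
import pandas as pd


def remove_duplicates_sketch(responses, id_col, datetime_col):
    """Hypothetical stand-in that mirrors the documented contract."""
    # Type checks, matching the documented TypeError cases.
    if not isinstance(responses, pd.DataFrame):
        raise TypeError("responses must be a pandas DataFrame")
    if not isinstance(id_col, str):
        raise TypeError("id_col must be a string")
    if not isinstance(datetime_col, str):
        raise TypeError("datetime_col must be a string")

    # Column checks, matching the documented KeyError cases.
    for col in (id_col, datetime_col):
        if col not in responses.columns:
            raise KeyError(f"column {col!r} not found in DataFrame")

    # Null identifiers cannot be matched to a respondent (documented ValueError).
    if responses[id_col].isna().any():
        raise ValueError(f"column {id_col!r} contains null values")

    # Sort oldest to newest (stable sort), then keep the last row per identifier,
    # which is the most recent response from each respondent.
    ordered = responses.sort_values(datetime_col, kind="mergesort")
    return ordered.drop_duplicates(subset=id_col, keep="last")


df = pd.DataFrame({
    "respondent_id": [1, 2, 1, 3],
    "completed_at": pd.to_datetime([
        "2024-01-01 10:00", "2024-01-01 11:00",
        "2024-01-01 12:00", "2024-01-01 13:00",
    ]),
    "answer": ["Yes", "No", "Maybe", "Yes"],
})
latest = remove_duplicates_sketch(df, "respondent_id", "completed_at")
print(latest)
```

On the example data this keeps the rows with index 1, 2 and 3, the same rows as in the output shown above. Sorting before drop_duplicates matters: without it, keep="last" would retain the last row in file order rather than the most recent response.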