remove_duplicates

Functions

Name	Description
remove_duplicates	Identifies and removes duplicate rows from a given DataFrame. Optionally, returns a summary report.

remove_duplicates

remove_duplicates.remove_duplicates(df, cols=None, keep='first', report=False)

Identifies and removes duplicate rows from a given DataFrame. Optionally, returns a summary report of the duplicate rows, including the total number of rows in the input, the number of duplicate rows detected, the number of rows removed, the keep strategy used, and the columns used for duplicate detection.

Parameters

Name	Type	Description	Default
df	pandas.DataFrame	Input DataFrame to check for and remove duplicate rows from.	required
cols	list of str or None	Optional parameter. Subset of column names to consider when identifying duplicates. If None, all columns are used to identify duplicate rows.	`None`
keep	{"first", "last", False}	Optional parameter. Determines which duplicate rows to keep: - "first": keep the first occurrence and remove subsequent duplicates - "last": keep the last occurrence and remove earlier duplicates - False: remove every row that has a duplicate, keeping none of them	`"first"`
report	bool	Optional parameter. If True, returns a summary report containing information about duplicate rows.	`False`
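
The three `keep` options can be illustrated with plain pandas. This is a sketch that assumes the options behave like the `keep` argument of `pandas.DataFrame.drop_duplicates`; the page does not state how `remove_duplicates` is implemented internally.

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 1, 2, 3], "value": ["A", "A", "B", "C"]})

# Assumption: these pandas calls mirror the documented keep strategies.
kept_first = df.drop_duplicates(keep="first")  # keeps row 0, drops row 1
kept_last = df.drop_duplicates(keep="last")    # keeps row 1, drops row 0
kept_none = df.drop_duplicates(keep=False)     # drops both row 0 and row 1

first_index = kept_first.index.tolist()
last_index = kept_last.index.tolist()
none_index = kept_none.index.tolist()
print(first_index, last_index, none_index)
# [0, 2, 3] [1, 2, 3] [2, 3]
```

Note that the original index labels are preserved, which is why the cleaned DataFrames in the Examples section do not start at 0 or skip a label.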

Returns

Name	Type	Description
	pandas.DataFrame	The input DataFrame with duplicate rows removed.
	(dict, optional)	If `report` is True, a dictionary summarizing duplicate detection results is returned. The dictionary output includes: - total_rows: int, number of rows in the original input DataFrame - duplicate_rows: int, number of duplicate rows detected - rows_removed: int, number of rows removed - strategy: keep strategy used to remove duplicates, if any - cols_used: list of str or None(i.e. all), columns used for duplicate detection

Raises

Name	Type	Description
	TypeError	If df is not a pandas DataFrame.
	KeyError	If any column in cols does not exist in the DataFrame.
	ValueError	If keep is not one of {"first", "last", False}.
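
To make the documented contract concrete, here is a minimal sketch of a function with the same parameters, return values, and exceptions. The name `remove_duplicates_sketch` is hypothetical, and the body is an assumption built on `pandas.DataFrame.duplicated`; it is not the package's actual source. In particular, counting `duplicate_rows` as the number of rows flagged for removal is a guess.

```python
import pandas as pd


def remove_duplicates_sketch(df, cols=None, keep="first", report=False):
    """Hypothetical re-implementation of the documented behaviour."""
    if not isinstance(df, pd.DataFrame):
        raise TypeError("df must be a pandas DataFrame")
    if keep not in ("first", "last", False):
        raise ValueError('keep must be "first", "last", or False')
    if cols is not None:
        missing = [c for c in cols if c not in df.columns]
        if missing:
            raise KeyError(f"columns not found in DataFrame: {missing}")

    # Flag the rows that would be dropped under the chosen strategy.
    dup_mask = df.duplicated(subset=cols, keep=keep)
    cleaned = df[~dup_mask]
    if not report:
        return cleaned

    summary = {
        "total_rows": len(df),
        # Assumption: "duplicate rows" means the rows flagged above.
        "duplicate_rows": int(dup_mask.sum()),
        "rows_removed": len(df) - len(cleaned),
        "strategy": keep,
        "cols_used": list(cols) if cols is not None else None,
    }
    return cleaned, summary


df = pd.DataFrame({"id": [1, 1, 2, 3], "value": ["A", "A", "B", "C"]})
cleaned, summary = remove_duplicates_sketch(
    df, cols=["id"], keep="last", report=True
)
```

Validating the arguments before touching the data keeps the error messages specific to the three documented failure cases.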

Notes

This function is intended for early-stage data cleaning and EDA processes.
Missing values are considered in duplicate detection. If two or more rows contain missing values in the same places, they are still considered duplicates.
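
The note on missing values matches how pandas itself compares rows: `NaN` entries in the same positions are treated as equal when detecting duplicates. The snippet below shows this with plain pandas, on the assumption that `remove_duplicates` inherits the behaviour from pandas.

```python
import pandas as pd

nan = float("nan")
df = pd.DataFrame({"id": [1, 1, 2], "value": [nan, nan, 2.0]})

# Rows 0 and 1 have the same id and a missing value in the same column,
# so pandas flags row 1 as a duplicate of row 0.
flags = df.duplicated().tolist()
remaining = df.drop_duplicates().index.tolist()
print(flags, remaining)
# [False, True, False] [0, 2]
```

If rows with missing values should be kept apart, they need to be handled before deduplication, for example by filtering them out first.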

Examples

>>> import pandas as pd
>>> df = pd.DataFrame({
...     "id": [1, 1, 2, 3],
...     "value": ["A", "A", "B", "C"]
... })

Remove duplicate rows using all columns:

>>> cleaned_df = remove_duplicates(df)
>>> cleaned_df
   id value
0   1     A
2   2     B
3   3     C

Remove duplicates based on some specified columns and return a summary report:

>>> cleaned_df, report = remove_duplicates(df, cols=["id"], keep="last", report=True)
>>> cleaned_df
   id value
1   1     A
2   2     B
3   3     C

>>> report
{'total_rows': 4,
 'duplicate_rows': 1,
 'rows_removed': 1,
 'strategy': 'last',
 'cols_used': ['id']}