find_duplicates

Functions

Name	Description
find_duplicates	Identify duplicate rows in a pandas DataFrame.

find_duplicates

find_duplicates.find_duplicates(data, subset=None, keep='first')

Identify duplicate rows in a pandas DataFrame.

Returns the rows of `data` that duplicate another row. Two rows are duplicates when they hold identical values in every column listed in `subset`, or in all columns when `subset` is None. Comparison follows pandas equality and NaN-handling semantics; in `pandas.DataFrame.duplicated`, for example, missing values in the same column are treated as equal to one another.
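To make the NaN-handling point concrete, here is a minimal sketch in plain pandas rather than `find_duplicates` itself. It assumes the function defers to `DataFrame.duplicated` for row comparison, which this page suggests but does not state; the DataFrame is illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1.0, np.nan, np.nan], "B": ["x", "y", "y"]})

# Under ordinary comparison, NaN is not equal to NaN.
nan_equal = df.loc[1, "A"] == df.loc[2, "A"]

# DataFrame.duplicated, however, treats missing values in the same
# column as matching, so row 2 is flagged as a duplicate of row 1.
mask = df.duplicated()
```

Under that assumption, rows that differ only by NaN placement in the same columns would be reported as duplicates.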

Parameters

Name	Type	Description	Default
data	pandas.DataFrame	Input DataFrame to check for duplicate rows.	required
subset	list of str or None	Column names to consider when identifying duplicates. If None, all columns are used. Every name must exist in `data`, and the list must not be empty.	`None`
keep	{'first', 'last', False}	Determines which duplicates are returned: 'first' returns every duplicate except the first occurrence; 'last' returns every duplicate except the last occurrence; False returns all duplicate rows.	`'first'`
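The `subset` and `keep` options match those of `pandas.DataFrame.duplicated`. The sketch below uses plain pandas on an illustrative DataFrame (not taken from this page) to show which rows each option would flag, assuming `find_duplicates` selects rows with an equivalent mask.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 1, 2, 1], "B": [3, 3, 4, 5]})

# Across all columns, only rows 0 and 1 are identical.
mask_first = df.duplicated(keep="first")  # flags row 1
mask_last = df.duplicated(keep="last")    # flags row 0
mask_all = df.duplicated(keep=False)      # flags rows 0 and 1

# Comparing on column "A" alone also pulls in row 3.
mask_subset = df.duplicated(subset=["A"], keep="first")  # flags rows 1 and 3
```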

Returns

Name	Type	Description
	pandas.DataFrame	A new DataFrame containing only the rows identified as duplicates, with the index reset to a default RangeIndex. If no duplicate rows are found, an empty DataFrame with the same columns as `data` is returned.
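The shape of the return value can be sketched in plain pandas. This is an assumption about how the documented behavior could be reproduced (boolean mask followed by `reset_index(drop=True)`), not a statement of how `find_duplicates` is implemented.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 1, 2], "B": [3, 3, 4]})

# The duplicate row had label 1 in df; after the reset it is labelled 0.
dupes = df[df.duplicated(keep="first")].reset_index(drop=True)

# With no duplicates, the result has zero rows but keeps the columns.
unique_df = pd.DataFrame({"A": [1, 2], "B": [3, 4]})
empty = unique_df[unique_df.duplicated()].reset_index(drop=True)
```

Because the index is reset, the original row labels are not recoverable from the result.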

Raises

Name	Type	Description
	TypeError	If `data` is not a pandas.DataFrame, or if `subset` is neither None nor a list of strings.
	ValueError	If `subset` contains columns not present in `data`, or if `keep` is not one of {'first', 'last', False}.
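The checks listed above could be arranged roughly as follows. `find_duplicates_sketch` is a hypothetical stand-in written for illustration; the real function's messages and check order may differ. The empty-`subset` case is left out because this page does not say which exception it raises.

```python
import pandas as pd


def find_duplicates_sketch(data, subset=None, keep="first"):
    """Hypothetical re-implementation of the documented contract."""
    if not isinstance(data, pd.DataFrame):
        raise TypeError("data must be a pandas DataFrame")
    if subset is not None:
        is_str_list = isinstance(subset, list) and all(
            isinstance(col, str) for col in subset
        )
        if not is_str_list:
            raise TypeError("subset must be None or a list of str")
        missing = [col for col in subset if col not in data.columns]
        if missing:
            raise ValueError(f"columns not found in data: {missing}")
    # Use `is False` so that 0 is not accepted in place of False.
    if not (keep is False or keep in ("first", "last")):
        raise ValueError("keep must be 'first', 'last', or False")
    mask = data.duplicated(subset=subset, keep=keep)
    return data[mask].reset_index(drop=True)


df = pd.DataFrame({"A": [1, 1, 2], "B": [3, 3, 4]})
result = find_duplicates_sketch(df, subset=["A"], keep=False)
```

Type checks run before value checks here, so a malformed `subset` is reported as a TypeError even when `keep` is also invalid.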

Examples

>>> import pandas as pd
>>> df = pd.DataFrame({
...     "A": [1, 1, 2],
...     "B": [3, 3, 4]
... })
>>> find_duplicates(df)
   A  B
0  1  3

>>> find_duplicates(df, keep=False)
   A  B
0  1  3
1  1  3