find_duplicates

Functions

| Name | Description |
| --- | --- |
| find_duplicates | Identify duplicate rows in a pandas DataFrame. |

find_duplicates

find_duplicates.find_duplicates(data, subset=None, keep='first')

Identify duplicate rows in a pandas DataFrame.

This function returns the rows that duplicate another row on the specified subset of columns (all columns when subset is None). Two rows are duplicates if they have identical values across those columns, following pandas equality and NaN-handling semantics.
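The NaN-handling part is easy to overlook. In pandas' own duplicate detection (`DataFrame.duplicated`), NaN values in the same column are treated as equal to each other. The snippet below is a minimal illustration of that pandas behavior only; it assumes, without confirmation from this page, that `find_duplicates` defers to the same rules.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1.0, np.nan, np.nan], "B": ["x", "y", "y"]})

# Rows 1 and 2 match on every column, including the NaN in "A",
# so the second occurrence is flagged as a duplicate.
mask = df.duplicated()
print(mask.tolist())  # [False, False, True]
```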

Parameters

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| data | pandas.DataFrame | Input DataFrame to check for duplicate rows. | required |
| subset | list of str or None | Column names to consider when identifying duplicates. If None, all columns are used. If a list is given, it must not be empty and every name must exist in data. | None |
| keep | {'first', 'last', False} | Determines which duplicates are returned. 'first': return duplicates except for the first occurrence. 'last': return duplicates except for the last occurrence. False: return all duplicate rows. | 'first' |
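The effect of `subset` and `keep` can be seen with the boolean masks that pandas' `DataFrame.duplicated` produces for the same arguments. This is a sketch using pandas directly, on the assumption that `find_duplicates` selects the rows where such a mask is True; the sample data is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 1, 2, 1], "B": [3, 3, 4, 5]})

# Rows 0 and 1 are identical on all columns; row 3 only shares "A" with them.
first_mask = df.duplicated(keep="first")   # [False, True, False, False]
last_mask = df.duplicated(keep="last")     # [True, False, False, False]
all_mask = df.duplicated(keep=False)       # [True, True, False, False]

# Restricting the comparison to column "A" also flags row 3.
subset_mask = df.duplicated(subset=["A"], keep="first")  # [False, True, False, True]
```

With `keep=False` every member of a duplicate group is selected, which is useful when the first or last occurrence should be inspected alongside its copies.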

Returns

| Name | Type | Description |
| --- | --- | --- |
| | pandas.DataFrame | A new DataFrame containing only the rows identified as duplicates, with the index reset to a default RangeIndex. If no duplicate rows are found, an empty DataFrame with the same columns as data is returned. |
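The return shape described above can be reproduced with plain pandas calls. This is a hedged sketch, not the function's actual implementation: it assumes the result is built by filtering on a `duplicated` mask and calling `reset_index(drop=True)`.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 1, 2], "B": [3, 3, 4]})

# The duplicate originally sits at index 1; after the reset it is at index 0.
dupes = df[df.duplicated(keep="first")].reset_index(drop=True)

# With no duplicates, the filter yields zero rows but keeps the columns.
unique = pd.DataFrame({"A": [1, 2], "B": [3, 4]})
empty = unique[unique.duplicated()].reset_index(drop=True)
```

Because the index is reset, the original row positions are not recoverable from the result; compute the mask on the input DataFrame when they are needed.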

Raises

| Name | Type | Description |
| --- | --- | --- |
| | TypeError | If data is not an instance of pandas.DataFrame, or if subset is neither None nor a list of strings. |
| | ValueError | If subset contains columns not present in data, or if keep is not one of {'first', 'last', False}. |
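The documented error conditions could be checked along the following lines. `validate_find_duplicates_args` is a hypothetical helper written for illustration, not part of the documented API, and the error messages are invented. The page says an empty subset is not allowed but does not say which exception it raises, so the sketch leaves that case out.

```python
import pandas as pd


def validate_find_duplicates_args(data, subset=None, keep="first"):
    """Hypothetical helper mirroring the documented TypeError/ValueError cases."""
    if not isinstance(data, pd.DataFrame):
        raise TypeError("data must be a pandas.DataFrame")

    if subset is not None:
        # subset must be a list whose elements are all strings.
        if not isinstance(subset, list) or not all(isinstance(c, str) for c in subset):
            raise TypeError("subset must be None or a list of str")
        missing = [c for c in subset if c not in data.columns]
        if missing:
            raise ValueError(f"columns not found in data: {missing}")

    # Use an identity check for False so that 0 is not silently accepted.
    if not (keep is False or keep in ("first", "last")):
        raise ValueError("keep must be 'first', 'last', or False")
```

The identity check on `False` is a design choice of this sketch: a plain `keep in ("first", "last", False)` would also accept `0`, since `0 == False` in Python.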

Examples

>>> import pandas as pd
>>> df = pd.DataFrame({
...     "A": [1, 1, 2],
...     "B": [3, 3, 4]
... })
>>> find_duplicates(df)
   A  B
0  1  3
>>> find_duplicates(df, keep=False)
   A  B
0  1  3
1  1  3