find_duplicates
Functions
| Name | Description |
|---|---|
| find_duplicates | Identify duplicate rows in a pandas DataFrame. |
find_duplicates
find_duplicates.find_duplicates(data, subset=None, keep='first')

Identify duplicate rows in a pandas DataFrame.
This function returns the rows that are considered duplicates according to the specified subset of columns. Rows are considered duplicates if they have identical values across the specified columns, following pandas equality and NaN-handling semantics.
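The NaN-handling semantics mentioned above can be illustrated with pandas itself. This is a minimal sketch that assumes the function relies on the behavior of `DataFrame.duplicated`; the source does not confirm which pandas call is used internally, and the sample data is made up for illustration.

```python
import numpy as np
import pandas as pd

# Illustrative data: rows 1 and 2 share a NaN in column "A".
df = pd.DataFrame({"A": [1.0, np.nan, np.nan], "B": ["x", "y", "y"]})

# Plain comparison says NaN is not equal to NaN...
nan_equals_nan = bool(np.nan == np.nan)

# ...but DataFrame.duplicated treats NaN values in the same column as
# matching, so the third row is flagged as a duplicate of the second.
mask = df.duplicated()
```

Under that assumption, rows whose only difference is "NaN versus NaN" would be reported as duplicates.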
Parameters
| Name | Type | Description | Default |
|---|---|---|---|
| data | pandas.DataFrame | Input DataFrame to check for duplicate rows. | required |
| subset | list of str | List of column names to consider when identifying duplicates. If None, all columns are used. All column names must exist in data, and the list must not be empty. | None |
| keep | {'first', 'last', False} | Determines which duplicates are returned. 'first': return duplicates except for the first occurrence. 'last': return duplicates except for the last occurrence. False: return all duplicate rows. | 'first' |
Returns
| Name | Type | Description |
|---|---|---|
| | pandas.DataFrame | A new DataFrame containing only the rows identified as duplicates, with the index reset to a default RangeIndex. If no duplicate rows are found, an empty DataFrame with the same columns as data is returned. |
Raises
| Name | Type | Description |
|---|---|---|
| | TypeError | If data is not an instance of pandas.DataFrame, or if subset is not None and is not a list of strings. |
| | ValueError | If subset contains columns not present in data, or if keep is not one of {'first', 'last', False}. |
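The contract described in the Parameters, Returns, and Raises tables can be sketched as follows. This is a hypothetical reimplementation, not the package's actual source: the name `find_duplicates_sketch`, the error messages, and the use of `DataFrame.duplicated` are all assumptions made for illustration.

```python
import pandas as pd


def find_duplicates_sketch(data, subset=None, keep="first"):
    """Hypothetical sketch of the documented behavior (not the real source)."""
    if not isinstance(data, pd.DataFrame):
        raise TypeError("data must be a pandas DataFrame")

    if subset is not None:
        if not isinstance(subset, list) or not all(
            isinstance(col, str) for col in subset
        ):
            raise TypeError("subset must be None or a list of strings")
        missing = [col for col in subset if col not in data.columns]
        if missing:
            raise ValueError(f"columns not found in data: {missing}")
        # The documentation says subset must not be empty but does not name
        # the exception raised, so this sketch leaves that case unhandled.

    # Test False by identity so that 0 is not accepted in its place.
    if not (keep is False or keep in ("first", "last")):
        raise ValueError("keep must be 'first', 'last', or False")

    # Boolean mask of duplicate rows, then a fresh RangeIndex on the result.
    mask = data.duplicated(subset=subset, keep=keep)
    return data[mask].reset_index(drop=True)
```

When no row is flagged, `data[mask]` is an empty DataFrame that still carries the columns of `data`, which matches the documented return value for the no-duplicates case.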
Examples
>>> import pandas as pd
>>> df = pd.DataFrame({
...     "A": [1, 1, 2],
...     "B": [3, 3, 4]
... })
>>> find_duplicates(df)
   A  B
1  1  3
>>> find_duplicates(df, keep=False)
   A  B
0  1  3
1  1  3