detect_anomalies

detect_anomalies

Functions

Name	Description
detect_anomalies	This function flags outliers in numeric columns using either the

detect_anomalies

detect_anomalies.detect_anomalies(df, method='zscore')

This function flags outliers in numeric columns using either the Z-score method or the IQR method.

Outliers in a dataset can heavily impact our analysis negatively. This function helps to identify potential anomalies in numeric columns of a pandas DataFrame, whether the data is normally distributed or skewed.

The function takes in a DataFrame, automatically identifies numeric columns, and applies the specified method to flag outliers. Each numeric column is analyzed independently to detect anomalous values.

The z-score method is suitable for normally distributed data but sensitive to extreme outliers, while the IQR method is better for skewed distributions and robust to extreme values.

Parameters

Name	Type	Description	Default
df	pd.DataFrame	The DataFrame containing numeric columns to analyze for anomalies. Non-numeric columns will be excluded from the analysis.	required
method	str	The anomaly detection method to use. Valid options are: - ‘zscore’ : For normally distributed data - ‘iqr’ : For skewed data or robust detection	`'zscore'`

Returns

Name	Type	Description
	tuple of (pd.DataFrame, float)	result_df : pd.DataFrame A DataFrame containing only the numeric columns plus additional boolean columns (named as ’{column}_outlier’) indicating whether each value is an outlier. True indicates an outlier, False indicates a normal value. outlier_percentage : float The percentage of outliers detected across all numeric columns, calculated as (total outliers / total values) * 100.

Raises

Name	Type	Description
	ValueError	If method is not ‘zscore’ or ‘iqr’, or if any numeric column contains fewer than 3 data points, or if any numeric column contains NaN values.
	TypeError	If df is not a pandas DataFrame or contains no numeric columns.

Notes

Z-score method: Identifies points that are more than 2 standard deviations away from the mean. The z-score is calculated as: z = (x - mean) / std. A data point is flagged as an outlier if |z| > 2.

IQR method: Uses the interquartile range to identify outliers. Calculates Q1 (25th percentile) and Q3 (75th percentile), then IQR = Q3 - Q1. A data point is flagged as an outlier if it falls below Q1 - 1.5×IQR or above Q3 + 1.5×IQR.

Assumptions: - Requires at least 3 data points per numeric column for meaningful analysis - Non-numeric columns are automatically excluded - Missing values (NaN) are not flagged as outliers but will raise an error if present in numeric columns.

Examples

>>> import pandas as pd
>>> # Create sample data with clear outliers
>>> df = pd.DataFrame({
...     'temperature': [20, 21, 22, 19, 98, 23],
...     'humidity': [45, 50, 48, 52, 49, 200],
...     'location': ['A', 'B', 'C', 'D', 'E', 'F']
... })
>>> result_df, pct = detect_anomalies(df, method='zscore')
>>> print(result_df)
   temperature  humidity  temperature_outlier  humidity_outlier
0           20        45                False             False
1           21        50                False             False
2           22        48                False             False
3           19        52                False             False
4           98        49                 True             False
5           23       200                False              True
>>> print(f"{pct:.1f}% of data points are outliers")
16.7% of data points are outliers