Tutorial

This tutorial walks through an example workflow that uses the data_fixr package to clean and analyze data.

Installation

First, install the package from Test PyPI:

pip install -i https://test.pypi.org/simple/ data-fixr

Getting Started

Import the functions you need:

from data_fixr import (
    correlation_report,
    remove_duplicates,
    detect_anomalies,
    missing_values
)
import pandas as pd

Load Sample Data

Let’s create a sample dataset to demonstrate the package functionality:

# Create sample data with some issues
data = pd.DataFrame({
    'age': [25, 30, 25, 150, 35, 30, 28, 45, 32, 29],  # 150 is an anomaly
    'income': [50000, 60000, 50000, 70000, None, 60000, 55000, 85000, 62000, 58000],  # contains a missing value and duplicates
    'years_experience': [3, 7, 3, 25, 10, 7, 5, 20, 8, 6]
})
data
   age   income  years_experience
0   25  50000.0                 3
1   30  60000.0                 7
2   25  50000.0                 3
3  150  70000.0                25
4   35      NaN                10
5   30  60000.0                 7
6   28  55000.0                 5
7   45  85000.0                20
8   32  62000.0                 8
9   29  58000.0                 6

Detecting Missing Values

Check for missing values in the data and fill them in using the 'mode' method. For the other available methods, see the reference page.

missing_report = missing_values(data, method='mode')
missing_report
(   age   income  years_experience
 0   25  50000.0                 3
 1   30  60000.0                 7
 2   25  50000.0                 3
 3  150  70000.0                25
 4   35  50000.0                10
 5   30  60000.0                 7
 6   28  55000.0                 5
 7   45  85000.0                20
 8   32  62000.0                 8
 9   29  58000.0                 6,
 np.float64(3.3333333333333335))

The function returns a tuple. The first element is the DataFrame with the missing values filled in using the specified method. The second is the percentage of all DataFrame values that were originally missing and have been filled. Here, 1 of the 30 values was missing, which gives about 3.33%.
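To make the two returned values concrete, the following pandas-only sketch reproduces them for this dataset. It illustrates the idea rather than data_fixr's implementation, and it assumes that ties between modes are broken by taking the smallest value.

```python
import pandas as pd

data = pd.DataFrame({
    "age": [25, 30, 25, 150, 35, 30, 28, 45, 32, 29],
    "income": [50000, 60000, 50000, 70000, None, 60000, 55000, 85000, 62000, 58000],
    "years_experience": [3, 7, 3, 25, 10, 7, 5, 20, 8, 6],
})

# Share of all cells that are missing, as a percentage (1 of 30 here).
pct_missing = data.isna().sum().sum() / data.size * 100

# Series.mode() returns every mode in ascending order. Taking the first one
# is an assumption about how the tie between 50000 and 60000 is broken.
income_mode = data["income"].mode().iloc[0]
filled = data.fillna({"income": income_mode})

print(round(pct_missing, 2))    # 3.33
print(filled.loc[4, "income"])  # 50000.0
```

Because missing_values returns a tuple, its result can be unpacked into two names (a DataFrame and a percentage) instead of being kept as a single object.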

Removing Duplicates

Identify and remove duplicate rows, keeping the first instance of the duplicated observation:

clean_data = remove_duplicates(data, keep='first', report=True)
clean_data
(   age   income  years_experience
 0   25  50000.0                 3
 1   30  60000.0                 7
 3  150  70000.0                25
 4   35      NaN                10
 6   28  55000.0                 5
 7   45  85000.0                20
 8   32  62000.0                 8
 9   29  58000.0                 6,
 {'total_rows': 10,
  'duplicate_rows': 2,
  'rows_removed': 2,
  'strategy': 'first',
  'cols_used': None})

In addition to removing the duplicated observations, the function produces a report containing the total number of rows, the number of duplicates detected, the number of rows removed, the strategy used, and the columns used for comparison. For full details of the report parameter, see the reference page.
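For comparison, the same counts can be derived with plain pandas. This is a minimal sketch, not the package's code. The `summary` dictionary is a hypothetical stand-in that mirrors three keys of the report shown above.

```python
import pandas as pd

data = pd.DataFrame({
    "age": [25, 30, 25, 150, 35, 30, 28, 45, 32, 29],
    "income": [50000, 60000, 50000, 70000, None, 60000, 55000, 85000, 62000, 58000],
    "years_experience": [3, 7, 3, 25, 10, 7, 5, 20, 8, 6],
})

# Mark rows that repeat an earlier row; keep="first" leaves the first
# occurrence unmarked so that it survives the drop.
dup_mask = data.duplicated(keep="first")
deduped = data.drop_duplicates(keep="first")

# Hypothetical summary mirroring part of the report shown above.
summary = {
    "total_rows": len(data),
    "duplicate_rows": int(dup_mask.sum()),
    "rows_removed": len(data) - len(deduped),
}
print(summary)  # {'total_rows': 10, 'duplicate_rows': 2, 'rows_removed': 2}
```

Rows 2 and 5 are dropped because they repeat rows 0 and 1, so the original index labels of the remaining rows are preserved.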

Detecting Anomalies

Find outliers (anomalous values) in the age and years_experience columns using the default zscore method:

anomalies = detect_anomalies(data[['age', 'years_experience']])
anomalies
(   age  years_experience  age_outlier  years_experience_outlier
 0   25                 3        False                     False
 1   30                 7        False                     False
 2   25                 3        False                     False
 3  150                25         True                      True
 4   35                10        False                     False
 5   30                 7        False                     False
 6   28                 5        False                     False
 7   45                20        False                     False
 8   32                 8        False                     False
 9   29                 6        False                     False,
 np.float64(10.0))

The result is a tuple. The first element is a DataFrame with one boolean (True/False) column per input column marking which observations are outliers. The second is the percentage of outlying values across all numeric columns.
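The z-score idea behind these flags can be sketched with pandas alone. The cutoff of 2.0 and the use of the sample standard deviation are assumptions chosen because they reproduce the flags shown above. The package's actual default threshold is not stated in this tutorial.

```python
import pandas as pd

data = pd.DataFrame({
    "age": [25, 30, 25, 150, 35, 30, 28, 45, 32, 29],
    "years_experience": [3, 7, 3, 25, 10, 7, 5, 20, 8, 6],
})

THRESHOLD = 2.0  # assumed cutoff, not a documented data_fixr default

# Standardize each column: distance from the column mean in units of
# the column's sample standard deviation (pandas uses ddof=1).
z_scores = (data - data.mean()) / data.std()

# A value is flagged when its absolute z-score exceeds the cutoff.
flags = z_scores.abs() > THRESHOLD

# Percentage of flagged cells across both columns.
pct_outliers = flags.to_numpy().mean() * 100

print(flags[flags.any(axis=1)])
print(pct_outliers)
```

Under these assumptions only row 3 is flagged in both columns, which gives 2 flagged values out of 20.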

Note: We exclude the income column here because it contains missing values. The detect_anomalies function requires complete data for anomaly detection. You can handle missing values first using the missing_values function.

The function can also use the 'iqr' method for detecting anomalies. See the reference page for more information.
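As a rough illustration of what an IQR-based rule does, the sketch below applies the conventional 1.5 × IQR fences with pandas. The multiplier and the quantile interpolation are assumptions. How data_fixr implements its 'iqr' method is described on the reference page, not here.

```python
import pandas as pd

data = pd.DataFrame({
    "age": [25, 30, 25, 150, 35, 30, 28, 45, 32, 29],
    "years_experience": [3, 7, 3, 25, 10, 7, 5, 20, 8, 6],
})

# Quartiles and interquartile range, computed per column.
q1 = data.quantile(0.25)
q3 = data.quantile(0.75)
iqr = q3 - q1

# Conventional fences: 1.5 * IQR beyond the quartiles (assumed multiplier).
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Flag values that fall outside the fences of their own column.
flags = data.lt(lower, axis="columns") | data.gt(upper, axis="columns")

print(flags[flags.any(axis=1)])
```

In this sketch the rule flags row 7 (age 45, 20 years of experience) in addition to row 3, which shows that the choice of method can change which observations count as anomalies.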

Correlation Analysis

Generate a correlation report showing pairwise correlations between all numeric variables in the data using the Spearman correlation method:

corr_report = correlation_report(data, method='spearman')
corr_report
  feature_1         feature_2  correlation  abs_correlation
0       age            income     0.983051         0.983051
1       age  years_experience     1.000000         1.000000
2    income  years_experience     0.983051         0.983051

Each row in the returned report gives the correlation for one unique pair of numeric features, together with its absolute value. See the reference page for alternative values of the method parameter.
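A report with this shape can be approximated with pandas by computing a correlation matrix and keeping one row per unique pair of columns. This is a sketch under the assumption that the report is derived from a standard pairwise correlation matrix. It is not the package's implementation.

```python
from itertools import combinations

import pandas as pd

data = pd.DataFrame({
    "age": [25, 30, 25, 150, 35, 30, 28, 45, 32, 29],
    "income": [50000, 60000, 50000, 70000, None, 60000, 55000, 85000, 62000, 58000],
    "years_experience": [3, 7, 3, 25, 10, 7, 5, 20, 8, 6],
})

# Rank-based (Spearman) correlation matrix; rows with a missing value
# are excluded pair by pair.
corr_matrix = data.corr(method="spearman")

# Keep one row per unordered pair of columns, skipping the diagonal.
rows = []
for first, second in combinations(corr_matrix.columns, 2):
    value = corr_matrix.loc[first, second]
    rows.append({
        "feature_1": first,
        "feature_2": second,
        "correlation": value,
        "abs_correlation": abs(value),
    })
pairs = pd.DataFrame(rows)

print(pairs)
```

The age and years_experience columns have identical rank orderings in this sample, so their Spearman correlation is 1 even though the relationship between the raw values is not linear.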