Tutorial
This tutorial walks through an example workflow for using the data_fixr package to clean and analyze your data.
Installation
First, install the package from Test PyPI:
pip install -i https://test.pypi.org/simple/ data-fixr
Getting Started
Import the functions you need:
from data_fixr import (
    correlation_report,
    remove_duplicates,
    detect_anomalies,
    missing_values,
)
import pandas as pd
Load Sample Data
Let’s create a sample dataset to demonstrate the package functionality:
# Create sample data with some issues
data = pd.DataFrame({
    'age': [25, 30, 25, 150, 35, 30, 28, 45, 32, 29],  # 150 is an anomaly
    'income': [50000, 60000, 50000, 70000, None, 60000, 55000, 85000, 62000, 58000],  # contains a missing value and duplicates
    'years_experience': [3, 7, 3, 25, 10, 7, 5, 20, 8, 6]
})
data
| | age | income | years_experience |
|---|---|---|---|
| 0 | 25 | 50000.0 | 3 |
| 1 | 30 | 60000.0 | 7 |
| 2 | 25 | 50000.0 | 3 |
| 3 | 150 | 70000.0 | 25 |
| 4 | 35 | NaN | 10 |
| 5 | 30 | 60000.0 | 7 |
| 6 | 28 | 55000.0 | 5 |
| 7 | 45 | 85000.0 | 20 |
| 8 | 32 | 62000.0 | 8 |
| 9 | 29 | 58000.0 | 6 |
Detecting Missing Values
Check for missing values in the data and fill them in using method='mode'. For alternative method values, see the reference page.
missing_report = missing_values(data, method='mode')
missing_report
(   age   income  years_experience
 0   25  50000.0                 3
 1   30  60000.0                 7
 2   25  50000.0                 3
 3  150  70000.0                25
 4   35  50000.0                10
 5   30  60000.0                 7
 6   28  55000.0                 5
 7   45  85000.0                20
 8   32  62000.0                 8
 9   29  58000.0                 6,
 np.float64(3.3333333333333335))
The function returns a tuple. The first element is the DataFrame with missing values filled using the specified method. The second is the percentage of all DataFrame values that were originally missing and have been filled (here, 1 of 30 values, or about 3.33%).
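For intuition, the same result can be reproduced with plain pandas. This is a minimal sketch of the idea, not data_fixr's actual implementation. It assumes that ties between modes are broken by taking the smallest value, which matches the output shown above (income has two modes, 50000 and 60000).

```python
import pandas as pd

data = pd.DataFrame({
    "age": [25, 30, 25, 150, 35, 30, 28, 45, 32, 29],
    "income": [50000, 60000, 50000, 70000, None, 60000, 55000, 85000, 62000, 58000],
    "years_experience": [3, 7, 3, 25, 10, 7, 5, 20, 8, 6],
})

# Percentage of all cells (rows x columns) that are missing before filling.
n_missing = int(data.isna().sum().sum())
pct_missing = n_missing / data.size * 100

# DataFrame.mode() returns the modes of each column in ascending order;
# row 0 therefore holds the smallest mode of every column (assumed tie-break).
first_modes = data.mode().iloc[0]

# Fill each column's missing values with that column's mode.
filled = data.fillna(first_modes)

print(filled)
print(pct_missing)
```

Because the fill value depends on how mode ties are resolved, it is worth checking a column's modes before relying on this method for columns with many repeated values.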
Removing Duplicates
Identify and remove duplicate rows, keeping the first instance of each duplicated observation:
clean_data = remove_duplicates(data, keep='first', report=True)
clean_data
(   age   income  years_experience
 0   25  50000.0                 3
 1   30  60000.0                 7
 3  150  70000.0                25
 4   35      NaN                10
 6   28  55000.0                 5
 7   45  85000.0                20
 8   32  62000.0                 8
 9   29  58000.0                 6,
 {'total_rows': 10,
  'duplicate_rows': 2,
  'rows_removed': 2,
  'strategy': 'first',
  'cols_used': None})
As well as removing the duplicated observations, the function produces a report of the number of duplicates detected, the number of rows removed, the strategy used, and the columns considered. For full details of the report parameter, see the reference page.
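The deduplicated rows and the counts in the report can be cross-checked with plain pandas. The sketch below is an illustrative equivalent, not the package's own code. The dictionary keys simply mirror those in the output above, and the `report` variable is named here for illustration only.

```python
import pandas as pd

data = pd.DataFrame({
    "age": [25, 30, 25, 150, 35, 30, 28, 45, 32, 29],
    "income": [50000, 60000, 50000, 70000, None, 60000, 55000, 85000, 62000, 58000],
    "years_experience": [3, 7, 3, 25, 10, 7, 5, 20, 8, 6],
})

# Mark every repeat of an earlier row; the first occurrence stays unmarked.
dup_mask = data.duplicated(keep="first")

# Drop the marked rows. The original index labels are preserved,
# which is why labels 2 and 5 are absent from the result.
deduped = data.drop_duplicates(keep="first")

report = {
    "total_rows": len(data),
    "duplicate_rows": int(dup_mask.sum()),
    "rows_removed": len(data) - len(deduped),
}

print(deduped)
print(report)
```

Rows 2 and 5 are exact copies of rows 0 and 1, so two rows are dropped and eight remain.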
Detecting Anomalies
Find outliers (anomalous values) in the age and years_experience columns using the default z-score method:
anomalies = detect_anomalies(data[['age', 'years_experience']])
anomalies
(   age  years_experience  age_outlier  years_experience_outlier
 0   25                 3        False                     False
 1   30                 7        False                     False
 2   25                 3        False                     False
 3  150                25         True                      True
 4   35                10        False                     False
 5   30                 7        False                     False
 6   28                 5        False                     False
 7   45                20        False                     False
 8   32                 8        False                     False
 9   29                 6        False                     False,
 np.float64(10.0))
The result is a tuple. The first element is a DataFrame that flags outlying observations in boolean (True/False) columns, one per input column. The second is the percentage of values flagged as outliers across all numeric columns (here, 2 of 20 values, or 10%).
Note: We exclude the income column here because it contains missing values. The detect_anomalies function requires complete data for anomaly detection. You can handle missing values first using the missing_values function.
The function can also use the ‘iqr’ method for detecting anomalies; see the reference page for more information.
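To make the z-score idea concrete, here is a minimal pandas sketch. It is not data_fixr's implementation. The threshold of 2.0 and the use of the population standard deviation (`ddof=0`) are assumptions, chosen because they reproduce the flags shown above. The package's actual defaults are documented on the reference page.

```python
import pandas as pd

data = pd.DataFrame({
    "age": [25, 30, 25, 150, 35, 30, 28, 45, 32, 29],
    "years_experience": [3, 7, 3, 25, 10, 7, 5, 20, 8, 6],
})

THRESHOLD = 2.0  # assumed cut-off, not confirmed as the package default

# z-score: distance from the column mean in units of the column's
# standard deviation (population form, ddof=0, is an assumption).
z_scores = (data - data.mean()) / data.std(ddof=0)

# A value is flagged when its absolute z-score exceeds the threshold.
outlier_flags = (z_scores.abs() > THRESHOLD).add_suffix("_outlier")

result = pd.concat([data, outlier_flags], axis=1)

# Share of all checked values that were flagged, as a percentage.
outlier_pct = outlier_flags.to_numpy().sum() / outlier_flags.size * 100

print(result)
print(outlier_pct)
```

With a stricter threshold such as 3.0, no value in this small sample would be flagged. The extreme value inflates the standard deviation it is measured against, a known weakness of z-scores on small datasets and one reason to consider the IQR method.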
Correlation Analysis
Generate a correlation report showing pairwise correlations between all numeric variables in the data using the Spearman rank correlation method:
corr_report = correlation_report(data, method='spearman')
corr_report
| | feature_1 | feature_2 | correlation | abs_correlation |
|---|---|---|---|---|
| 0 | age | income | 0.983051 | 0.983051 |
| 1 | age | years_experience | 1.000000 | 1.000000 |
| 2 | income | years_experience | 0.983051 | 0.983051 |
Each row in the returned report gives the correlation for one unique pair of numeric features, together with its absolute value. See the reference page for alternative values of the method parameter.
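A report of this shape can be approximated with pandas alone, which helps when checking the numbers by hand. The sketch below is an illustrative equivalent rather than the package's code. It relies on `DataFrame.corr`, which drops missing values pair by pair, so the row with a missing income is excluded only from the pairs that involve income.

```python
from itertools import combinations

import pandas as pd

data = pd.DataFrame({
    "age": [25, 30, 25, 150, 35, 30, 28, 45, 32, 29],
    "income": [50000, 60000, 50000, 70000, None, 60000, 55000, 85000, 62000, 58000],
    "years_experience": [3, 7, 3, 25, 10, 7, 5, 20, 8, 6],
})

# Full symmetric correlation matrix based on ranks (Spearman).
corr_matrix = data.corr(method="spearman")

# Keep one row per unordered pair of distinct features.
rows = [
    {"feature_1": a, "feature_2": b, "correlation": corr_matrix.loc[a, b]}
    for a, b in combinations(corr_matrix.columns, 2)
]
pairs = pd.DataFrame(rows)
pairs["abs_correlation"] = pairs["correlation"].abs()

print(pairs)
```

The age and years_experience columns have identical rank orderings, so their Spearman correlation is exactly 1 even though age contains an extreme value of 150.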