Welcome to group_15_data_wrangling

Package Summary

This package aims to simplify data wrangling for the Adult Census Income dataset found here. It makes it easier for anyone who wants to work with the data to get the dataset into a clean format quickly and start their analysis right away.

Functions

  • set_dtype()
    • sets the data types for each column in the dataset to reduce memory requirements
  • cat_mode_impute()
    • performs imputation of missing values
  • clean_col_name()
    • replaces “.” with “-” in column names and makes some names more meaningful
  • encode_income_binary()
    • encodes the target feature, income, as binary

See all function documentation here: Function documentation
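
The ideas behind three of the functions listed above can be illustrated on plain Python lists. This is only a rough sketch and not the package's implementation: the helper names `impute_mode`, `clean_name`, and `encode_binary` are hypothetical, and the use of `None` for missing values, the `education.num` column name, and the `>50K` positive label are assumptions made for the example.

```python
from collections import Counter


def impute_mode(values):
    """Replace None entries with the most frequent non-missing value."""
    observed = [v for v in values if v is not None]
    if not observed:
        return list(values)  # nothing to impute from
    mode = Counter(observed).most_common(1)[0][0]
    return [mode if v is None else v for v in values]


def clean_name(name):
    """Replace '.' with '-' in a column name."""
    return name.replace(".", "-")


def encode_binary(labels, positive=">50K"):
    """Map the assumed positive label to 1 and everything else to 0."""
    return [1 if label == positive else 0 for label in labels]


# Toy columns standing in for the real dataset (assumed values)
workclass = ["Private", None, "Private", "State-gov"]
income = ["<=50K", ">50K", "<=50K", ">50K"]

print(impute_mode(workclass))       # ['Private', 'Private', 'Private', 'State-gov']
print(clean_name("education.num"))  # education-num
print(encode_binary(income))        # [0, 1, 0, 1]
```

The package functions operate on whole pandas dataframes rather than single lists; see the function documentation for their actual behavior.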

How this package fits into the Python ecosystem

There is an existing package called pyjanitor that has many useful data cleaning routines. These are general-purpose and very powerful. Our package is much more narrowly focused on cleaning the Adult Census Income dataset. In general, pyjanitor is the more broadly useful package, but for tidying this specific dataset our functions will probably do the job with less effort.

pyjanitor

Get started

You can install this package into your preferred Python environment using pip:

pip install --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple group_15_data_wrangling

The data can be downloaded from Kaggle, or fetched from the UC Irvine Machine Learning Repository with the ucimlrepo package, which you can install using:

pip install ucimlrepo

To use group_15_data_wrangling in your code:

# imports
from group_15_data_wrangling import cat_mode_impute, clean_col_name, encode_income_binary, set_dtype
from ucimlrepo import fetch_ucirepo 
  
# fetch dataset 
adult = fetch_ucirepo(id=2) 
  
# census data as pandas dataframes
X = adult.data.features 
y = adult.data.targets 
census_df = X.join(y)

# use functions from this package
# (each call below is applied to the original dataframe independently)
census_df_no_nan = cat_mode_impute(census_df)
census_df_clean_names = clean_col_name(census_df)
census_df_binary_income = encode_income_binary(census_df)
census_df_concise_dtypes = set_dtype(census_df)

Documentation

Documentation

Contributing

GitHub repo

Repo

Get started contributing

git clone <repo> # clone group_15_data_wrangling repo
cd group_15_data_wrangling # cd into the project directory
conda env create -f environment.yml # set up dev environment
conda activate group-15-env # activate the env
pip install -e ".[tests,dev,docs]" # install package and development dependencies (quoted so shells such as zsh do not expand the brackets)

Run tests

pytest --cov=src --cov-branch --cov-report=term-missing

Build, preview, and deploy documentation

quartodoc build
quarto preview

The documentation website updates automatically on push to main.

To force an update of the documentation website from the current branch, run the following (you probably should not do this):

quarto publish gh-pages

Contributors

  • Limor Winter
  • Shihan Xu
  • Zaki Aslam
  • Michael Eirikson