
The laundRy package performs many standard preprocessing steps on Tidyverse tibbles before they are used in statistical analysis and machine learning. Its functionality includes categorizing column types, imputing missing data, transforming and standardizing columns, and selecting features. laundRy aims to remove much of the grunt work from the typical data science workflow, leaving the analyst more time and energy to devote to modelling.

View the full documentation and a vignette at the laundRy home page.

Installation

Install the development version from GitHub with:

Features

  • categorize: Takes a dataframe and returns a list of two vectors named ‘numeric’ and ‘categorical’, each containing the names of the columns of that type.
    • Categorical criteria: All columns of type ‘factor’, and any columns with fewer than a specified number of unique values, are considered categorical. A ‘factor’ column is always categorical, regardless of its number of unique values.
    • Numeric criteria: All numeric columns with more than the specified number of unique values are considered numeric.
  • fill_missing: Takes a training feature dataframe, a testing feature dataframe, and a list of column types (like the output of categorize), and imputes missing values based on column type. Missing values in numeric columns are filled with either the mean or the median of the training feature dataframe; missing values in categorical columns are filled with the mode of the feature dataframe.

  • column_transformer: Takes a training feature dataframe, a testing feature dataframe, and a list of column types (like the output of categorize), and applies preprocessing to each column based on its type. Categorical columns are one-hot encoded and numeric columns are scaled; in both cases the transformation is fitted on the training dataframe.

  • feature_selection: Takes a training dataframe, a target vector, a target task (Regression or Classification), and a maximum number of features to select. Returns the features most important for predicting the target vector in the given task.
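
The categorize criteria and the training-based imputation described above can be illustrated with a small base R sketch. This is not laundRy's implementation or API: the helper names (sketch_categorize, sketch_mode, sketch_fill_missing), the max_cat and num_imp parameters, and the example data are all hypothetical. The sketch also makes two assumptions the text leaves open: a numeric column with exactly max_cat unique values is treated as numeric, and the categorical mode is computed from the training dataframe.

```r
# Hypothetical helpers for illustration only; not laundRy's actual code.

# Split column names into 'numeric' and 'categorical' using the stated criteria.
sketch_categorize <- function(df, max_cat = 10) {
  is_cat <- vapply(df, function(col) {
    n_unique <- length(unique(col[!is.na(col)]))
    # A factor is always categorical; otherwise use the unique-value threshold.
    is.factor(col) || n_unique < max_cat
  }, logical(1))
  # Assumption: numeric columns that are not categorical count as numeric,
  # including those with exactly max_cat unique values.
  is_num <- vapply(df, is.numeric, logical(1)) & !is_cat
  list(numeric = names(df)[is_num], categorical = names(df)[is_cat])
}

# Most frequent non-missing value of a vector.
sketch_mode <- function(x) {
  x <- x[!is.na(x)]
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

# Fill NAs in train and test using statistics computed on train only.
sketch_fill_missing <- function(train, test, column_list, num_imp = "mean") {
  for (col in column_list$numeric) {
    fill <- if (num_imp == "median") {
      median(train[[col]], na.rm = TRUE)
    } else {
      mean(train[[col]], na.rm = TRUE)
    }
    train[[col]][is.na(train[[col]])] <- fill
    test[[col]][is.na(test[[col]])] <- fill
  }
  for (col in column_list$categorical) {
    # Assumption: the mode comes from the training data.
    fill <- sketch_mode(train[[col]])
    train[[col]][is.na(train[[col]])] <- fill
    test[[col]][is.na(test[[col]])] <- fill
  }
  list(train = train, test = test)
}

# Made-up example data.
train <- data.frame(
  age  = c(21, 35, NA, 48, 52, 29),
  city = c("A", "B", "B", NA, "B", "A"),
  stringsAsFactors = FALSE
)
test <- data.frame(
  age  = c(NA, 40),
  city = c("A", NA),
  stringsAsFactors = FALSE
)

cols   <- sketch_categorize(train, max_cat = 4)
filled <- sketch_fill_missing(train, test, cols, num_imp = "median")
print(cols)
print(filled$test)
```

Computing the fill values on the training data and reusing them for the test data keeps information from the test set out of the preprocessing step, which is the reason the functions above take both dataframes.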

Dependencies

  • caret
  • dplyr
  • rlang
  • stats

laundRy in the R ecosystem

  • mice offers functionality similar to the fill_missing function, but is not integrated with a column categorizer.

  • The main feature selection and preprocessing package in R is caret, which provides functionality similar to our feature_selection function; laundRy makes the workflow more efficient and adds imputation.

  • As far as we know, no other package categorizes columns and returns a list of the categorized columns. laundRy is the first package we are aware of that abstracts away the full dataframe preprocessing workflow behind a unified and simple API.