The goal of Rmleda is to build a R package called Rmleda with relevant functions to help with preliminary EDA for a given dataset. The package contains functions and classes that help perform various data preparation and wrangling tasks such as data splitting, exploration, imputation, and scaling. These functionalities were identified as commonly-performed tasks in supervised machine learning settings but may provide value in other project types as well.

The Rmleda package will include the following classes/functions:

  • SupervisedData is a function that splits a Data Frame into train and test sets and further into X and y subsets based on a list of user-provided columns.

  • dftype() function will return the type of columns and variables for the input data frame. Furthermore, if there are non-numeric columns, it will return the unique values of non-numeric columns and their length.

  • autoimpute_na() function to identify and impute missing values for different attributes in a given dataframe.

  • dfscaling() function to apply scaling and centering to the numerical features in a given input dataframe.

The Rmleda package is intended to help with EDA for supervised machine learning tasks; there are other existing packages that contain some similar functionality. For example, the packages, namely tidymodels, MICE, Amelia, missForest, Hmisc, and mi, provide users with functions to analyze attribute types, detect missing values and fill them, and scale input data. Our Rmleda package intends to augment the existing functionality of these packages with some additional features and tie it all into a single useful package.

The autoimpute_na() function will additionally detect some common non-standard missing values manually entered by users (e.g., “not available”, “n/a”, “na”, “-”) while identifying and imputing missing data. The output of the autoimpute_na() function will be a dataframe with imputed values.

In supervised machine learning, data splitting is often a multi-step process that involves splitting the dataset of interest into test and train portions and then further into X(features) and y(target class) subsets. Typically, the user has to create and keep track of different variables that hold each of these subsets of the data. supervised_data in Rmleda is a function that builds upon the initial_split function from the tidymodels package. supervised_data provides the user with quick access to appropriately-named attributes referring to each of the aforementioned variables so they can be easily accessed for use in subsequent steps of the machine learning pipeline.

Installation

If you do not have the devtools package, you can install it via CRAN with:

install.packages("devtools")

Then, install Rmleda from GitHub as follows:

devtools::install_github("UBC-MDS/Rmleda")

Lastly, load the package:

library(Rmleda)

Example

  • Check the datatypes and a summary of your input dataframe:
Rmleda::dftype(df)
  • Impute NAs in your input dataframe:
Rmleda::autoimpute_na(df)
  • Apply centering and scaling to the numeric features in your input dataframe:
Rmleda::dfscaling(df, target)
  • Split the data into X train, y train, X test, and y test subsets in one convenient class call using SupervisedData:
super_data <- Rmleda::supervised_data(df, xcols = c('feature1', 'feature2'),ycol = c('target'))