PrepR
PrepR is a package for R that helps with preprocessing in machine learning tasks. Certain repetitive tasks come up in almost every machine learning project, and this package aims to alleviate those chores. Some of the tasks that come up regularly are: finding the type of each column in a dataframe, splitting the data (whether into train/test sets or train/validation/test sets), one-hot encoding, and scaling features. This package helps with all of those tasks.
You can install the released version of PrepR from CRAN with:
install.packages("PrepR")
And the development version from GitHub with:
# install.packages("devtools")
devtools::install_github("UBC-MDS/PrepR")
This package has the following features:
train_valid_test_split: This function splits the data set into train, validation, and test sets.
data_type: This function identifies the data type of each column/feature. It returns one dataframe for each type of data.
one_hot: This function performs one-hot encoding on the categorical features and returns a dataframe with sensible column names.
scaler: This function performs standard scaling on the numerical features.
Identify features of different data types
data_type(my_data)['num']
data_type(my_data)['cat']
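To make the idea behind data_type concrete, here is a minimal base R sketch of splitting a dataframe by column type. It is not PrepR's implementation: the helper name split_by_type and the sample data are hypothetical, and it assumes that "categorical" means factor or character columns.

```r
# Hypothetical helper (not part of PrepR): split a dataframe into
# numeric and categorical columns using base R only.
split_by_type <- function(df) {
  is_num <- vapply(df, is.numeric, logical(1))
  # Assumption: factor and character columns count as categorical.
  is_cat <- vapply(df, function(col) is.factor(col) || is.character(col), logical(1))
  list(
    num = df[, is_num, drop = FALSE],
    cat = df[, is_cat, drop = FALSE]
  )
}

# Made-up example data.
my_data <- data.frame(
  age = c(23, 35, 41),
  income = c(40000, 52000, 61000),
  city = c("Vancouver", "Toronto", "Vancouver")
)

parts <- split_by_type(my_data)
names(parts$num)  # "age" "income"
names(parts$cat)  # "city"
```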
One-hot encode features of categorical type
one_hot(my_data)
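As an illustration of what one-hot encoding with readable column names can look like, the following base R sketch builds one indicator column per category level and names it column_level. The function one_hot_sketch is hypothetical and only approximates the behaviour described above; PrepR's own one_hot may name columns differently or handle missing values in another way.

```r
# Hypothetical sketch (not PrepR's one_hot): one 0/1 column per level,
# named "<column>_<level>". Numeric columns are left out of the result.
one_hot_sketch <- function(df) {
  is_cat <- vapply(df, function(col) is.factor(col) || is.character(col), logical(1))
  out <- list()
  for (col in names(df)[is_cat]) {
    for (lvl in levels(as.factor(df[[col]]))) {
      out[[paste(col, lvl, sep = "_")]] <- as.integer(df[[col]] == lvl)
    }
  }
  as.data.frame(out)
}

# Made-up example data.
shirts <- data.frame(
  city = c("Vancouver", "Toronto", "Vancouver"),
  size = c("S", "L", "S"),
  n = 1:3
)

encoded <- one_hot_sketch(shirts)
names(encoded)  # "city_Toronto" "city_Vancouver" "size_L" "size_S"
```

This sketch keeps every level as its own column, which favours readability over a compact design matrix.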
Train, validation, and test split
train_valid_test_split(X, y, test_size, valid_size, train_size, stratify, random_state, shuffle)
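The three-way split can be sketched in base R as below. This is a simplified illustration, not PrepR's function: the name three_way_split and its arguments are hypothetical, it takes a single dataframe rather than separate X and y, it always shuffles, and it does not stratify.

```r
# Hypothetical sketch (not PrepR's train_valid_test_split): shuffle the row
# indices once, then cut them into test, validation, and train pieces.
three_way_split <- function(df, valid_size = 0.2, test_size = 0.2, seed = 123) {
  stopifnot(valid_size + test_size < 1)
  set.seed(seed)  # plays the role of a random_state argument
  n <- nrow(df)
  idx <- sample.int(n)
  n_test <- round(test_size * n)
  n_valid <- round(valid_size * n)
  test_idx <- idx[seq_len(n_test)]
  valid_idx <- idx[n_test + seq_len(n_valid)]
  train_idx <- idx[seq_len(n) > n_test + n_valid]
  list(
    train = df[train_idx, , drop = FALSE],
    valid = df[valid_idx, , drop = FALSE],
    test = df[test_idx, , drop = FALSE]
  )
}

# Made-up example data with 100 rows.
toy <- data.frame(id = 1:100, x = seq(0.5, 50, by = 0.5))
splits <- three_way_split(toy, valid_size = 0.2, test_size = 0.2)
sapply(splits, nrow)  # train 60, valid 20, test 20
```

Because the indices are shuffled once and then sliced, every row lands in exactly one of the three sets.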
Standard scaling of numerical features
X_train = scaler(x_train, x_test, colnames)['x_train']
X_test = scaler(x_train, x_test, colnames)['x_test']
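For readers who want to see the arithmetic, here is a base R sketch of standard scaling. It assumes the common convention that the mean and standard deviation are computed on the training set and then applied to both sets; whether PrepR's scaler does exactly this is not stated here. The function name scale_with_train_stats and the argument name cols are hypothetical.

```r
# Hypothetical sketch (not PrepR's scaler): standardise selected columns
# using statistics estimated on the training data only.
scale_with_train_stats <- function(x_train, x_test, cols) {
  centers <- vapply(x_train[cols], mean, numeric(1))
  sds <- vapply(x_train[cols], sd, numeric(1))
  x_train[cols] <- scale(x_train[cols], center = centers, scale = sds)
  x_test[cols] <- scale(x_test[cols], center = centers, scale = sds)
  list(x_train = x_train, x_test = x_test)
}

# Made-up example data: column "a" has training mean 2 and standard deviation 1.
train_df <- data.frame(a = c(1, 2, 3), label = c("x", "y", "z"))
test_df <- data.frame(a = c(2, 4), label = c("x", "y"))

scaled <- scale_with_train_stats(train_df, test_df, cols = "a")
scaled$x_train$a  # -1 0 1
scaled$x_test$a   # 0 2
```

When a function returns both scaled sets in one list, as in this sketch, calling it once and storing the result avoids repeating the computation for each set.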
There are no packages that do everything that this one does, but there are packages for machine learning in R that this package will make use of. The caret package in R does some of these preprocessing steps, such as the train/test split. It does not, however, have a function that takes a dataframe and returns multiple dataframes based on column type; this is the gap that this package's data_type function seeks to fill. In addition, caret's version of the train/test split does not seem to offer an option for a validation set, which would be useful to anyone doing machine learning. The caret package also does one-hot encoding with the function dummyVars, though one_hot() in the PrepR package is more intuitive: caret's function does not return meaningful, human-readable column names. It also removes one column by default, which is better for fast computation but worse for human-readability.
Overall, this package fits in well with the R ecosystem and helps make machine learning a little easier.
The official documentation is hosted on Read the Docs: https://PrepR.readthedocs.io/en/latest/
This package was created using the Whole Game chapter from the R Packages handbook by Hadley Wickham.