PrepR is an R package that helps with preprocessing in machine learning tasks. It includes four main functions, each corresponding to an important step in a typical data preprocessing pipeline.

library(PrepR)

Load Data

To explore the functions in PrepR, let’s first load some toy data. Here we have one categorical column and two numerical columns.

fruits_df <- data.frame(
  fruits = c("apple", "pear", "apple", "banana", "orange"),
  count = c(1L, 2L, 3L, 4L, 5L),
  price = c(2, 3.4, 0.5, 7, 8)
)
fruits_y <- data.frame(target = c(1, 2, 3, 4, 5))
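
Before doing any preprocessing, it can help to glance at the column types of the toy data. A minimal check using base R:

# Quick look at the structure of the toy data (base R)
str(fruits_df)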

Train Test Split

The first step in a machine learning pipeline is to split the data into train, validation, and test datasets. We will demonstrate this using the function train_valid_test_split.

# Call the split once and reuse the result so the x and y pieces stay aligned
splits <- train_valid_test_split(fruits_df, fruits_y)
x_train <- splits$x_train
x_valid <- splits$x_valid
x_test <- splits$x_test
y_train <- splits$y_train
y_valid <- splits$y_valid
y_test <- splits$y_test
x_train
#>   fruits count price
#> 5 orange     5   8.0
#> 2   pear     2   3.4
#> 3  apple     3   0.5
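
The split is made at random (notice the shuffled row indices above), so for a reproducible partition you can fix R’s random seed before splitting. This is a minimal sketch using base R’s set.seed; whether train_valid_test_split also accepts its own seed argument is not shown here.

# Fix the random seed before splitting so the partition is reproducible (base R)
set.seed(123)
splits <- train_valid_test_split(fruits_df, fruits_y)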

Distinguish Data Types

Next, we need to know the variable type of each column in order to do further transformation or analysis on it. Let’s use our data_type function to separate the categorical and numeric columns.

numeric_variables_train <- PrepR::data_type(x_train)$num
numeric_variables_valid <- PrepR::data_type(x_valid)$num
numeric_variables_test <- PrepR::data_type(x_test)$num
categorical_variables <- PrepR::data_type(x_train)$cat
categorical_variables
#>   fruits
#> 5 orange
#> 2   pear
#> 3  apple
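
As the code above suggests, data_type returns a list with a $num and a $cat component. You can confirm which columns landed in the numeric split with base R:

# Check which columns were classified as numeric (base R)
names(numeric_variables_train)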

One Hot Encode

One-hot encoding is a powerful transformation that allows us to work with categorical data. Let’s take a look at what our onehot function does.

onehot(categorical_variables)
#>   apple banana orange pear
#> 1     0      0      1    0
#> 2     0      0      0    1
#> 3     1      0      0    0
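
In a typical pipeline you would then join the encoded categorical columns back onto the numeric columns to rebuild the training feature matrix. A minimal sketch using base R’s cbind; both pieces come from the same x_train, so their rows line up:

# Reassemble the training features: encoded categoricals + numeric columns (base R)
x_train_features <- cbind(onehot(categorical_variables), numeric_variables_train)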

Numeric Scaling

Now we’re left with only our numeric variables. We usually want to normalize numeric variables for better modelling performance, for example with the k-Nearest Neighbours algorithm. The scaler function in our package will do this for you.

scaler(numeric_variables_train, numeric_variables_valid, numeric_variables_test, c("count", "price"))$X_train
#>        count      price
#> 5  1.0910895  1.0664622
#> 2 -0.8728716 -0.1498335
#> 3 -0.2182179 -0.9166287
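
To finish the pipeline you would also want the scaled validation and test sets. A minimal sketch, assuming the list returned by scaler exposes them under names like X_valid and X_test (hypothetical names here, mirroring the X_train component used above):

# Store the full scaler result once, then pull out each scaled split
# NOTE: X_valid and X_test are assumed component names, mirroring X_train
scaled <- scaler(numeric_variables_train, numeric_variables_valid, numeric_variables_test, c("count", "price"))
x_valid_scaled <- scaled$X_valid
x_test_scaled <- scaled$X_test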