PrepR is an R package that helps with preprocessing in machine learning tasks. It includes four main functions, each corresponding to an important step in a general data preprocessing pipeline.
To explore the functions in PrepR, let’s first load some toy data. Here we have one categorical column and two numeric columns, plus a separate data frame holding the target:
fruits_df <- data.frame(fruits = c("apple", "pear", "apple", "banana", "orange"), count = c(1L, 2L, 3L, 4L, 5L), price = c(2, 3.4, 0.5, 7, 8))
fruits_y <- data.frame(target = c(1, 2, 3, 4, 5))
The first thing to do in a machine learning pipeline is to split the data into train, validation, and test datasets. We will demonstrate this using the function train_valid_test_split:
# Call the function once so that all six pieces come from the same split.
splits <- train_valid_test_split(fruits_df, fruits_y)
x_train <- splits$x_train
x_valid <- splits$x_valid
x_test <- splits$x_test
y_train <- splits$y_train
y_valid <- splits$y_valid
y_test <- splits$y_test
x_train
#> fruits count price
#> 5 orange 5 8.0
#> 2 pear 2 3.4
#> 3 apple 3 0.5
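To make the idea of a three-way split concrete, here is a minimal base R sketch. It is not PrepR's implementation: the helper `split_indices` and its `valid_frac` and `test_frac` arguments are hypothetical names introduced only for illustration, and the real train_valid_test_split may choose sizes and shuffle differently.

```r
# Toy data, repeated here so the sketch is self-contained.
fruits_df <- data.frame(
  fruits = c("apple", "pear", "apple", "banana", "orange"),
  count = c(1L, 2L, 3L, 4L, 5L),
  price = c(2, 3.4, 0.5, 7, 8)
)
fruits_y <- data.frame(target = c(1, 2, 3, 4, 5))

# Hypothetical helper (not part of PrepR): shuffle the row indices once,
# then cut the shuffled vector into validation, test, and training parts.
split_indices <- function(n, valid_frac = 0.2, test_frac = 0.2) {
  shuffled <- sample.int(n)              # random permutation of 1..n
  n_valid <- floor(n * valid_frac)
  n_test <- floor(n * test_frac)
  position <- seq_len(n)
  list(
    valid = shuffled[position <= n_valid],
    test = shuffled[position > n_valid & position <= n_valid + n_test],
    train = shuffled[position > n_valid + n_test]
  )
}

set.seed(123)                            # make the shuffle reproducible
idx <- split_indices(nrow(fruits_df))

# Index features and target with the same row indices so they stay aligned.
x_train <- fruits_df[idx$train, ]
y_train <- fruits_y[idx$train, , drop = FALSE]
```

Because the indices are drawn once and reused for both data frames, each feature row keeps its matching target row.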
Next, we need to know the variable types of the columns so that we can apply further transformations or analysis to them. Let’s use our data_type function to separate the categorical and numeric columns.
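As a rough illustration of what separating columns by type involves, the following base R sketch partitions a data frame into numeric and non-numeric columns. It does not call data_type, and it makes no claim about that function's signature or return value; the variable names are chosen only for this example.

```r
# A small data frame with one character column and two numeric columns.
toy <- data.frame(
  fruits = c("apple", "pear", "apple"),
  count = c(1L, 2L, 3L),
  price = c(2, 3.4, 0.5),
  stringsAsFactors = FALSE
)

# Flag each column as numeric or not (integer columns count as numeric).
is_num <- vapply(toy, is.numeric, logical(1))

# drop = FALSE keeps the result a data frame even with a single column.
numeric_part <- toy[, is_num, drop = FALSE]
categorical_part <- toy[, !is_num, drop = FALSE]

names(numeric_part)      # "count" "price"
names(categorical_part)  # "fruits"
```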
One-hot encoding is a transformation that lets us use categorical data in models that expect numeric input: each category becomes its own 0/1 column. Let’s take a look at what our onehot function produces:
#> apple banana orange pear
#> 1 0 0 1 0
#> 2 0 0 0 1
#> 3 1 0 0 0
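The encoding shown above can be reproduced by hand in base R, which may help clarify what each column means. This is only a sketch and not the onehot function itself; it assumes the category levels are taken from the full toy data (apple, banana, orange, pear) and that the training rows are orange, pear, and apple, as in the split shown earlier.

```r
# Levels from the full toy data, so a category that is absent from the
# training rows ("banana") still gets its own column.
fruit_levels <- c("apple", "banana", "orange", "pear")
train_fruits <- factor(c("orange", "pear", "apple"), levels = fruit_levels)

# One indicator column per level: 1 where the row has that level, else 0.
encoded <- sapply(fruit_levels, function(lv) as.integer(train_fruits == lv))
encoded
```

Each row contains exactly one 1, in the column of that row's fruit, and the banana column is all zeros because no training row is a banana.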
Now we’re left with only our numeric variables. We usually want to scale numeric variables to improve model performance, for example for distance-based algorithms such as k-Nearest Neighbours. The scaler function in our package does this for you:
scaler(numeric_variables_train, numeric_variables_valid, numeric_variables_test, c("count", "price"))$X_train
#> count price
#> 5 1.0910895 1.0664622
#> 2 -0.8728716 -0.1498335
#> 3 -0.2182179 -0.9166287
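The training output above is consistent with z-score standardization: subtract the column mean and divide by the sample standard deviation. The base R sketch below reproduces those numbers with `scale()`. It is not the scaler implementation, and whether scaler reuses the training statistics for the validation and test sets is an assumption made here for illustration.

```r
# Numeric columns of the three training rows shown earlier.
train_num <- data.frame(count = c(5L, 2L, 3L), price = c(8, 3.4, 0.5))

# scale() centers each column on its mean and divides by its sample
# standard deviation, returning a matrix.
scaled_train <- scale(train_num)
scaled_train

# Keep the training statistics so other data can be put on the same scale.
centers <- attr(scaled_train, "scaled:center")
sds <- attr(scaled_train, "scaled:scale")

# A toy-data row that is not in the training split (count = 1, price = 2),
# scaled with the training mean and standard deviation.
held_out <- data.frame(count = 1L, price = 2)
scaled_held_out <- scale(held_out, center = centers, scale = sds)
```

Reusing the training mean and standard deviation for held-out rows avoids leaking information from the validation or test data into the preprocessing step.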