Before you can perform any statistical analysis or machine learning, a dataframe usually needs several standard pre-processing steps. These include:

- categorizing columns as numeric or categorical
- imputing missing (NA) values
- transforming features (one-hot or ordinal encoding, scaling)
- selecting the most important features
The laundRy package aims to take care of these mundane but important tasks for you, so you can spend more of your time and energy on the fun stuff! Let us do your dirty laundRy.
You have a dataset that you want to pass to a machine learning algorithm. You have already separated your target from your features and split the data into training and test sets. You know the dataset has NA values and features of varying types, and it contains more features than you’d like to train on. This is going to be a headache!
Certain transformations are valid only for certain column types: for example, one-hot encoding (OHE) applies to categorical features, while scaling applies to numeric features.
categorize() lets you pass in your entire dataframe and returns lists of the categorical and numeric features, which you can pass into a transformation pipeline.
df <- data.frame(
# numeric, 5 unique values
a = c(1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1),
# numeric, 11 unique values
b = c(1.2, 3.4, 3.0, 4.9, 5.3, 6.1, 8.8, 9.4, 10.4, 1.3, 0.0),
# factor, 10 unique values
c = c('A','B','C','D','E','F','G','H','I','J','B'))
# categorize with default max_cat (10)
categorize(df)
#> $numeric
#> [1] "b"
#>
#> $categorical
#> [1] "c" "a"
Next, let’s fill the NA values in your dataframe. This is like finding those missing socks! You can specify how numeric features are imputed (mean or median); categorical NA values are imputed with the mode of the feature. The imputed values are calculated from the training set and applied to NA values in both the training and test sets. The training and test sets are passed in, and filled training and test sets are returned.
df_train <- data.frame(a = c(1.5, 2.5, NA, 4.5, 5.5),
b = c(1, 2, 2, 2, NA))
df_test <- data.frame(a = c(6.5, NA, 0),
b = c(0, 1, NA))
fill_missing(df_train, df_test,
             list(numeric = c('a'), categorical = c('b')),
             "mean", "mode")
#> $x_train
#> a b
#> 1 1.5 1
#> 2 2.5 2
#> 3 3.5 2
#> 4 4.5 2
#> 5 5.5 2
#>
#> $x_test
#> a b
#> 1 6.5 0
#> 2 3.5 1
#> 3 0.0 2
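The key detail is that the fill values are computed from the training set only and then reused on the test set. Here is that logic done by hand (a sketch for illustration, not laundRy’s internals):

# Sketch: compute fill values from the training set only, then apply
# them to both sets (illustration, not laundRy's internals)
fill_a <- mean(df_train$a, na.rm = TRUE)                   # 3.5, train mean
mode_b <- as.numeric(names(which.max(table(df_train$b))))  # 2, train mode

df_train$a[is.na(df_train$a)] <- fill_a
df_test$a[is.na(df_test$a)] <- fill_a
df_train$b[is.na(df_train$b)] <- mode_b
df_test$b[is.na(df_test$b)] <- mode_b

Notice that the 3.5 imputed into the test set comes from the training-set mean, matching the fill_missing() output above.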
Now that you have parsed your dataframe into categorical and numeric features and filled the missing values, column_transformer() is a one-stop shop for applying common transformations to features by their column types. Categorical columns may be one-hot encoded or ordinal encoded, and a standard scaler or min-max scaler may be applied to numeric columns. Transformations are fit on the training set and applied to both the training and test sets: the training and test sets are passed in, and transformed training and test sets are returned.
library(laundRy)
df_train <- data.frame(a = c(1, 2, 3),
b = c(1.2, 3.4, 3.0),
c = c('A','B','C'))
df_test <- data.frame(a = c(6, 2),
b = c(0.5, 9.2),
c = c('B', 'B'))
# One-hot-encode column 'c', scale columns 'a' and 'b'
column_transformer(df_train, df_test, list(numeric = c('a', 'b'), categorical = c('c')))
#> $x_train
#> a b c.B c.C
#> 1 -1 -1.1377602 0 0
#> 2 0 0.7395442 1 0
#> 3 1 0.3982161 0 1
#>
#> $x_test
#> a b c.B c.C
#> 1 4 -1.735084 1 0
#> 2 0 5.688801 1 0
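To make the “fit on train, apply to both” behaviour concrete, here is what the standard scaling of column b looks like by hand (an illustrative sketch; column_transformer() does all of this for you):

# Sketch: standard-scale column 'b' with training-set statistics
mu <- mean(df_train$b)   # fit on the training set only
sigma <- sd(df_train$b)
(df_train$b - mu) / sigma
#> [1] -1.1377602  0.7395442  0.3982161
(df_test$b - mu) / sigma  # the test set reuses the training mu and sigma
#> [1] -1.735084  5.688801

# One-hot encoding can be sketched with model.matrix(), fixing the factor
# levels to those seen in training so the dummy columns line up
train_levels <- sort(unique(df_train$c))
model.matrix(~ c, data.frame(c = factor(df_test$c, levels = train_levels)))[, -1]
#>   cB cC
#> 1  1  0
#> 2  1  0

The scaled values match the column_transformer() output above because the test set is transformed with the training set’s mean and standard deviation.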
Finally, you have your filled and transformed training and test datasets. Your dataset still contains more features than you’d like to use, since you know that some features are better predictors than others. You can use feature_selection() to obtain a specified number of the most important features, which you can then use to subset your feature datasets before passing them to a machine learning algorithm. The method used is recursive feature elimination (RFE), a backward selection process: the model is first fit using all predictors, and the least important predictors are then iteratively eliminated based on a ranking of variable importance. For more details, please refer to this link.
df <- data.frame(a = c(4, 5, 6),
b = c(1.2, 2.2, 3.2),
c = c(10.4, 0.02, 5.4))
target <- c(1, 2, 3)
# Choose the top 2 most important features of df to predict `target`
feature_selection(df, target, mode = 'regression', n_features = 2)
#> [1] a b
#> Levels: a b c
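To see the backward-selection process itself, here is a toy sketch of recursive feature elimination that ranks predictors by the absolute t-statistic of a linear model (the ranking method is an assumption for illustration; laundRy’s internals may differ):

# Toy sketch of recursive feature elimination (illustration only;
# assumes a regression problem with more rows than predictors)
rfe_sketch <- function(X, y, n_features) {
  feats <- names(X)
  while (length(feats) > n_features) {
    fit <- lm(y ~ ., data = cbind(X[feats], y = y))
    # rank importance by absolute t-statistic, intercept excluded
    imp <- abs(summary(fit)$coefficients[-1, "t value"])
    # drop the least important predictor, then refit on the rest
    feats <- setdiff(feats, names(which.min(imp)))
  }
  feats
}

Each pass fits the model on all remaining predictors and eliminates the lowest-ranked one, which is exactly the backward-selection idea described above.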
There it is: you have categorized your columns, imputed missing values, transformed your features, and determined which ones are most relevant to your problem. The laundRy is done; it’s time to put on your favourite dress and go dance!
library(laundRy)
X_train <- data.frame(a = c(1, 2, NA, 4, 5, 1, 2, 3, 4, 5),
b = c(1.2, 3.4, 3.0, 4.9, 5.3, 6.1, 8.8, 9.4, NA, 1.4),
c = c(1, 2, 3, 4, 5, 6, 7, 8, 9, NA))
X_test <- data.frame(a = c(3, NA, 2),
b = c(0.5, 9.2, NA),
c = c(NA, 2, 3))
y_train <- c(1, 2, 3, 4, 5, 1, 2, 3, 4, 5)
# Categorize columns
col_list <- categorize(X_train, max_cat = 9)
# Fill missing values
filled_list <- fill_missing(X_train, X_test, col_list, num_imp = 'mean', cat_imp = 'mode')
X_train_filled <- filled_list$x_train
X_test_filled <- filled_list$x_test
# Transform data
transformed_list <- column_transformer(X_train_filled, X_test_filled, col_list)
X_train_transformed <- transformed_list$x_train
X_test_transformed <- transformed_list$x_test
# Select top 2 features
cols <- feature_selection(X_train_transformed, y_train, n_features = 2, mode = 'regression')
X_train <- X_train_transformed[cols]
X_test <- X_test_transformed[cols]
X_train
#> a.3 a.2
#> 1 0 0
#> 2 0 1
#> 3 0 0
#> 4 0 0
#> 5 0 0
#> 6 0 0
#> 7 0 1
#> 8 1 0
#> 9 0 0
#> 10 0 0
X_test
#> a.3 a.2
#> 1 1 0
#> 2 0 0
#> 3 0 1