The make_recipe() function quickly applies common data preprocessing techniques.
make_recipe(X, y, recipe, splits_to_return = "train_test", random_seed = NULL, train_valid_prop = 0.8)
Argument | Description |
---|---|
X | A dataframe containing the training, validation, and testing data. It should include both the feature columns and the response column named by y. |
y | The name of the response column (as a string, e.g. "response_variable"). |
recipe | A string specifying which recipe to apply to the data. See "The recipe parameter" section below for details. |
splits_to_return | A string specifying which splits to return: "train_test" returns train and test splits, "train_test_valid" returns train, validation, and test splits, and "train" returns all data without splitting. By default "train_test". |
random_seed | An integer. The random seed used when splitting the data, so that results are reproducible. By default NULL. |
train_valid_prop | A float. The proportion used to split the data. Should be between 0 and 1. By default 0.8. |
A list of dataframes, e.g. list(X_train, X_valid, X_test, y_train, y_valid, y_test).
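The role of random_seed and train_valid_prop can be illustrated with a minimal base R sketch of a seeded train/test split. The helper split_train_test() is hypothetical and not part of the package. It assumes the training size is the row count times the proportion, rounded to the nearest integer. That assumption is consistent with the 26/6 split of the 32-row mtcars data in the example on this page, but this documentation does not confirm it.

```r
# Hypothetical helper (not the package's implementation):
# a seeded random train/test split of a data frame.
split_train_test <- function(df, prop = 0.8, seed = NULL) {
  stopifnot(is.data.frame(df), prop > 0, prop < 1)

  # Fix the RNG state only when a seed is supplied, mirroring random_seed = NULL.
  if (!is.null(seed)) set.seed(seed)

  # Assumption: training size is nrow * prop, rounded to the nearest integer.
  n_train <- round(nrow(df) * prop)
  all_idx <- seq_len(nrow(df))
  train_idx <- sample(all_idx, size = n_train)
  test_idx <- setdiff(all_idx, train_idx)

  list(
    train = df[train_idx, , drop = FALSE],
    test = df[test_idx, , drop = FALSE]
  )
}

splits <- split_train_test(mtcars, prop = 0.8, seed = 123)
nrow(splits$train) # 26 rows, since round(32 * 0.8) is 26
nrow(splits$test)  # the remaining 6 rows
```

Calling the helper twice with the same seed yields identical splits, which is the reproducibility that a non-NULL random_seed is meant to provide.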
The following recipes are currently available to pass into the recipe parameter:
"ohe_and_standard_scaler" - Applies one-hot encoding to categorical features and a standard scaler to numeric features.
More recipes are under development and will be released in future updates.
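To make the "ohe_and_standard_scaler" description concrete, here is a minimal base R sketch of the two transformations it names. The helper ohe_and_scale() is hypothetical and is not the package's implementation. For simplicity it scales the whole data frame at once; whether make_recipe() fits the scaling parameters on the training split only is not stated here.

```r
# Hypothetical helper (not the package's implementation):
# standardize numeric columns and one-hot encode factor columns.
ohe_and_scale <- function(df) {
  is_num <- vapply(df, is.numeric, logical(1))
  is_fct <- vapply(df, is.factor, logical(1))

  # Standard scaling: subtract the column mean, divide by the column sd.
  scaled <- as.data.frame(scale(df[, is_num, drop = FALSE]))

  # One-hot encoding: one 0/1 indicator column per factor level,
  # named <column>_<level> as in the example output on this page.
  encoded <- lapply(names(df)[is_fct], function(col) {
    lv <- levels(df[[col]])
    ind <- outer(as.character(df[[col]]), lv, "==") * 1
    colnames(ind) <- paste(col, lv, sep = "_")
    as.data.frame(ind)
  })

  do.call(cbind, c(list(scaled), encoded))
}

demo <- data.frame(
  wt = c(2, 3, 4),
  am = factor(c(0, 1, 0))
)
out <- ohe_and_scale(demo)
out
# wt becomes -1, 0, 1 (mean 3, sd 1); am becomes columns am_0 and am_1
```

The factor column am is replaced by one indicator column per level, so the output has more columns than the input. This is the same reason the 11-column mtcars example grows to 17 feature columns once the response is removed.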
# apply "ohe_and_standard_scaler" on training and testing data
X_example <- dplyr::as_tibble(mtcars) %>%
  dplyr::mutate(
    carb = as.factor(carb),
    gear = as.factor(gear),
    vs = as.factor(vs),
    am = as.factor(am)
  )
y_example <- "gear"
make_recipe(X = X_example, y = y_example, recipe = "ohe_and_standard_scaler", splits_to_return = "train_test")
#> $X_train
#> # A tibble: 26 x 17
#>       mpg     cyl    disp     hp   drat      wt   qsec  vs_0  vs_1  am_0  am_1
#>     <dbl>   <dbl>   <dbl>  <dbl>  <dbl>   <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl>
#>  1  0.161 -0.0863  -0.578 -0.555  0.730  -0.425 -0.523     1     0     0     1
#>  2  0.461   -1.21  -0.984 -0.799  0.624  -0.981  0.314     0     1     0     1
#>  3  0.228 -0.0863   0.186 -0.555 -0.999 -0.0848  0.751     0     1     1     0
#>  4 -0.221    1.04   0.982  0.377 -0.851   0.141 -0.523     1     0     1     0
#>  5 -0.321 -0.0863 -0.0712 -0.627  -1.67   0.161   1.16     0     1     1     0
#>  6  0.727   -1.21  -0.682  -1.24  0.287  -0.110   1.05     0     1     1     0
#>  7  0.461   -1.21  -0.728 -0.770  0.772  -0.150   2.57     0     1     1     0
#>  8 -0.138 -0.0863  -0.519 -0.369  0.772   0.141  0.151     0     1     1     0
#>  9 -0.371 -0.0863  -0.519 -0.369  0.772   0.141  0.467     0     1     1     0
#> 10 -0.604    1.04   0.325  0.449  -1.02   0.771 -0.323     1     0     1     0
#> # … with 16 more rows, and 6 more variables: carb_1 <dbl>, carb_2 <dbl>,
#> #   carb_3 <dbl>, carb_4 <dbl>, carb_6 <dbl>, carb_8 <dbl>
#>
#> $X_valid
#> # A tibble: 0 x 0
#>
#> $X_test
#> # A tibble: 6 x 17
#>      mpg     cyl   disp     hp   drat     wt     qsec  vs_0  vs_1  am_0  am_1
#>    <dbl>   <dbl>  <dbl>  <dbl>  <dbl>  <dbl>    <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1  0.161 -0.0863 -0.578 -0.555  0.730 -0.681   -0.818     1     0     0     1
#> 2 -0.954    1.04  0.982   1.38 -0.725  0.271    -1.14     1     0     1     0
#> 3 -0.804    1.04  0.325  0.449  -1.02  0.481 -0.00688     1     0     1     0
#> 4   1.73   -1.21  -1.24  -1.39   2.90  -1.69    0.267     0     1     0     1
#> 5 -0.804    1.04  0.545 0.0188 -0.851  0.135   -0.375     1     0     1     0
#> 6  0.993   -1.21 -0.888 -0.828   1.85  -1.16   -0.691     1     0     0     1
#> # … with 6 more variables: carb_1 <dbl>, carb_2 <dbl>, carb_3 <dbl>,
#> #   carb_4 <dbl>, carb_6 <dbl>, carb_8 <dbl>
#>
#> $y_train
#> # A tibble: 26 x 1
#>    gear
#>    <fct>
#>  1 4
#>  2 4
#>  3 3
#>  4 3
#>  5 3
#>  6 4
#>  7 4
#>  8 4
#>  9 4
#> 10 3
#> # … with 16 more rows
#>
#> $y_valid
#> # A tibble: 0 x 0
#>
#> $y_test
#> # A tibble: 6 x 1
#>   gear
#>   <fct>
#> 1 4
#> 2 3
#> 3 3
#> 4 4
#> 5 3
#> 6 5
#>