Rmleda-package-usage.Rmd
This vignette describes the use of the supervised_data(), dftype(), autoimpute_na(), and dfscaling() functions included in the Rmleda package. The Rmleda package supports preliminary exploratory data analysis (EDA) of a dataset by handling common data preparation and wrangling tasks: data splitting, exploration, imputation, and scaling. These tasks were identified as commonly performed in supervised machine learning settings, but they may provide value in other project types as well.
If you do not have the devtools package, you can install it from CRAN:
install.packages("devtools")
Then install Rmleda from GitHub as follows:
devtools::install_github("UBC-MDS/Rmleda")
Lastly, load the package:
library(Rmleda)
To explore the functionality of Rmleda, we will use the following fictional dataset on chocolates!
toy_df <- data.frame(
  chocolate_brand = c('Lindt', 'Rakhat', 'Lindt', 'Richart', 'Not Available'),
  price = c(3, NA, 4, 6, 3),
  type = c("dark", "dark", "white", "white", "dark")
)
As a first step, you can use the dftype() function to take a look at data type information, summary statistics (for numeric columns), and unique values (for non-numeric columns).
The function returns a list of results. The first element (summary) is a summary of each column. The second element (unique_df) is a data frame that lists, for each non-numeric column, its unique values and the number of unique values.
dftype(toy_df)
#> $summary
#>  chocolate_brand        price         type
#>  Length:5           Min.   :3.0   Length:5
#>  Class :character   1st Qu.:3.0   Class :character
#>  Mode  :character   Median :3.5   Mode  :character
#>                     Mean   :4.0
#>                     3rd Qu.:4.5
#>                     Max.   :6.0
#>                     NA's   :1
#>
#> $unique_df
#>       column_name                      unique_values num_unique_values
#> 1 chocolate_brand Lindt,Rakhat,Richart,Not Available                 4
#> 2            type                         dark,white                 2
We can access summary separately:
dftype(toy_df)$summary
#>  chocolate_brand        price         type
#>  Length:5           Min.   :3.0   Length:5
#>  Class :character   1st Qu.:3.0   Class :character
#>  Mode  :character   Median :3.5   Mode  :character
#>                     Mean   :4.0
#>                     3rd Qu.:4.5
#>                     Max.   :6.0
#>                     NA's   :1
And likewise unique_df:
dftype(toy_df)$unique_df
#>       column_name                      unique_values num_unique_values
#> 1 chocolate_brand Lindt,Rakhat,Richart,Not Available                 4
#> 2            type                         dark,white                 2
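To make the structure of this result more concrete, here is a minimal base R sketch that builds a similar list. The helper name sketch_dftype() is hypothetical and is not part of Rmleda; the package's actual implementation may differ.

```r
# Hypothetical helper (not part of Rmleda): build a dftype()-like result in base R.
sketch_dftype <- function(df) {
  # Names of the columns that are not numeric
  non_numeric <- names(df)[!vapply(df, is.numeric, logical(1))]

  unique_df <- data.frame(
    column_name = non_numeric,
    # Comma-separated unique values, in order of first appearance
    unique_values = vapply(
      non_numeric,
      function(col) paste(unique(as.character(df[[col]])), collapse = ","),
      character(1),
      USE.NAMES = FALSE
    ),
    # Number of distinct values in each non-numeric column
    num_unique_values = vapply(
      non_numeric,
      function(col) length(unique(df[[col]])),
      integer(1),
      USE.NAMES = FALSE
    ),
    row.names = NULL,
    stringsAsFactors = FALSE
  )

  list(summary = summary(df), unique_df = unique_df)
}

toy_df <- data.frame(
  chocolate_brand = c("Lindt", "Rakhat", "Lindt", "Richart", "Not Available"),
  price = c(3, NA, 4, 6, 3),
  type = c("dark", "dark", "white", "white", "dark")
)

result <- sketch_dftype(toy_df)
result$unique_df
```

Returning a named list keeps the two views separate, so each one can be retrieved with $ in the same way as the dftype() result.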
Our toy dataset seemingly contains missing values in the chocolate_brand and price columns.
To impute missing values in toy_df, we can use autoimpute_na(). This function identifies and imputes missing values in a given dataframe based on column type: it fills missing values with the mean for numeric columns and with the most frequent value for non-numeric columns.
toy_df
#>   chocolate_brand price  type
#> 1           Lindt     3  dark
#> 2          Rakhat    NA  dark
#> 3           Lindt     4 white
#> 4         Richart     6 white
#> 5   Not Available     3  dark
(toy_df <- autoimpute_na(toy_df))
#>   chocolate_brand price  type
#> 1           Lindt     3  dark
#> 2          Rakhat     4  dark
#> 3           Lindt     4 white
#> 4         Richart     6 white
#> 5           Lindt     3  dark
As the results above show, autoimpute_na() also detects some common non-standard missing values entered manually by users (e.g., “not available”, “n/a”, “na”, “-”) when identifying and imputing missing data. The function returns a dataframe with the imputed values.
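The imputation rule described above can be sketched in base R as follows. The helper sketch_impute() and its na_strings argument are hypothetical names used only for illustration. The case-insensitive matching and the tie-breaking rule (first value encountered) are assumptions, not documented Rmleda behaviour.

```r
# Hypothetical helper (not part of Rmleda): mean/mode imputation in base R.
sketch_impute <- function(df,
                          na_strings = c("not available", "n/a", "na", "-")) {
  for (col in names(df)) {
    values <- df[[col]]
    if (is.numeric(values)) {
      # Numeric columns: replace NA with the mean of the observed values
      values[is.na(values)] <- mean(values, na.rm = TRUE)
    } else {
      values <- as.character(values)
      # Assumption: placeholders are matched case-insensitively after trimming
      values[tolower(trimws(values)) %in% na_strings] <- NA
      observed <- values[!is.na(values)]
      candidates <- unique(observed)
      if (length(candidates) > 0) {
        # Most frequent observed value; ties go to the first one encountered
        counts <- tabulate(match(observed, candidates))
        values[is.na(values)] <- candidates[which.max(counts)]
      }
    }
    df[[col]] <- values
  }
  df
}

toy_df <- data.frame(
  chocolate_brand = c("Lindt", "Rakhat", "Lindt", "Richart", "Not Available"),
  price = c(3, NA, 4, 6, 3),
  type = c("dark", "dark", "white", "white", "dark")
)

imputed <- sketch_impute(toy_df)
imputed
```

On the toy dataset, the missing price becomes 4 (the mean of 3, 4, 6, and 3) and the "Not Available" brand becomes "Lindt", matching the autoimpute_na() output shown earlier.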
Scaling numeric features is an important and frequently applied step in a supervised machine learning workflow. dfscaling() can automatically identify and scale all numeric features in the dataset.
The function takes two arguments: the input dataframe and the name of the target or label column for the supervised machine learning task. In the toy_df dataset, type is the target column:
dfscaling(toy_df, type)
#> Warning in FUN(newX[, i], ...): NAs introduced by coercion
#> Warning in FUN(newX[, i], ...): NAs introduced by coercion
#> # A tibble: 5 x 3
#>   chocolate_brand  price type
#>   <fct>            <dbl> <fct>
#> 1 Lindt           -0.816 dark
#> 2 Rakhat           0     dark
#> 3 Lindt            0     white
#> 4 Richart          1.63  white
#> 5 Lindt           -0.816 dark
As seen above, the function applies standard scaling and centering: after the transformation, each numeric column has a mean of 0 and a standard deviation of 1. Columns with zero variance are excluded before the transformation is applied.
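The scaled values can be reproduced by hand: subtract the column mean and divide by the sample standard deviation. The sketch below does this in base R. The helper sketch_scale() is a hypothetical name, not part of Rmleda, and leaving zero-variance columns unchanged is an assumption about what "excluded" means here.

```r
# Hypothetical helper (not part of Rmleda): standardise numeric columns in base R.
sketch_scale <- function(df) {
  for (col in names(df)) {
    values <- df[[col]]
    if (!is.numeric(values)) next
    col_sd <- sd(values, na.rm = TRUE)
    # Assumption: columns with zero (or undefined) standard deviation are left as-is
    if (isTRUE(col_sd > 0)) {
      df[[col]] <- (values - mean(values, na.rm = TRUE)) / col_sd
    }
  }
  df
}

# The toy dataset after imputation
toy_df <- data.frame(
  chocolate_brand = c("Lindt", "Rakhat", "Lindt", "Richart", "Lindt"),
  price = c(3, 4, 4, 6, 3),
  type = c("dark", "dark", "white", "white", "dark")
)

scaled <- sketch_scale(toy_df)
round(scaled$price, 3)
#> [1] -0.816  0.000  0.000  1.633 -0.816
```

Here the mean price is 4 and the sample standard deviation is sqrt(1.5), roughly 1.225, so a price of 3 maps to about -0.816 and a price of 6 to about 1.633.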
After performing the above pre-processing tasks, we are ready to move on to data splitting. Instead of manually creating variables to hold train_set, test_set, x_train, y_train, x_test and y_test, we can simply use the supervised_data() function. We pass it toy_df along with the names of the x feature columns and the name of the y target column, as shown below.
The function automatically calls initial_split() from the tidymodels ecosystem (the function is provided by the rsample package) with the default arguments and returns easy-access variables for the train portion of the dataset (train), the test portion of the dataset (test), the train portion containing x features only (x_train), the train portion containing y targets only (y_train), the test portion containing x features only (x_test), and the test portion containing y targets only (y_test). You can also access the original dataset from the supervised_data object using the variable name data.
super_data <-
  supervised_data(toy_df,
                  xcols = c('chocolate_brand', 'type'),
                  ycol = c('price'))
For example, we can quickly access the train portion:
super_data$train
#>   chocolate_brand price  type
#> 1           Lindt     3  dark
#> 2          Rakhat     4  dark
#> 3           Lindt     4 white
#> 4         Richart     6 white
and the train portion with just the x feature columns:
super_data$x_train
#>   chocolate_brand  type
#> 1           Lindt  dark
#> 2          Rakhat  dark
#> 3           Lindt white
#> 4         Richart white
Similarly, we can access the other subsets of the data:
super_data$data # original dataset
super_data$train # train portion of the dataset
super_data$test # test portion of the dataset
super_data$x_train # train portion of the dataset containing `x` features only
super_data$y_train # train portion of the dataset containing `y` targets only
super_data$x_test # test portion of the dataset containing `x` features only
super_data$y_test # test portion of the dataset containing `y` targets only
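To illustrate how these pieces relate to one another, here is a rough base R sketch that produces a list with the same element names. The helper sketch_supervised_data() is hypothetical, and it uses sample.int() rather than initial_split(). The default proportion of 3/4 matches the default of initial_split(); rounding the training size up is an assumption chosen to agree with the outputs shown in this vignette.

```r
# Hypothetical helper (not part of Rmleda): split a data frame in base R.
sketch_supervised_data <- function(df, xcols, ycol, prop = 3 / 4) {
  n <- nrow(df)
  # Assumption: a fractional training size is rounded up
  n_train <- ceiling(n * prop)
  # Logical mask marking the randomly chosen training rows
  in_train <- seq_len(n) %in% sample.int(n, n_train)

  train <- df[in_train, , drop = FALSE]
  test <- df[!in_train, , drop = FALSE]

  list(
    data = df,
    train = train,
    test = test,
    x_train = train[, xcols, drop = FALSE],
    y_train = train[, ycol, drop = FALSE],
    x_test = test[, xcols, drop = FALSE],
    y_test = test[, ycol, drop = FALSE]
  )
}

# The toy dataset after imputation
toy_df <- data.frame(
  chocolate_brand = c("Lindt", "Rakhat", "Lindt", "Richart", "Lindt"),
  price = c(3, 4, 4, 6, 3),
  type = c("dark", "dark", "white", "white", "dark")
)

set.seed(123) # make the random split reproducible
sketch <- sketch_supervised_data(
  toy_df,
  xcols = c("chocolate_brand", "type"),
  ycol = "price"
)
nrow(sketch$train) # 4 of the 5 rows
nrow(sketch$test)  # the remaining row
```

Using drop = FALSE keeps every subset a data frame, even when only one column (such as the y target) is selected.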
Finally, we can customize the split by passing extra arguments to supervised_data(). These are passed on to initial_split() from the tidymodels package. See the initial_split() documentation for more information.
super_data <-
  supervised_data(
    toy_df,
    xcols = c('chocolate_brand', 'type'),
    ycol = c('price'),
    prop = 1 / 2 # Change split ratio to 0.5
  )
And access the subsets in the same manner as described previously:
# Note: for datasets with sizes that result in non-integer splits,
# the training portion size will be rounded up and the test portion size will
# be rounded down, as per the initial_split() implementation from tidymodels
super_data$train
#>   chocolate_brand price type
#> 1           Lindt     3 dark
#> 2          Rakhat     4 dark
#> 5           Lindt     3 dark
super_data$test
#>   chocolate_brand price  type
#> 3           Lindt     4 white
#> 4         Richart     6 white