Overview

This vignette describes the use of the supervised_data(), dftype(), autoimpute_na(), and dfscaling() functions included in the Rmleda package. Rmleda supports preliminary EDA on a given dataset by performing common data preparation and wrangling tasks such as data splitting, exploration, imputation, and scaling. These functionalities were identified as commonly performed tasks in supervised machine learning settings but may provide value in other project types as well.

Installation

If you do not have the devtools package, you can install it from CRAN:

install.packages("devtools")

Then install Rmleda from GitHub as follows:

devtools::install_github("UBC-MDS/Rmleda")

Lastly, load the package:

library(Rmleda)

Dataset

To explore the functionality of Rmleda, we will use the following fictional dataset on chocolates!

toy_df <- data.frame(
  chocolate_brand = c('Lindt', 'Rakhat', 'Lindt', 'Richart', 'Not Available'),
  price = c(3, NA, 4, 6, 3),
  type = c("dark", "dark", "white", "white", "dark")
)

Data type exploration

As a first step, you can use the dftype() function to take a look at data type information, summary statistics (for numeric columns), and unique values (for non-numeric columns).

The function returns a list with two elements. The first (summary) is a summary of each column. The second (unique_df) is a data frame listing, for each non-numeric column, its unique values and their count.

dftype(toy_df)
#> $summary
#>  chocolate_brand        price         type          
#>  Length:5           Min.   :3.0   Length:5          
#>  Class :character   1st Qu.:3.0   Class :character  
#>  Mode  :character   Median :3.5   Mode  :character  
#>                     Mean   :4.0                     
#>                     3rd Qu.:4.5                     
#>                     Max.   :6.0                     
#>                     NA's   :1                       
#> 
#> $unique_df
#>       column_name                      unique_values num_unique_values
#> 1 chocolate_brand Lindt,Rakhat,Richart,Not Available                 4
#> 2            type                         dark,white                 2

We can access summary separately:

dftype(toy_df)$summary
#>  chocolate_brand        price         type          
#>  Length:5           Min.   :3.0   Length:5          
#>  Class :character   1st Qu.:3.0   Class :character  
#>  Mode  :character   Median :3.5   Mode  :character  
#>                     Mean   :4.0                     
#>                     3rd Qu.:4.5                     
#>                     Max.   :6.0                     
#>                     NA's   :1

And likewise unique_df:

dftype(toy_df)$unique_df
#>       column_name                      unique_values num_unique_values
#> 1 chocolate_brand Lindt,Rakhat,Richart,Not Available                 4
#> 2            type                         dark,white                 2
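The unique_df summary is essentially what one gets by tabulating unique values per non-numeric column. A base-R sketch of the same idea (a simplified illustration, not the dftype() implementation):

```r
# Base-R sketch of the unique_df summary (not the dftype() implementation)
df <- data.frame(
  brand = c("Lindt", "Rakhat", "Lindt"),
  type  = c("dark", "dark", "white")
)
non_num <- names(df)[!sapply(df, is.numeric)]
unique_df <- data.frame(
  column_name       = non_num,
  unique_values     = vapply(non_num, function(cl)
    paste(unique(df[[cl]]), collapse = ","), character(1), USE.NAMES = FALSE),
  num_unique_values = vapply(non_num, function(cl)
    length(unique(df[[cl]])), integer(1), USE.NAMES = FALSE)
)
unique_df
```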

Identifying and imputing missing values

Our toy dataset contains missing values in the chocolate_brand and price columns.

To impute missing values in the toy_df we can make use of autoimpute_na(). This function identifies and imputes missing values in a given dataframe based on the types of the columns, i.e. the function fills missing values with the mean for numeric columns and the most frequent value for non-numeric columns.

toy_df
#>   chocolate_brand price  type
#> 1           Lindt     3  dark
#> 2          Rakhat    NA  dark
#> 3           Lindt     4 white
#> 4         Richart     6 white
#> 5   Not Available     3  dark

(toy_df <- autoimpute_na(toy_df))
#>   chocolate_brand price  type
#> 1           Lindt     3  dark
#> 2          Rakhat     4  dark
#> 3           Lindt     4 white
#> 4         Richart     6 white
#> 5           Lindt     3  dark

As the results above show, the autoimpute_na() function also detects common non-standard missing values manually entered by users (e.g., “not available”, “n/a”, “na”, “-”) while identifying and imputing missing data. The output of autoimpute_na() is a dataframe with the missing values imputed.
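The type-aware imputation idea can be sketched in base R. Note this simplified sketch (the hypothetical impute_sketch() helper below, not the package implementation) only handles standard NA values, whereas autoimpute_na() also recognizes the non-standard representations listed above:

```r
# Simplified sketch of type-aware imputation (not the Rmleda implementation):
# numeric columns get the column mean, non-numeric columns get the mode.
impute_sketch <- function(df) {
  for (col in names(df)) {
    if (is.numeric(df[[col]])) {
      df[[col]][is.na(df[[col]])] <- mean(df[[col]], na.rm = TRUE)
    } else {
      freq <- table(df[[col]])  # table() drops NAs by default
      df[[col]][is.na(df[[col]])] <- names(freq)[which.max(freq)]
    }
  }
  df
}

out <- impute_sketch(data.frame(x = c(1, NA, 3), y = c("a", NA, "a")))
out
```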

Scaling and centering numeric columns

Scaling numeric features is an important and frequently applied step in a supervised machine learning workflow. dfscaling() can automatically identify and scale all numeric features in the dataset.

The function takes two arguments: the input dataframe and the name of the target or label column for the supervised machine learning task. In the toy_df dataset, type is the target column:

dfscaling(toy_df, type)
#> Warning in FUN(newX[, i], ...): NAs introduced by coercion

#> Warning in FUN(newX[, i], ...): NAs introduced by coercion
#> # A tibble: 5 x 3
#>   chocolate_brand  price type 
#>   <fct>            <dbl> <fct>
#> 1 Lindt           -0.816 dark 
#> 2 Rakhat           0     dark 
#> 3 Lindt            0     white
#> 4 Richart          1.63  white
#> 5 Lindt           -0.816 dark

As seen above, the function applies standard scaling and centering. Each of the numeric columns will have a mean of 0 and standard deviation of 1 after the transformation. All columns with zero-variance are excluded prior to applying this transformation.
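The transformation itself can be expressed with base R's scale(), which subtracts the column mean and divides by the standard deviation (a sketch of the scaling step only, not the dfscaling() implementation):

```r
# Standard scaling by hand: subtract the mean, divide by the standard
# deviation. This reproduces the scaled price column from above.
price  <- c(3, 4, 4, 6, 3)          # the imputed price column
scaled <- as.numeric(scale(price))  # equivalent to (price - mean(price)) / sd(price)
scaled
#> [1] -0.8164966  0.0000000  0.0000000  1.6329932 -0.8164966
```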

Data splitting

After performing the above pre-processing tasks, we are ready to move on to data splitting. Instead of manually creating variables to hold the train set, test set, x_train, y_train, x_test, and y_test, we can simply use the supervised_data() function. We pass it toy_df along with the names of the x feature columns and the name of the y target column, as below.

The function automatically calls initial_split() from the tidymodels package with the default arguments and returns easy-access variables: the train portion of the dataset (train), the test portion (test), the train portion containing X features only (x_train), the train portion containing y targets only (y_train), the test portion containing X features only (x_test), and the test portion containing y targets only (y_test). You can also access the original dataset from the supervised_data object using the variable name data.

super_data <-
  supervised_data(toy_df,
                  xcols = c('chocolate_brand', 'type'),
                  ycol = c('price'))

For example, we can quickly access the train portion:

super_data$train
#>   chocolate_brand price  type
#> 1           Lindt     3  dark
#> 2          Rakhat     4  dark
#> 3           Lindt     4 white
#> 4         Richart     6 white

and the same portion with just the x feature columns:

super_data$x_train
#>   chocolate_brand  type
#> 1           Lindt  dark
#> 2          Rakhat  dark
#> 3           Lindt white
#> 4         Richart white

Similarly, we can access the other subsets of the data:

super_data$data      # the original dataset
super_data$train     # train portion of the dataset
super_data$test      # test portion of the dataset
super_data$x_train   # train portion containing `x` features only
super_data$y_train   # train portion containing `y` targets only
super_data$x_test    # test portion containing `x` features only
super_data$y_test    # test portion containing `y` targets only
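A rough base-R sketch of the split that these variables expose (for illustration only; the package itself delegates to initial_split() from tidymodels rather than sample()):

```r
# Rough base-R sketch of a 75/25 train/test split and feature/target
# separation, mirroring what supervised_data() exposes.
df <- data.frame(
  chocolate_brand = c("Lindt", "Rakhat", "Lindt", "Richart"),
  price           = c(3, 4, 4, 6)
)
set.seed(2021)  # arbitrary seed, for reproducibility only
train_idx <- sample(nrow(df), ceiling(0.75 * nrow(df)))
train   <- df[train_idx, ]
test    <- df[-train_idx, ]
x_train <- train[, "chocolate_brand", drop = FALSE]  # features only
y_train <- train[["price"]]                          # targets only
```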

Finally, we can customize the split by passing extra arguments to supervised_data. These will be passed to initial_split() from the tidymodels package. See the documentation for more information.

super_data <-
  supervised_data(
    toy_df,
    xcols = c('chocolate_brand', 'type'),
    ycol = c('price'),
    prop = 1 / 2 # Change split ratio to 0.5
  ) 

And access the subsets in the same manner as described previously:

# Note: for datasets with sizes that result in non-integer splits,
# the training portion size is rounded up and the test portion size
# is rounded down, as per the initial_split() implementation from tidymodels
super_data$train
#>   chocolate_brand price type
#> 1           Lindt     3 dark
#> 2          Rakhat     4 dark
#> 5           Lindt     3 dark
super_data$test
#>   chocolate_brand price  type
#> 3           Lindt     4 white
#> 4         Richart     6 white