library(redahelper)

Introduction to redahelper

When conducting an exploratory data analysis, users must:

Identify outliers, and decide whether to remove them or not.
Generate plots to explore the relationships between variables.
Find correlations between variables via a correlation coefficient.
Identify and impute missing data.

The redahelper package aims to provide a more user-friendly experience through the following:

It identifies outliers in columns of a dataframe, and aims to make the outlier-removal-decision process more streamlined by providing information about the outliers and what % of the column contains outliers.
It automatically generates plots of specified variables and calculates correlation coefficients.
It identifies and imputes missing data with a user-selected method.

This document outliers redahelper’s toolkit, and provides examples on how to use them.

Data: airquality

To explore the various functions of the redahelper package, this vignette wil luse the airquality dataset that is part of the base-R datasets. It is documented in ?airquality.

summary(airquality)
#>      Ozone           Solar.R           Wind             Temp      
#>  Min.   :  1.00   Min.   :  7.0   Min.   : 1.700   Min.   :56.00  
#>  1st Qu.: 18.00   1st Qu.:115.8   1st Qu.: 7.400   1st Qu.:72.00  
#>  Median : 31.50   Median :205.0   Median : 9.700   Median :79.00  
#>  Mean   : 42.13   Mean   :185.9   Mean   : 9.958   Mean   :77.88  
#>  3rd Qu.: 63.25   3rd Qu.:258.8   3rd Qu.:11.500   3rd Qu.:85.00  
#>  Max.   :168.00   Max.   :334.0   Max.   :20.700   Max.   :97.00  
#>  NA's   :37       NA's   :7                                       
#>      Month            Day      
#>  Min.   :5.000   Min.   : 1.0  
#>  1st Qu.:6.000   1st Qu.: 8.0  
#>  Median :7.000   Median :16.0  
#>  Mean   :6.993   Mean   :15.8  
#>  3rd Qu.:8.000   3rd Qu.:23.0  
#>  Max.   :9.000   Max.   :31.0  
#> 

dim(airquality)
#> [1] 153   6

head(airquality)
#>   Ozone Solar.R Wind Temp Month Day
#> 1    41     190  7.4   67     5   1
#> 2    36     118  8.0   72     5   2
#> 3    12     149 12.6   74     5   3
#> 4    18     313 11.5   62     5   4
#> 5    NA      NA 14.3   56     5   5
#> 6    28      NA 14.9   66     5   6

Identifying Outliers with `fast_outliers_id()`

fast_outliers_id() allows the user to identify outliers in the data. The arguments are the dataframe (or tibble), the columns the user wants information on, the method of identifying outliers (either z-score or interquartile, and the threshold for evaluating outliers in categorical columns). The output is a dataframe summarizing the outliers in the dataframe, allowing the user to make the choice on how to proceed.

For example, finding outlier data with a method of z-score for all columns in the dataframe:

fast_outlier_id(data = airquality, cols = "ALL", method = "z-score", threshold_low_freq = 0.05)
#> # A tibble: 6 x 8
#>   column_name type  no_nans perc_nans outlier_method no_outliers perc_outliers
#>   <chr>       <chr>   <int>     <dbl> <list>               <int> <list>       
#> 1 Ozone       inte~      37      0.24 <chr [1]>                6 <dbl [1]>    
#> 2 Solar.R     inte~       7      0.05 <chr [1]>                0 <dbl [1]>    
#> 3 Wind        nume~       0      0    <chr [1]>                6 <dbl [1]>    
#> 4 Temp        inte~       0      0    <chr [1]>                7 <dbl [1]>    
#> 5 Month       inte~       0      0    <chr [1]>                0 <dbl [1]>    
#> 6 Day         inte~       0      0    <chr [1]>                0 <dbl [1]>    
#> # ... with 1 more variable: outlier_values <list>

Plotting variables with `fast_plot()`

fast_plot() allows a user to create an exploratory data analysis plot using two columns from the dataframe (or tibble) using ggplot2. The arguments are the dataframe (or tibble) of interest, the two columns of interest (x and y axes), and the type of plot to be generated, a choice from scatter, line, or bar. The output is a ggplot2 plot of the selected columns using the selected plot type. The function contains error handling to ensure the user is selecting an appropriate plot (e.g. will not allow for a bar chart when both x and y are non-numeric).

For example, plotting a scatter plot of Ozone and Temp:

fast_plot(df = airquality, x = "Ozone" , y = "Temp", plot_type = "scatter")

Another example, a bar plot of the Wind by Month:

fast_plot(df = airquality, x = "Month" , y = "Wind", plot_type = "bar")

Exploring correlations with `fast_corr()`

fast_corr() allows the user to view the Pearson correlation coefficient of selected variables in the dataframe or tibble. The arguments are the dataframe (or tibble) and the columns of interest. The output is a correlation matrix displaying the various correlations.

For example, assessing the correlations among Wind, Temp, and Month:

fast_corr(df = airquality, selected_columns = c("Wind", "Temp", "Month"))

Identify and impute missing data with `fast_missing_impute()`

fast_missing_impute() allows for a user to impute missing values from a column of the dataframe/tibble with a selected method. The arguments are the the dataframe(or tibble), the method of imputation (mean, median, mode, or remove to remove all rows with missing data in the selected columns), and the columns of interest. The output is a dataframe with the imputed values.

For example, imputing the missing values in Ozone and Solar.R with the median of the respective columns:

imputed_median <- fast_missing_impute(df = airquality, method = "median", cols = c("Ozone", "Solar.R"))
head(imputed_median)
#>   Ozone Solar.R Wind Temp Month Day
#> 1  41.0     190  7.4   67     5   1
#> 2  36.0     118  8.0   72     5   2
#> 3  12.0     149 12.6   74     5   3
#> 4  18.0     313 11.5   62     5   4
#> 5  31.5     205 14.3   56     5   5
#> 6  28.0     205 14.9   66     5   6

Another example, using the same columns, but now removing the rows in Ozone and Solar.R with missing data:

imputed_remove <- fast_missing_impute(df = airquality, method = "remove", cols = c("Ozone", "Solar.R"))
head(imputed_remove)
#>   Ozone Solar.R Wind Temp Month Day
#> 1    41     190  7.4   67     5   1
#> 2    36     118  8.0   72     5   2
#> 3    12     149 12.6   74     5   3
#> 4    18     313 11.5   62     5   4
#> 7    23     299  8.6   65     5   7
#> 8    19      99 13.8   59     5   8

Comparisons

Compared to existing options, redahelper:

Allows for a more streamlined user experience as many common exploratory data analysis issues can be taken care of with one line of code.
Does not modify the existing dataframe to allow the user to make decisions about whether they would like to make use of the output.
Is well documented and includes considerable error handling.

redahelper-vignette

Introduction to redahelper

Data: airquality

Identifying Outliers with `fast_outliers_id()`

Plotting variables with `fast_plot()`

Exploring correlations with `fast_corr()`

Identify and impute missing data with `fast_missing_impute()`

Comparisons

Contents

redahelper-vignette

Introduction to redahelper

Data: airquality

Identifying Outliers with fast_outliers_id()

Plotting variables with fast_plot()

Exploring correlations with fast_corr()

Identify and impute missing data with fast_missing_impute()

Comparisons

Contents

Identifying Outliers with `fast_outliers_id()`

Plotting variables with `fast_plot()`

Exploring correlations with `fast_corr()`

Identify and impute missing data with `fast_missing_impute()`