Introduction to nedahelpeR

The goal of nedahelpeR is to simplify common, repetitive tasks that data analysts face during EDA and data preprocessing. It also adds value to their workflow by surfacing useful insights, such as highly correlated variables and outliers, with a single function call.

The package includes functions which can complete the following tasks:

  • Handle missing values
  • Display some useful statistics
  • Detect outliers
  • Check for correlation between features

Data

The following toy data set shows how to use the package:

library(nedahelpeR)
df <- data.frame('col1'= c(-100,-200, 1,2,3,4,5,6,7,8,9,NA, 1000), 
                'col2'= c(1,2,3,4,5,6,7,8,9,10,11,12,13),
                'col3'= c(-50, 1,2,3,4,5,6,7,8,9,10,11,50000))
df
#>    col1 col2  col3
#> 1  -100    1   -50
#> 2  -200    2     1
#> 3     1    3     2
#> 4     2    4     3
#> 5     3    5     4
#> 6     4    6     5
#> 7     5    7     6
#> 8     6    8     7
#> 9     7    9     8
#> 10    8   10     9
#> 11    9   11    10
#> 12   NA   12    11
#> 13 1000   13 50000

Handle Missing Values with missing_imputer()

missing_imputer() detects missing values in a numeric data frame and handles them with one of three strategies: drop, mean or median. It either drops the missing values or replaces them with the mean or median of the other values in the same column. For example, we can impute the missing value in the first column with the median of the non-NA values in that column.

df <- missing_imputer(df, method="median")
df
#>      col1 col2  col3
#> 1  -100.0    1   -50
#> 2  -200.0    2     1
#> 3     1.0    3     2
#> 4     2.0    4     3
#> 5     3.0    5     4
#> 6     4.0    6     5
#> 7     5.0    7     6
#> 8     6.0    8     7
#> 9     7.0    9     8
#> 10    8.0   10     9
#> 11    9.0   11    10
#> 12    4.5   12    11
#> 13 1000.0   13 50000
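
To make the median strategy concrete, here is a minimal base R sketch of the same idea. It is not the package's implementation; the helper name impute_median() is hypothetical and is used only for illustration.

```r
# Hypothetical helper, not part of nedahelpeR: replace each NA in a
# numeric column with the median of that column's non-NA values.
impute_median <- function(data) {
  data[] <- lapply(data, function(col) {
    if (is.numeric(col)) {
      col[is.na(col)] <- median(col, na.rm = TRUE)
    }
    col
  })
  data
}

# Same first column as the toy data above, with one missing value.
toy <- data.frame(
  col1 = c(-100, -200, 1, 2, 3, 4, 5, 6, 7, 8, 9, NA, 1000),
  col2 = as.numeric(1:13)
)

imputed <- impute_median(toy)
imputed$col1[12]  # 4.5, the median of the twelve non-NA values
```

With twelve non-NA values, the median is the average of the sixth and seventh sorted values (4 and 5), which is why the imputed value is 4.5.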

Display common statistical values with overview()

overview() calculates common descriptive statistics for the input data. Users can choose how much information is returned, and can also use the function to create statistical variables for use elsewhere in the environment.

overview(df, quiet=FALSE)
#>            mean median standard.dev     variance
#> col1   57.65385    4.5    289.69721 8.392447e+04
#> col2    7.00000    7.0      3.89444 1.516667e+01
#> col3 3847.38462    6.0  13867.14407 1.922977e+08
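
For readers who want to see where these numbers come from, the sketch below computes the same four statistics with base R. The function describe_numeric() is a hypothetical stand-in, not the package's code, and it assumes that only numeric columns are summarised.

```r
# Hypothetical helper, not part of nedahelpeR: one row per numeric
# column, one column per statistic.
describe_numeric <- function(data) {
  numeric_cols <- data[vapply(data, is.numeric, logical(1))]
  t(sapply(numeric_cols, function(col) {
    c(mean         = mean(col, na.rm = TRUE),
      median       = median(col, na.rm = TRUE),
      standard.dev = sd(col, na.rm = TRUE),
      variance     = var(col, na.rm = TRUE))
  }))
}

toy <- data.frame(
  col1 = c(-100, -200, 1, 2, 3, 4, 5, 6, 7, 8, 9, 4.5, 1000),
  col2 = as.numeric(1:13)
)

stats <- describe_numeric(toy)
stats["col2", ]
```

Because the result is an ordinary matrix, individual values such as stats["col2", "mean"] can be assigned to variables and reused later in the analysis.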

Detect outliers with flag_outliers()

flag_outliers() displays the numeric variables whose percentage of outliers exceeds a user-specified threshold, using the interquartile range method. Users can then take note of the variables with a high percentage of outliers and decide what to do with them.

flag_outliers(df, threshold=0.2)
#> # A tibble: 1 × 1
#>    col1
#>   <dbl>
#> 1 0.231
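
The interquartile range method can be sketched in a few lines of base R. This is an illustration under stated assumptions, not the package's code: it assumes the conventional 1.5 × IQR fences and R's default quantile type, and the helper name outlier_share() is hypothetical.

```r
# Hypothetical helper, not part of nedahelpeR: share of values lying
# outside the fences Q1 - 1.5 * IQR and Q3 + 1.5 * IQR.
outlier_share <- function(x) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  fence <- 1.5 * (q[2] - q[1])
  mean(x < q[1] - fence | x > q[2] + fence, na.rm = TRUE)
}

# col1 of the toy data after median imputation.
col1 <- c(-100, -200, 1, 2, 3, 4, 5, 6, 7, 8, 9, 4.5, 1000)

share <- outlier_share(col1)
share > 0.2  # TRUE, so col1 would be flagged at threshold = 0.2
```

Under these assumptions the quartiles of col1 are 2 and 7, so the fences sit at -5.5 and 14.5. Three of the thirteen values (-200, -100 and 1000) fall outside them, giving a share of 3/13, or about 0.231.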

Check for correlation between features with get_correlated_features()

get_correlated_features() returns pairs of features whose correlation is above a threshold value. We can specify whether to check only the magnitude of the correlation or to also consider its sign (positive/negative).

get_correlated_features(df, threshold=0.7)
#>   feature_1 feature_2 correlation
#> 1      col1      col3        0.98
#> 2      col3      col1        0.98
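
The underlying idea can be sketched with base R's cor(). The helper correlated_pairs() below is hypothetical and is not the package's implementation. It assumes Pearson correlation and compares only the magnitude of each coefficient to the threshold; dropping abs() would make the comparison sign-sensitive.

```r
# Hypothetical helper, not part of nedahelpeR: list every ordered pair
# of distinct columns whose absolute correlation exceeds the threshold.
correlated_pairs <- function(data, threshold = 0.7) {
  corr <- cor(data, use = "pairwise.complete.obs")
  diag(corr) <- NA  # ignore each feature's correlation with itself
  hits <- which(abs(corr) > threshold, arr.ind = TRUE)
  data.frame(
    feature_1   = rownames(corr)[hits[, "row"]],
    feature_2   = colnames(corr)[hits[, "col"]],
    correlation = round(corr[hits], 2)
  )
}

# Illustrative data: y is a multiple of x, z alternates in sign.
example <- data.frame(
  x = 1:10,
  y = 2 * (1:10),
  z = rep(c(1, -1), times = 5)
)

pairs <- correlated_pairs(example, threshold = 0.7)
pairs
```

As in the output above, each qualifying pair appears twice, once in each order, because the correlation matrix is symmetric.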

Key Advantages

  • The package provides functions for both EDA and data preprocessing.
  • It is much more lightweight than most other EDA packages.
  • It offers a lot of flexibility and customization.