nedahelpeR-vignette.Rmd
library(nedahelpeR)
The goal of nedahelpeR is to simplify common, repetitive tasks in exploratory data analysis (EDA) and data preprocessing. It also gives data analysts quick insights, such as highly correlated variables and outliers, from a single function call.
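The package includes functions for the following tasks: imputing missing values (missing_imputer()), summarizing descriptive statistics (overview()), flagging variables with a high share of outliers (flag_outliers()), and finding highly correlated pairs of features (get_correlated_features()).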
The following toy data set shows how to use the package:
library(nedahelpeR)
df <- data.frame('col1'= c(-100,-200, 1,2,3,4,5,6,7,8,9,NA, 1000),
'col2'= c(1,2,3,4,5,6,7,8,9,10,11,12,13),
'col3'= c(-50, 1,2,3,4,5,6,7,8,9,10,11,50000))
df
#> col1 col2 col3
#> 1 -100 1 -50
#> 2 -200 2 1
#> 3 1 3 2
#> 4 2 4 3
#> 5 3 5 4
#> 6 4 6 5
#> 7 5 7 6
#> 8 6 8 7
#> 9 7 9 8
#> 10 8 10 9
#> 11 9 11 10
#> 12 NA 12 11
#> 13 1000 13 50000
missing_imputer()
missing_imputer() detects missing values in a numeric data frame and handles them with one of three strategies: drop, mean, or median. It either drops the missing values or replaces them with the mean or median of the other values in the same column. For example, we can impute the missing value in the first column with the median of the non-NA values in that column.
df <- missing_imputer(df, method="median")
df
#> col1 col2 col3
#> 1 -100.0 1 -50
#> 2 -200.0 2 1
#> 3 1.0 3 2
#> 4 2.0 4 3
#> 5 3.0 5 4
#> 6 4.0 6 5
#> 7 5.0 7 6
#> 8 6.0 8 7
#> 9 7.0 9 8
#> 10 8.0 10 9
#> 11 9.0 11 10
#> 12 4.5 12 11
#> 13 1000.0 13 50000
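To make the median strategy concrete, here is a minimal base R sketch of the same idea. It is not the package's implementation: `impute_median_sketch()` and `df_raw` are hypothetical names introduced only for this illustration.

```r
# Toy data from this vignette, before imputation.
df_raw <- data.frame(
  col1 = c(-100, -200, 1, 2, 3, 4, 5, 6, 7, 8, 9, NA, 1000),
  col2 = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13),
  col3 = c(-50, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 50000)
)

# Hypothetical helper (not part of nedahelpeR): replace each NA with the
# median of the non-missing values in the same column.
impute_median_sketch <- function(data) {
  data[] <- lapply(data, function(x) {
    x[is.na(x)] <- median(x, na.rm = TRUE)
    x
  })
  data
}

imputed <- impute_median_sketch(df_raw)
imputed$col1[12]
#> [1] 4.5
```

The twelve non-missing values of col1 have 4 and 5 as their two middle values, so the median is 4.5, which matches the value in row 12 above.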
overview()
overview() calculates common descriptive statistics of the input data. Users can choose how much information is returned, and they can also use the function to create statistical variables for use elsewhere in the environment.
overview(df, quiet=FALSE)
#> mean median standard.dev variance
#> col1 57.65385 4.5 289.69721 8.392447e+04
#> col2 7.00000 7.0 3.89444 1.516667e+01
#> col3 3847.38462 6.0 13867.14407 1.922977e+08
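For comparison, the same four statistics can be computed with base R alone. This is only a sketch of the calculation, not how overview() is implemented; `df_imputed` and `stats_sketch` are hypothetical names used for this example.

```r
# Toy data from this vignette, after median imputation of col1.
df_imputed <- data.frame(
  col1 = c(-100, -200, 1, 2, 3, 4, 5, 6, 7, 8, 9, 4.5, 1000),
  col2 = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13),
  col3 = c(-50, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 50000)
)

# One row per column, one named statistic per matrix column.
# sd() and var() use the sample (n - 1) denominator.
stats_sketch <- t(sapply(df_imputed, function(x) {
  c(
    mean = mean(x),
    median = median(x),
    standard.dev = sd(x),
    variance = var(x)
  )
}))

stats_sketch
```

Keeping the result as a matrix makes it easy to pull out a single value for later use, for example `stats_sketch["col1", "median"]`.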
flag_outliers()
flag_outliers() displays the numeric variables whose share of outliers, identified with the interquartile range method, exceeds a user-specified threshold. Users can then take note of these variables and decide what to do with them.
flag_outliers(df, threshold=0.2)
#> # A tibble: 1 × 1
#> col1
#> <dbl>
#> 1 0.231
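The interquartile range method can be sketched in base R as follows. The sketch assumes the conventional fences at 1.5 times the IQR beyond the quartiles and R's default quantile definition. The package may use different settings, and `outlier_share_sketch()` is a hypothetical helper, not a nedahelpeR function.

```r
# Toy data from this vignette, after median imputation of col1.
df_imputed <- data.frame(
  col1 = c(-100, -200, 1, 2, 3, 4, 5, 6, 7, 8, 9, 4.5, 1000),
  col2 = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13),
  col3 = c(-50, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 50000)
)

# Hypothetical helper: share of values outside [Q1 - k * IQR, Q3 + k * IQR].
# k = 1.5 is an assumption, not a documented package default.
outlier_share_sketch <- function(x, k = 1.5) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  lower <- q[[1]] - k * (q[[2]] - q[[1]])
  upper <- q[[2]] + k * (q[[2]] - q[[1]])
  mean(x < lower | x > upper, na.rm = TRUE)
}

shares <- sapply(df_imputed, outlier_share_sketch)

# Keep only the columns whose outlier share exceeds the threshold.
flagged <- shares[shares > 0.2]
flagged
```

Under these assumptions, col1 has three values outside its fences (-200, -100, and 1000), a share of 3/13, or about 0.231. col3 has two (-50 and 50000), about 0.154, which stays below the 0.2 threshold.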
get_correlated_features()
get_correlated_features() returns pairs of features whose correlation is above a threshold value. We can specify whether to check only the magnitude of the correlation or to also consider its sign (positive/negative).
get_correlated_features(df, threshold=0.7)
#> feature_1 feature_2 correlation
#> 1 col1 col3 0.98
#> 2 col3 col1 0.98
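The pair search can be approximated in base R with cor(). The sketch below assumes Pearson correlation and compares magnitudes by wrapping the matrix in abs(); dropping abs() would take the sign into account. The names `cors`, `idx`, and `pairs_sketch` are hypothetical, and the row order may differ from the package's output.

```r
# Toy data from this vignette, after median imputation of col1.
df_imputed <- data.frame(
  col1 = c(-100, -200, 1, 2, 3, 4, 5, 6, 7, 8, 9, 4.5, 1000),
  col2 = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13),
  col3 = c(-50, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 50000)
)

# Pairwise Pearson correlations between all columns.
cors <- cor(df_imputed)

# Positions above the threshold by magnitude, excluding the diagonal
# (every feature has correlation 1 with itself).
idx <- which(abs(cors) > 0.7 & row(cors) != col(cors), arr.ind = TRUE)

pairs_sketch <- data.frame(
  feature_1 = rownames(cors)[idx[, "row"]],
  feature_2 = colnames(cors)[idx[, "col"]],
  correlation = round(cors[idx], 2)
)
pairs_sketch
```

Because the correlation matrix is symmetric, each qualifying pair appears twice, once in each order, just as in the output above.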