In the early stages of a data science project, Exploratory Data Analysis (EDA) is a crucial step for performing an initial investigation of the dataset and inspiring valuable research questions.
Rcat is a package that makes it faster and easier to get started on EDA with a collection of convenient functions. It simplifies the process of detecting and dealing with missing and suspicious values, as well as finding relevant features.
This document introduces Rcat's functions and shows how you can apply them during your EDA process.
To explore the functions in Rcat, we'll use R's built-in dataset iris. This dataset contains 150 records and is documented in ?iris.
head(iris)
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1 5.1 3.5 1.4 0.2 setosa
#> 2 4.9 3.0 1.4 0.2 setosa
#> 3 4.7 3.2 1.3 0.2 setosa
#> 4 4.6 3.1 1.5 0.2 setosa
#> 5 5.0 3.6 1.4 0.2 setosa
#> 6 5.4 3.9 1.7 0.4 setosa
dim(iris)
#> [1] 150 5
summary(iris)
#> Sepal.Length Sepal.Width Petal.Length Petal.Width
#> Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
#> 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
#> Median :5.800 Median :3.000 Median :4.350 Median :1.300
#> Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
#> 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
#> Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
#> Species
#> setosa :50
#> versicolor:50
#> virginica :50
#>
#>
#>
repwithna(): replaces uninformative strings with NAs
misscat(): summarises and deals with missing values
suscat(): detects suspected erroneous numeric data
topcorr(): finds top-correlated features
repwithna()
Datasets may include uninformative strings, such as blank strings or strings that contain only symbols. This function replaces these strings with NAs so that they can be handled as missing values.
As there are no uninformative strings in iris, we will use the following generated data frame for this example.
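iris_df <- head(iris)
# Species is a factor in iris; convert it to character so that
# arbitrary strings can be assigned without being turned into NA.
iris_df$Species <- as.character(iris_df$Species)
iris_df[1, 1:3] <- NA
iris_df[3, 5] <- " "
iris_df[4, 5] <- "???"
iris_df[5, 5] <- "?setosa?"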
iris_df
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1 NA NA NA 0.2 setosa
#> 2 4.9 3.0 1.4 0.2 setosa
#> 3 4.7 3.2 1.3 0.2
#> 4 4.6 3.1 1.5 0.2 ???
#> 5 5.0 3.6 1.4 0.2 ?setosa?
#> 6 5.4 3.9 1.7 0.4 setosa
By default, empty strings (such as the blank string in row 3) are replaced with NA values.
repwithna(iris_df)
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1 NA NA NA 0.2 setosa
#> 2 4.9 3.0 1.4 0.2 setosa
#> 3 4.7 3.2 1.3 0.2 <NA>
#> 4 4.6 3.1 1.5 0.2 ???
#> 5 5.0 3.6 1.4 0.2 ?setosa?
#> 6 5.4 3.9 1.7 0.4 setosa
You can set the rmvsym argument to TRUE to also replace strings that contain only symbols.
repwithna(iris_df, rmvsym = TRUE)
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1 NA NA NA 0.2 setosa
#> 2 4.9 3.0 1.4 0.2 setosa
#> 3 4.7 3.2 1.3 0.2 <NA>
#> 4 4.6 3.1 1.5 0.2 <NA>
#> 5 5.0 3.6 1.4 0.2 ?setosa?
#> 6 5.4 3.9 1.7 0.4 setosa
You can also specify a pattern that all strings should follow by passing a regular expression to the format argument. Any string that does not match this pattern is replaced with NA.
repwithna(iris_df, format="^[?][a-z]+[?]$")
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1 NA NA NA 0.2 <NA>
#> 2 4.9 3.0 1.4 0.2 <NA>
#> 3 4.7 3.2 1.3 0.2 <NA>
#> 4 4.6 3.1 1.5 0.2 <NA>
#> 5 5.0 3.6 1.4 0.2 ?setosa?
#> 6 5.4 3.9 1.7 0.4 <NA>
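To make the default and rmvsym behaviour shown above more concrete, here is a minimal base-R sketch of the same idea. It is not Rcat's implementation: the helper name blank_to_na() and its remove_symbols argument are hypothetical, and the sketch assumes the affected columns are character vectors rather than factors.

```r
# Hypothetical helper (not part of Rcat): replace blank strings, and
# optionally symbol-only strings, with NA in every character column.
blank_to_na <- function(df, remove_symbols = FALSE) {
  # Whitespace-only strings by default; strings made up only of
  # punctuation and whitespace when remove_symbols is TRUE.
  pattern <- if (remove_symbols) "^[[:punct:][:space:]]*$" else "^[[:space:]]*$"
  for (col in names(df)) {
    if (is.character(df[[col]])) {
      df[[col]][grepl(pattern, df[[col]])] <- NA_character_
    }
  }
  df
}

toy <- data.frame(
  id = 1:4,
  label = c("setosa", " ", "???", "?setosa?"),
  stringsAsFactors = FALSE
)
cleaned_default <- blank_to_na(toy)
cleaned_symbols <- blank_to_na(toy, remove_symbols = TRUE)
```

In this sketch only the blank label becomes NA by default, while "???" is also replaced once remove_symbols is TRUE; "?setosa?" is kept in both cases because it contains letters.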
misscat()
With uninformative strings replaced by NAs using repwithna(), we can now deal with all missing values together. This function drops rows or columns in which the number of missing values exceeds a given threshold.
We can continue using the example iris_df data frame above, as it contains NAs. Setting a threshold drops every row that contains more NAs than the threshold allows. In the call below, row 1 has missing values in three of its five columns, so it is dropped.
misscat(iris_df, 0.5)
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 2 4.9 3.0 1.4 0.2 setosa
#> 3 4.7 3.2 1.3 0.2
#> 4 4.6 3.1 1.5 0.2 ???
#> 5 5.0 3.6 1.4 0.2 ?setosa?
#> 6 5.4 3.9 1.7 0.4 setosa
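The row-dropping step can be sketched in base R as follows. This assumes the threshold is the maximum allowed proportion of missing values per row, which is consistent with the output above but is an assumption rather than a documented fact about misscat(). The name drop_sparse_rows() is hypothetical.

```r
# Hypothetical helper (not part of Rcat): keep only the rows whose
# share of missing values does not exceed `threshold`.
drop_sparse_rows <- function(df, threshold) {
  na_share <- rowMeans(is.na(df))  # proportion of NA cells in each row
  df[na_share <= threshold, , drop = FALSE]
}

toy <- data.frame(
  a = c(NA, 1, 2),
  b = c(NA, 3, NA),
  c = c(NA, 5, 6),
  d = c(0.2, 0.2, 0.2),
  e = c("x", "y", "z")
)
kept <- drop_sparse_rows(toy, 0.5)
```

Here the first row of toy is missing three of five values (a share of 0.6) and is dropped, while the third row is missing only one (0.2) and is kept.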
suscat()
Datasets may include erroneous values, such as outliers. This function detects suspected erroneous numeric data in user-chosen columns.
You can detect suspected erroneous sepal length and width data in iris by percentage. Here we look for the most suspicious 1% of values; the result gives the row indices of the questionable values for each column.
suscat(iris, c("Sepal.Length", "Sepal.Width"), n=1, num="percent")
#> $Sepal.Length
#> [1] 14 132
#>
#> $Sepal.Width
#> [1] 16 61
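For comparison, a common base-R way to flag candidate outliers is Tukey's fences, which mark values lying more than 1.5 times the interquartile range beyond the quartiles. This is a different technique and not necessarily the rule suscat() applies, so its results need not match the output above. The helper name tukey_outliers() is hypothetical.

```r
# Hypothetical helper (not part of Rcat): indices of values lying
# outside Tukey's fences, Q1 - k * IQR and Q3 + k * IQR.
tukey_outliers <- function(x, k = 1.5) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  spread <- q[2] - q[1]  # interquartile range
  which(x < q[1] - k * spread | x > q[2] + k * spread)
}

values <- c(10, 11, 12, 13, 14, 15, 16, 17, 18, 100)
flagged <- tukey_outliers(values)

# Apply the same check to several columns of a data frame;
# the result is a named list with one index vector per column.
per_column <- lapply(iris[c("Sepal.Length", "Sepal.Width")], tukey_outliers)
```

Unlike a fixed percentage, this rule may flag no values at all in a column whose distribution has no extreme tails.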
You can also specify the exact number of outliers to detect instead of a percentage.
There are several existing packages in R that implement similar functionality.
SmartEDA: This package generates descriptive statistics and visualisations for data frames. An HTML EDA report is also available.
DataExplorer: This package can analyze and visualize each variable in a data frame. It also includes common data processing methods for wrangling.
inspectdf: This package offers column-wise summary, comparison and visualisation of data frames.
These packages all provide functions for reporting missing values and correlations. Only SmartEDA has a function that runs univariate outlier analysis, and only DataExplorer has a function for dealing with missing values, which sets all missing values to an indicated value.
Thus, the R ecosystem has many well-defined packages with useful functions for EDA, but no single package yet combines these different EDA methods. With our package, we hope to bring these functions together so that users can deal with missing values, outliers and correlations in one simple way while exploring a data set.