Exploratory Data Analysis (EDA) is a crucial early step in a data science project: it provides an initial investigation of the dataset and can inspire valuable research questions.

Rcat is a package that makes it faster and easier to get started with EDA through a collection of convenient functions. It simplifies detecting and dealing with missing and suspicious values, as well as finding relevant features.

This document introduces Rcat’s functions and shows how you can apply them during your EDA process.

Data

To explore the functions in Rcat, we’ll use R’s built-in iris dataset. It contains 150 records and is documented in ?iris.

head(iris)
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1          5.1         3.5          1.4         0.2  setosa
#> 2          4.9         3.0          1.4         0.2  setosa
#> 3          4.7         3.2          1.3         0.2  setosa
#> 4          4.6         3.1          1.5         0.2  setosa
#> 5          5.0         3.6          1.4         0.2  setosa
#> 6          5.4         3.9          1.7         0.4  setosa
dim(iris)
#> [1] 150   5
summary(iris)
#>   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
#>  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
#>  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
#>  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
#>  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
#>  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
#>  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
#>        Species  
#>  setosa    :50  
#>  versicolor:50  
#>  virginica :50  
#>                 
#>                 
#> 
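Before moving on, it can help to confirm the column types and check whether the data already contains missing values. The following is a small base R check that does not rely on Rcat:

```r
# Column classes: four numeric measurements and one factor (Species)
sapply(iris, class)

# Count missing values per column; iris has none
colSums(is.na(iris))

# Single yes/no answer for the whole data frame
anyNA(iris)
```

Note that Species is stored as a factor, which matters when assigning arbitrary strings to it later.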

Functions in the package

  • repwithna(): replaces uninformative strings with NAs
  • misscat(): summarises and deals with missing values
  • suscat(): detects suspected erroneous numeric data
  • topcorr(): finds top-correlated features

Replace uninformative strings with repwithna()

Datasets can include uninformative strings, such as blank strings or strings made up only of symbols. This function replaces these strings with NAs so they can be handled as missing values.

Because iris contains no uninformative strings, we will use the following modified data frame for this example.

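iris_df <- head(iris)
# Species is a factor in iris; convert it to character so that
# arbitrary strings can be assigned without being turned into NA
iris_df$Species <- as.character(iris_df$Species)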
iris_df[1, 1:3] <- NA
iris_df[3, 5] <- "  "
iris_df[4, 5] <- "???"
iris_df[5, 5] <- "?setosa?"
iris_df
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width  Species
#> 1           NA          NA           NA         0.2   setosa
#> 2          4.9         3.0          1.4         0.2   setosa
#> 3          4.7         3.2          1.3         0.2         
#> 4          4.6         3.1          1.5         0.2      ???
#> 5          5.0         3.6          1.4         0.2 ?setosa?
#> 6          5.4         3.9          1.7         0.4   setosa

By default, blank strings are replaced with NA.

repwithna(iris_df)
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width  Species
#> 1           NA          NA           NA         0.2   setosa
#> 2          4.9         3.0          1.4         0.2   setosa
#> 3          4.7         3.2          1.3         0.2     <NA>
#> 4          4.6         3.1          1.5         0.2      ???
#> 5          5.0         3.6          1.4         0.2 ?setosa?
#> 6          5.4         3.9          1.7         0.4   setosa

Set the rmvsym argument to TRUE to also replace strings that contain only symbols.

repwithna(iris_df, rmvsym = TRUE)
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width  Species
#> 1           NA          NA           NA         0.2   setosa
#> 2          4.9         3.0          1.4         0.2   setosa
#> 3          4.7         3.2          1.3         0.2     <NA>
#> 4          4.6         3.1          1.5         0.2     <NA>
#> 5          5.0         3.6          1.4         0.2 ?setosa?
#> 6          5.4         3.9          1.7         0.4   setosa

You can also specify the pattern that all strings must follow by passing a regular expression to the format argument. Any string that does not match this pattern is replaced with NA.

repwithna(iris_df, format="^[?][a-z]+[?]$")
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width  Species
#> 1           NA          NA           NA         0.2     <NA>
#> 2          4.9         3.0          1.4         0.2     <NA>
#> 3          4.7         3.2          1.3         0.2     <NA>
#> 4          4.6         3.1          1.5         0.2     <NA>
#> 5          5.0         3.6          1.4         0.2 ?setosa?
#> 6          5.4         3.9          1.7         0.4     <NA>
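To make the three behaviours above concrete, here is a minimal base R sketch that mimics them on a single character vector. It is not Rcat's implementation: the helper name replace_uninformative() and its arguments symbols_only and pattern are hypothetical, and treating "symbols" as punctuation characters is an assumption.

```r
# Hypothetical helper, not part of Rcat: replace uninformative strings with NA
replace_uninformative <- function(x, symbols_only = FALSE, pattern = NULL) {
  stopifnot(is.character(x))
  # Blank strings: empty or whitespace only
  drop <- grepl("^[[:space:]]*$", x)
  if (symbols_only) {
    # Assumption: "only symbols" means only punctuation characters
    drop <- drop | grepl("^[[:punct:]]+$", x)
  }
  if (!is.null(pattern)) {
    # Anything that does not match the required pattern
    drop <- drop | !grepl(pattern, x)
  }
  x[drop] <- NA_character_
  x
}

species <- c("setosa", "  ", "???", "?setosa?")
replace_uninformative(species)                              # only "  " becomes NA
replace_uninformative(species, symbols_only = TRUE)         # "  " and "???" become NA
replace_uninformative(species, pattern = "^[?][a-z]+[?]$")  # only "?setosa?" survives
```

The pattern "^[?][a-z]+[?]$" requires a question mark, one or more lowercase letters, and a closing question mark, which is why "???" fails it while "?setosa?" passes.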

Deal with missing values using misscat()

Once repwithna() has replaced uninformative strings with NAs, we can deal with all missing values together. misscat() drops rows or columns in which the missing values exceed a given threshold.

We can continue with the example iris_df data frame above, as it contains NAs. Here we set a threshold of 0.5, and row 1 is dropped because three of its five values are missing.

misscat(iris_df, 0.5)
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width  Species
#> 2          4.9         3.0          1.4         0.2   setosa
#> 3          4.7         3.2          1.3         0.2         
#> 4          4.6         3.1          1.5         0.2      ???
#> 5          5.0         3.6          1.4         0.2 ?setosa?
#> 6          5.4         3.9          1.7         0.4   setosa
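The row-dropping idea can be sketched in base R as follows. This assumes the threshold is the proportion of missing values in a row, which is one reading of the example above and not a confirmed description of misscat(); the helper name drop_sparse_rows() is hypothetical.

```r
# Hypothetical helper, not part of Rcat
drop_sparse_rows <- function(df, threshold) {
  # Proportion of missing values in each row
  miss_prop <- rowMeans(is.na(df))
  # Keep rows whose missing proportion does not exceed the threshold
  df[miss_prop <= threshold, , drop = FALSE]
}

df <- head(iris)
df[1, 1:3] <- NA   # row 1: 3 of 5 values missing (0.6)
df[2, 1] <- NA     # row 2: 1 of 5 values missing (0.2)

kept <- drop_sparse_rows(df, 0.5)
rownames(kept)     # row 1 is gone, row 2 is kept
```

The same approach works for columns by swapping rowMeans() for colMeans() and subsetting columns instead of rows.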

Detect suspected erroneous numeric data with suscat()

Datasets can include erroneous values, which often show up as outliers. This function detects suspected erroneous numeric data in user-chosen columns.

You can detect suspected erroneous sepal length and width values in iris by percentage. Here we look for the most suspicious 1% of values; the result gives the row indices of the questionable values for each column.

suscat(iris, c("Sepal.Length", "Sepal.Width"), n=1, num="percent")
#> $Sepal.Length
#> [1]  14 132
#> 
#> $Sepal.Width
#> [1] 16 61

Instead of a percentage, you can also specify the exact number of suspicious values to detect.

suscat(iris, c("Sepal.Length"), n=2, num="number")
#> $Sepal.Length
#> [1]  14 132
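For intuition, one common way to flag suspicious numeric values is to rank them by how far they lie from the mean. The sketch below does this with absolute z-scores in base R. The scoring rule is an assumption chosen for illustration, the helper name most_extreme() is hypothetical, and the results need not match what suscat() returns.

```r
# Hypothetical helper, not part of Rcat: indices of the n most extreme values
most_extreme <- function(x, n) {
  # Absolute z-scores: distance from the mean in standard deviations
  z <- abs((x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE))
  # Largest z-scores first; keep the first n indices, in ascending order
  sort(order(z, decreasing = TRUE)[seq_len(n)])
}

# Toy vector with two obviously implausible entries (positions 3 and 6)
x <- c(5.1, 4.9, 50, 5.0, 5.2, -40, 5.1)
most_extreme(x, 2)
```

A percentage-based variant could convert the percentage to a count first, for example with ceiling(length(x) * p / 100).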

Find top-correlated features with topcorr()

During EDA, we often want to know whether the dataset contains correlated pairs of features, which can inform later stages of the analysis. The last function calculates the correlation between columns and returns the top-correlated feature pairs in the dataset.

For example, you can use it to find the top 5 correlated feature pairs in iris:

topcorr(iris, k=5)
#>      Feature 1    Feature 2 Aboslute Correlation
#> 1  Petal.Width Petal.Length               0.9629
#> 2 Petal.Length Sepal.Length               0.8718
#> 3  Petal.Width Sepal.Length               0.8179
#> 4 Petal.Length  Sepal.Width               0.4284
#> 5  Petal.Width  Sepal.Width               0.3661
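The same kind of ranking can be reproduced with base R's cor(). The sketch below is an illustration and not Rcat's implementation: the helper name top_pairs() and its output column names are hypothetical, and it assumes Pearson correlation on the numeric columns only.

```r
# Hypothetical helper, not part of Rcat: top-k feature pairs by absolute correlation
top_pairs <- function(df, k) {
  # Keep numeric columns only (drops Species in iris)
  num <- df[vapply(df, is.numeric, logical(1))]
  cm <- cor(num, use = "pairwise.complete.obs")
  # Each pair once: take the upper triangle of the correlation matrix
  idx <- which(upper.tri(cm), arr.ind = TRUE)
  out <- data.frame(
    feature_1 = rownames(cm)[idx[, "row"]],
    feature_2 = colnames(cm)[idx[, "col"]],
    abs_corr  = round(abs(cm[idx]), 4)
  )
  # Strongest absolute correlation first
  out <- out[order(out$abs_corr, decreasing = TRUE), ]
  rownames(out) <- NULL
  head(out, k)
}

res <- top_pairs(iris, 2)
res
```

Ranking by absolute value means that strong negative correlations are reported alongside strong positive ones.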

How Rcat fits into the R ecosystem

There are several existing packages in R that implement similar functionality.

  • SmartEDA: This package generates descriptive statistics and visualisations for data frames. An HTML EDA report is also available.

  • DataExplorer: This package can analyze and visualize each variable in a data frame. It also includes common data processing methods for wrangling.

  • inspectdf: This package offers column-wise summary, comparison and visualisation of data frames.

These packages all provide functions that report missing values and correlations. Only SmartEDA has a function that runs univariate outlier analysis, and only DataExplorer has a function for handling missing values, which sets all of them to a specified value.

Thus, the R ecosystem has many well-designed packages with useful functions for EDA, but no single package yet brings these different EDA methods together. With Rcat, we hope to combine these functions so that users can deal with missing values, outliers and correlations in one simple way while exploring a dataset.