collinearityR-vignette

Identify multicollinearity issues by correlation, VIF, and visualizations. The collinearityR package is designed for R beginners who want to identify multicollinearity issues by calling a few simple functions. It automates the process of building a proper correlation matrix, creating a correlation heatmap and identifying highly correlated pairs of variables.

This document introduces you to collinearityR’s basic set of tools and demonstrates how to apply them to data frames.

Data: mpg

We will use the mpg dataset from the ggplot2 package to explore the multicollinearity tools of collinearityR. This dataset contains 234 observations and 11 variables.

data <- ggplot2::mpg
dim(data)
#> [1] 234  11
data
#> # A tibble: 234 x 11
#>    manufacturer model      displ  year   cyl trans drv     cty   hwy fl    class
#>    <chr>        <chr>      <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
#>  1 audi         a4           1.8  1999     4 auto~ f        18    29 p     comp~
#>  2 audi         a4           1.8  1999     4 manu~ f        21    29 p     comp~
#>  3 audi         a4           2    2008     4 manu~ f        20    31 p     comp~
#>  4 audi         a4           2    2008     4 auto~ f        21    30 p     comp~
#>  5 audi         a4           2.8  1999     6 auto~ f        16    26 p     comp~
#>  6 audi         a4           2.8  1999     6 manu~ f        18    26 p     comp~
#>  7 audi         a4           3.1  2008     6 auto~ f        18    27 p     comp~
#>  8 audi         a4 quattro   1.8  1999     4 manu~ 4        18    26 p     comp~
#>  9 audi         a4 quattro   1.8  1999     4 auto~ 4        16    25 p     comp~
#> 10 audi         a4 quattro   2    2008     4 manu~ 4        20    28 p     comp~
#> # ... with 224 more rows

Correlation Matrix and its Longer Form

corr_matrix() allows you to calculate the Pearson correlation coefficients for all numeric variables. You can also round the coefficients to the desired number of decimals. The output is a list containing the longer form of the correlation matrix and the generic correlation matrix, so you can pick whichever form suits your task. The longer form can be used to plot a heatmap directly, without further data manipulation.

For example, we can calculate the correlation matrix and its longer form using all the numerical columns in mpg.

library(collinearityR)

corr_matrix(data, decimals = 2)[1]
#> [[1]]
#> # A tibble: 25 x 4
#>    variable1 variable2 correlation rounded_corr
#>    <chr>     <chr>           <dbl>        <dbl>
#>  1 displ     displ         1               1   
#>  2 displ     year          0.148           0.15
#>  3 displ     cyl           0.930           0.93
#>  4 displ     cty          -0.799          -0.8 
#>  5 displ     hwy          -0.766          -0.77
#>  6 year      displ         0.148           0.15
#>  7 year      year          1               1   
#>  8 year      cyl           0.122           0.12
#>  9 year      cty          -0.0372         -0.04
#> 10 year      hwy           0.00216         0   
#> # ... with 15 more rows
corr_matrix(data, decimals = 2)[2]
#> [[1]]
#>            displ         year        cyl         cty          hwy
#> displ  1.0000000  0.147842816  0.9302271 -0.79852397 -0.766020021
#> year   0.1478428  1.000000000  0.1222453 -0.03723229  0.002157643
#> cyl    0.9302271  0.122245347  1.0000000 -0.80577141 -0.761912354
#> cty   -0.7985240 -0.037232291 -0.8057714  1.00000000  0.955915914
#> hwy   -0.7660200  0.002157643 -0.7619124  0.95591591  1.000000000
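
To make the two output forms concrete, here is a minimal base R sketch of the same idea. It is not collinearityR's implementation. It uses the built-in mtcars dataset so that it runs without extra packages, and the names corr_wide and corr_long are hypothetical. The column names of the long form simply mirror those shown above.

```r
# Sketch only: an assumed base R equivalent, not the package's own code.
# Pick a few numeric columns from the built-in mtcars data.
num_cols <- mtcars[, c("mpg", "disp", "cyl", "hp")]

# Generic (square) Pearson correlation matrix.
corr_wide <- cor(num_cols, method = "pearson")

# Longer form: one row per ordered pair of variables.
corr_long <- as.data.frame(as.table(corr_wide), stringsAsFactors = FALSE)
names(corr_long) <- c("variable1", "variable2", "correlation")

# Rounded copy of the coefficients, here to 2 decimals.
corr_long$rounded_corr <- round(corr_long$correlation, 2)

head(corr_long)
```

The longer form has one row for each of the 4 x 4 ordered pairs, including each variable paired with itself, which is the shape a tile-based heatmap expects.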

Correlation Heatmap

corr_heatmap() allows you to visualize the correlations by making a heatmap. This function plots data as a color-encoded Pearson correlation matrix using the longer form output returned from corr_matrix(). You can individually specify the colors for negative and positive correlations.

For example, we can plot a heatmap of all the numerical columns in mpg.

corr_heatmap(data)
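
For readers curious how such a heatmap can be built from the longer form, the following is a rough ggplot2 sketch. It is an assumption about one reasonable approach, not the code behind corr_heatmap(). The dataset (mtcars), the object names and the two colours are all illustrative choices.

```r
# Sketch only: a hand-rolled heatmap, not the package's own plotting code.
library(ggplot2)

# Longer-form correlations for a few numeric columns of mtcars.
corr_wide <- cor(mtcars[, c("mpg", "disp", "cyl", "hp")])
corr_long <- as.data.frame(as.table(corr_wide), stringsAsFactors = FALSE)
names(corr_long) <- c("variable1", "variable2", "correlation")

# Hypothetical colour choices for negative and positive correlations.
neg_colour <- "steelblue"
pos_colour <- "firebrick"

p <- ggplot(corr_long, aes(x = variable1, y = variable2, fill = correlation)) +
  geom_tile() +
  geom_text(aes(label = round(correlation, 2))) +
  # Diverging scale centred at 0 and fixed to the full range of Pearson's r.
  scale_fill_gradient2(low = neg_colour, mid = "white", high = pos_colour,
                       midpoint = 0, limits = c(-1, 1)) +
  labs(x = NULL, y = NULL, fill = "Pearson r")

p
```

Fixing the scale limits to [-1, 1] keeps the colours comparable between plots of different datasets.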

Data Frame and Bar Chart of Variance Inflation Factors (VIF)

vif_bar_plot() allows you to fit a linear regression, calculate the VIF scores and plot them with a single function call. The output is a list containing a tibble of the VIF scores and a bar chart that shows the VIF score of each explanatory variable alongside the specified threshold. Plotting the VIF scores against an adjustable threshold helps you quickly identify the multicollinear variables.

For example, we can calculate and visualize the VIF scores in a linear regression using some of the columns in mpg.

vif_bar_plot(c("displ", "cyl", "hwy"), "year", data, 5)[[1]]
#> # A tibble: 3 x 2
#>   vif_score explanatory_var
#>       <dbl> <chr>          
#> 1      7.88 displ          
#> 2      7.76 cyl            
#> 3      2.53 hwy
vif_bar_plot(c("displ", "cyl", "hwy"), "year", data, 5)[[2]]
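
As background, the VIF of an explanatory variable is 1 / (1 - R^2), where R^2 comes from regressing that variable on the other explanatory variables. The sketch below computes it by hand in base R. It is illustrative only and not the package's internals. The helper vif_manual() is hypothetical, and mtcars stands in for mpg so that no extra packages are needed.

```r
# Sketch only: the textbook VIF definition, not collinearityR's code.
# vif_manual() is a hypothetical helper written for this example.
vif_manual <- function(data, explanatory) {
  sapply(explanatory, function(v) {
    others <- setdiff(explanatory, v)
    # Regress one explanatory variable on all the others.
    fit <- lm(reformulate(others, response = v), data = data)
    1 / (1 - summary(fit)$r.squared)
  })
}

explanatory <- c("disp", "cyl", "hp")
vif_scores <- vif_manual(mtcars, explanatory)

# Flag variables whose score exceeds an adjustable threshold.
threshold <- 5
flagged <- names(vif_scores)[vif_scores > threshold]

vif_scores
flagged
```

Note that this calculation never uses the response variable: the VIF depends only on how the explanatory variables relate to each other.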

Multicollinearity Identification based on Pearson Coefficient and VIF Scores

col_identify() helps you decide which explanatory variables to eliminate from a linear regression model by combining Pearson’s coefficient and VIF scores. The output is a data frame containing Pearson’s coefficient, VIF scores and the explanatory variables suggested for elimination. If no multicollinearity is detected, the output is an empty data frame. The function uses corr_matrix() and vif_bar_plot() internally.

For example, we can combine Pearson’s coefficient and VIF scores to check a linear regression model that uses some of the columns in mpg.

col_identify(data, c("displ", "cyl", "hwy", "cty"), "year",
             corr_min = -0.8, corr_max = 0.8,
             vif_limit = 5)
#> Joining, by = "variable1"
#> # A tibble: 3 x 5
#> # Groups:   pair [3]
#>   variable correlation rounded_corr pair      vif_score
#>   <chr>          <dbl>        <dbl> <list>        <dbl>
#> 1 cty            0.956         0.96 <chr [2]>     13.9 
#> 2 cty           -0.806        -0.81 <chr [2]>     13.9 
#> 3 cyl            0.930         0.93 <chr [2]>      8.17
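
The sketch below shows one plausible way to combine the two criteria in base R. The selection rule is an assumption made for illustration (within each highly correlated pair, suggest the variable with the larger VIF if it exceeds the limit) and may differ from what col_identify() actually does. The object names are hypothetical and mtcars replaces mpg to keep the example dependency-free.

```r
# Sketch only: an assumed selection rule, not the logic of col_identify().
explanatory <- c("disp", "cyl", "hp", "wt")
corr_min <- -0.8
corr_max <- 0.8
vif_limit <- 5

corr_wide <- cor(mtcars[, explanatory])

# VIF scores are the diagonal of the inverse correlation matrix.
vif_scores <- setNames(diag(solve(corr_wide)), explanatory)

# Unordered pairs whose correlation falls outside (corr_min, corr_max).
idx <- which(upper.tri(corr_wide) &
               (corr_wide <= corr_min | corr_wide >= corr_max),
             arr.ind = TRUE)
pairs <- data.frame(
  variable1 = rownames(corr_wide)[idx[, "row"]],
  variable2 = colnames(corr_wide)[idx[, "col"]],
  correlation = corr_wide[idx],
  stringsAsFactors = FALSE
)

# Within each pair, suggest the variable with the larger VIF.
vif1 <- unname(vif_scores[pairs$variable1])
vif2 <- unname(vif_scores[pairs$variable2])
pairs$suggested <- ifelse(vif1 >= vif2, pairs$variable1, pairs$variable2)
pairs$vif_score <- pmax(vif1, vif2)

# Keep only suggestions whose VIF exceeds the limit.
result <- pairs[pairs$vif_score > vif_limit, ]
result
```

If no pair crosses the correlation thresholds, or no suggested variable exceeds the VIF limit, result is an empty data frame.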