Summary

We build a multiple linear regression model to quantify the association between socioeconomic factors and COVID-19 prevalence (measured as cases per 100,000 population) across US counties. Factors such as the percentage of smokers, income ratio, population density, and unemployment rate are explored. Our final regression model suggests that the percentage of smokers, the teen birth rate, the unemployment rate, and a few interaction terms are significantly associated with COVID-19 prevalence at the 0.05 level. However, the original data set contained over 200 features and only an arbitrarily chosen subset was analysed, so other socioeconomic features that are significantly associated with COVID-19 prevalence may remain to be explored.

Introduction

COVID-19 is a serious pandemic that has introduced a wide variety of challenges since 2019. Higher COVID-19 caseloads have been associated with countries that have higher obesity rates and a higher median population age (Chaudhry et al. 2020). By analysing the association of certain socioeconomic factors with COVID-19 prevalence, we hope to shed some light on the societal features that may be associated with a high number of COVID-19 cases. Identifying these factors could also help policymakers and leaders make more informed decisions in combating COVID-19.

Methods

Data

The original data set used in this project describes US social determinants of health by county and was created by Dr. John Davis at Indiana University in the United States. Each row in the original data set represents one day for one county, with the cumulative number of COVID-19 cases and the socioeconomic features of that county. There are over 790,000 rows and over 200 features in the data set. We identified a subset of these features that we were interested in and also added a few “wildcard” features, such as the teen birth rate and the chlamydia rate, which might be related to broader social determinants of public health. In the future, additional features could be chosen as they become of interest to the team or are requested by the community.

The data set reports time series data per county for the cumulative COVID-19 cases and different socioeconomic features. However, due to limits in measurements and reporting, COVID-19 cases and socioeconomic features were updated at irregular intervals (e.g. COVID-19 cases were reported daily, whereas the socioeconomic features were reported no more than once a month). Thus, we created summary statistics for the socioeconomic features per county such as the mean percentage of smokers.
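The per-county aggregation described above can be sketched as follows. The analysis itself was carried out in R, so this Python snippet is only an illustration: the row layout, the `summarise_by_county` helper, and the sample values are assumptions made up for the example and are not taken from the data set.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical daily rows: (county, state, cumulative_cases, percent_smokers).
# All values are invented for illustration.
daily_rows = [
    ("A", "X", 10, 18.0),
    ("A", "X", 25, 18.0),
    ("A", "X", 40, 19.0),
    ("B", "Y", 5, 12.0),
    ("B", "Y", 9, 14.0),
]


def summarise_by_county(rows):
    """Collapse daily rows into one summary row per (county, state)."""
    grouped = defaultdict(list)
    for county, state, cases, smokers in rows:
        grouped[(county, state)].append((cases, smokers))

    summary = {}
    for key, values in grouped.items():
        summary[key] = {
            # The case count is cumulative, so its maximum is taken as the total.
            "max_cases": max(cases for cases, _ in values),
            # Features reported at irregular intervals are averaged over time.
            "percent_smokers": mean(smokers for _, smokers in values),
        }
    return summary


county_summary = summarise_by_county(daily_rows)
for key, stats in county_summary.items():
    print(key, stats)
```

Grouping on both county and state keeps counties that share a name in different states apart; whether the original pipeline grouped this way is not stated in the report.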

The processed data contains 1621 observations and 18 features where each row corresponds to the cumulative COVID-19 cases, and aggregated socioeconomic features for each county. All missing values were removed during data wrangling.

Features table
Feature                        Explanation
county                         name of county
state                          name of state
max_cases                      maximum number of COVID-19 cases
avg_growth_rate                average COVID-19 case growth rate per day
max_growth_rate                maximum COVID-19 case growth rate per day
total_population               total population
num_deaths                     total number of deaths
percent_smokers                percentage of smokers in the total population
percent_vaccinated             percentage of the total population that is vaccinated
income_ratio                   income index
population_density_per_sqmi    population density per square mile
percent_fair_or_poor_health    percentage of the total population in fair or poor health
percent_unemployed_CHR         unemployment rate measured by CHR
violent_crime_rate             violent crime rate
chlamydia_rate                 chlamydia rate
teen_birth_rate                teen birth rate
total_cases                    accumulated number of COVID-19 cases
deaths_per_100k                number of COVID-19 deaths per 100k population
cases_per_100k                 number of COVID-19 cases per 100k population

Analysis

A multiple linear regression model with interaction terms was used to quantify the association of the socioeconomic features with COVID-19 prevalence. The R programming language (R Core Team 2019) and the following R packages were used to perform the analysis: broom (Robinson, Hayes, and Couch 2021), docopt (de Jonge 2018), knitr (Xie 2014), tidyverse (Wickham 2017), testthat (Wickham 2011), and here (Müller 2020). The code used to perform the analysis and create this report can be found here.

There were no missing values in the processed data set, so imputation was not required. All numeric variables were scaled so that each feature has a mean of zero and a standard deviation of 1. Interaction terms were included in the model to account for non-linear relationships between COVID-19 prevalence and the features.
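As a small illustration of the scaling step, the sketch below standardises one feature to zero mean and unit standard deviation. It is written in Python for illustration only (the analysis used R), the `standardise` helper is a hypothetical name, and the use of the sample standard deviation (n - 1 denominator) is an assumption. The six input values are the percent_smokers entries of the six highest-prevalence counties reported in the Results section.

```python
from statistics import mean, stdev


def standardise(values):
    """Return z-scores: subtract the mean, divide by the standard deviation."""
    centre = mean(values)
    spread = stdev(values)  # assumption: sample standard deviation (n - 1)
    return [(value - centre) / spread for value in values]


# percent_smokers for the six highest-prevalence counties in the report.
percent_smokers = [18.075, 16.292, 19.135, 17.805, 18.495, 19.610]
z_scores = standardise(percent_smokers)

print([round(z, 3) for z in z_scores])
print("mean:", round(mean(z_scores), 6), "sd:", round(stdev(z_scores), 6))
```

After scaling, each coefficient is read as the expected change in cases per 100,000 for a one standard deviation change in that feature, which puts coefficients of differently scaled features on a comparable footing.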

Results

Exploratory Data Analysis (EDA)

Exploratory data analysis was carried out to examine the distributions of the data and to get early hints of how certain socioeconomic features might be associated with COVID-19 prevalence. We first created a summary table to check COVID-19 prevalence for each county.

COVID-19 prevalence for every county

Table 5. The 6 counties with the highest cumulative COVID-19 cases per 100,000.
county       cases_per_100k  percent_smokers  income_ratio  percent_unemployed_CHR
Franklin     56251.54        18.075           4.439         3.971
Brown        42577.30        16.292           4.050         3.377
Marion       40167.81        19.135           4.664         4.268
Union        39026.50        17.805           4.674         4.327
Clark        36860.96        18.495           4.168         4.171
Wayne        33275.47        19.610           4.610         4.823

Table 6. The 6 counties with the lowest cumulative COVID-19 cases per 100,000.
county       cases_per_100k  percent_smokers  income_ratio  percent_unemployed_CHR
Sagadahoc    483.862         13.591           3.946         2.711
Rutland      445.679         14.580           4.423         3.131
Windsor      406.126         13.396           4.351         2.337
Piscataquis  334.429         18.901           4.650         4.165
Aroostook    249.262         18.593           5.049         4.813
Kauai        170.341         12.738           4.417         2.512

Distributions of numeric features

Density plots for all numeric variables were also created to check their distributions. There is a positive skew for many of the variables.

Figure 1. Density plots of numeric features

Most of the features have right-skewed distributions. The density plot of percent_smokers has two peaks, and population_density_per_sqmi has a small cluster around 1000 but is otherwise spread fairly evenly across its range of values.

Relationships between cumulative COVID-19 cases per 100,000 of each county and socioeconomic features

Plots of COVID-19 cases per 100,000 against each socioeconomic feature were created, with one point per county. Individually, the linear relationships are not strong. However, this could be because each feature is observed in isolation: there might be interactions between these features that have a linear relationship with COVID-19 prevalence.

Figure 2. Plots of COVID-19 cases per 100,000 against features

Multiple linear regression model

The model contains 45 terms, including all the interaction terms. A random subset of 10 terms is shown below with the corresponding coefficient estimates, 95% confidence intervals, p-values, and whether each term is significant at the 0.05 level.
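One way to arrive at 45 terms is to take the nine socioeconomic predictors from the features table together with every pairwise interaction between them (9 + 36 = 45, not counting the intercept). The report does not state the exact model formula, so treating the model as "all main effects plus all two-way interactions" is an assumption; the Python sketch below (illustrative only, since the analysis used R) simply enumerates the terms under that assumption.

```python
from itertools import combinations

# The nine socioeconomic predictors, in the order of the features table.
predictors = [
    "percent_smokers",
    "percent_vaccinated",
    "income_ratio",
    "population_density_per_sqmi",
    "percent_fair_or_poor_health",
    "percent_unemployed_CHR",
    "violent_crime_rate",
    "chlamydia_rate",
    "teen_birth_rate",
]

# Main effects: one term per predictor.
main_terms = list(predictors)

# Two-way interactions: one term per unordered pair, written as "a:b".
interaction_terms = [f"{a}:{b}" for a, b in combinations(predictors, 2)]

all_terms = main_terms + interaction_terms

print(len(main_terms), len(interaction_terms), len(all_terms))
```

The "a:b" naming mirrors the term labels that appear in the coefficient table, for example percent_smokers:percent_fair_or_poor_health.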

Table 9. Coefficients of some features of the multiple linear regression model.
term estimate conf.low conf.high p.value is_sig
percent_unemployed_CHR -1035.480119 -1360.17867 -710.78156 0.0000000 TRUE
percent_fair_or_poor_health:percent_unemployed_CHR 394.047408 42.66174 745.43308 0.0279792 TRUE
percent_smokers:percent_fair_or_poor_health -613.882978 -1068.95261 -158.81334 0.0082261 TRUE
population_density_per_sqmi:chlamydia_rate -240.525299 -630.69991 149.64932 0.2267829 FALSE
violent_crime_rate 211.739489 -118.18567 541.66465 0.2082769 FALSE
percent_vaccinated:percent_unemployed_CHR -236.785678 -492.00469 18.43333 0.0689788 FALSE
percent_vaccinated:percent_fair_or_poor_health -299.976440 -762.92609 162.97321 0.2039269 FALSE
population_density_per_sqmi:percent_unemployed_CHR 470.454185 -85.80346 1026.71183 0.0973327 FALSE
income_ratio:percent_unemployed_CHR -8.633577 -313.66957 296.40241 0.9557341 FALSE
percent_smokers:population_density_per_sqmi -368.353360 -1158.78317 422.07645 0.3608157 FALSE
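The is_sig column can be read in two equivalent ways: the p-value is below 0.05, or the 95% confidence interval does not contain zero. The short Python sketch below (illustrative only; the helper names are hypothetical) checks both readings against four rows copied from Table 9.

```python
ALPHA = 0.05

# (term, conf.low, conf.high, p.value, is_sig) copied from Table 9.
rows = [
    ("percent_unemployed_CHR", -1360.17867, -710.78156, 0.0000000, True),
    ("percent_fair_or_poor_health:percent_unemployed_CHR",
     42.66174, 745.43308, 0.0279792, True),
    ("percent_vaccinated:percent_unemployed_CHR",
     -492.00469, 18.43333, 0.0689788, False),
    ("violent_crime_rate", -118.18567, 541.66465, 0.2082769, False),
]


def significant_by_p_value(p_value, alpha=ALPHA):
    """A term is flagged when its p-value is below the significance level."""
    return p_value < alpha


def interval_excludes_zero(low, high):
    """A 95% interval lying entirely on one side of zero matches p < 0.05."""
    return low > 0 or high < 0


checks = [
    (term,
     significant_by_p_value(p_value),
     interval_excludes_zero(low, high),
     reported)
    for term, low, high, p_value, reported in rows
]

for term, by_p, by_ci, reported in checks:
    print(f"{term}: p-value rule={by_p}, interval rule={by_ci}, reported={reported}")
```

Note that percent_vaccinated:percent_unemployed_CHR narrowly misses the threshold: its interval only just crosses zero and its p-value is about 0.069.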

Coefficients of significant features of the multiple linear regression model with 95% confidence intervals

The coefficients along with their 95% confidence intervals are plotted as error bars for features that were significant at the 0.05 level.

Figure 4. Coefficients of significant features of MLR with 95% confidence intervals.

Discussion

The multiple linear regression results reveal that 7 features and the intercept term are statistically significant at the 5% significance level. Some of the significant features, such as the percentage of smokers, have a positive association with cases, which is plausible since smoking is a known risk factor for poor lung health. However, some significant features are unexpected, such as the teen birth rate. Surprisingly, we also see a strong negative association between percent_unemployed_CHR and COVID-19 cases. Interestingly, some features are significant only in interaction with one another, such as the interaction between violent crime rate and chlamydia rate. One possible explanation for their significance is that these features hint at larger socioeconomic problems that are difficult to measure and quantify.

To further improve our model in the future, techniques such as principal component analysis (PCA), feature engineering, and feature selection could be used to choose features in a more robust manner. Non-linear models such as a random forest could also be fitted and compared to see whether their feature importances are similar.

References

Chaudhry, Rabail, George Dranitsaris, Talha Mubashir, Justyna Bartoszko, and Sheila Riazi. 2020. “A Country Level Analysis Measuring the Impact of Government Actions, Country Preparedness and Socioeconomic Factors on COVID-19 Mortality and Related Health Outcomes.” EClinicalMedicine 25: 100464.
de Jonge, Edwin. 2018. Docopt: Command-Line Interface Specification Language. https://CRAN.R-project.org/package=docopt.
Müller, Kirill. 2020. Here: A Simpler Way to Find Your Files. https://CRAN.R-project.org/package=here.
R Core Team. 2019. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.
Robinson, David, Alex Hayes, and Simon Couch. 2021. Broom: Convert Statistical Objects into Tidy Tibbles. https://CRAN.R-project.org/package=broom.
Wickham, Hadley. 2011. “Testthat: Get Started with Testing.” The R Journal 3: 5–10. https://journal.r-project.org/archive/2011-1/RJournal_2011-1_Wickham.pdf.
———. 2017. Tidyverse: Easily Install and Load the ’Tidyverse’. https://CRAN.R-project.org/package=tidyverse.
Xie, Yihui. 2014. “Knitr: A Comprehensive Tool for Reproducible Research in R.” In Implementing Reproducible Computational Research, edited by Victoria Stodden, Friedrich Leisch, and Roger D. Peng. Chapman; Hall/CRC. http://www.crcpress.com/product/isbn/9781466561595.