Summary
Introduction
Methods
- Data
- Analysis
Results
- Exploratory Data Analysis (EDA)
- Multiple linear regression model
Discussion
References

Summary

We build a multiple linear regression model to quantify the association between socioeconomic factors and COVID-19 prevalence (measured as cases per 100,000 population) across US counties. Factors such as the percentage of smokers, income ratio, population density, and unemployment rate are explored. Our final regression model suggests that the percentage of smokers, teen birth rate, unemployment rate, and a few interaction terms are significantly associated with COVID-19 prevalence at the 0.05 level. However, the original data set contained over 200 features and the subset analysed here was chosen arbitrarily, so other socioeconomic features that are significantly associated with COVID-19 prevalence may remain unexplored.

Introduction

The COVID-19 pandemic has introduced a wide variety of challenges since 2019. At the country level, higher COVID-19 caseloads have been associated with higher obesity rates and a higher median population age (Chaudhry et al. 2020). By analysing the association of certain socioeconomic factors with COVID-19 prevalence, we hope to shed some light on the societal features that may be associated with a high number of COVID-19 cases. Identifying these socioeconomic factors could also help policymakers and leaders make more informed decisions in combating COVID-19.

Methods

Data

The original data set used in this project is the US social determinants of health by county data set, created by Dr. John Davis at Indiana University, United States. Each row represents one day for one county, with the cumulative number of COVID-19 cases and other socioeconomic features of that county. There are over 790,000 rows and over 200 features. We selected a subset of features of interest and added a few “wildcard” features, such as the teen birth rate and chlamydia rate, which might be related to broader social determinants of public health. In the future, additional features could be added as they become of interest to the team or are requested by the community.

The data set reports time series data per county for the cumulative COVID-19 cases and different socioeconomic features. However, due to limits in measurements and reporting, COVID-19 cases and socioeconomic features were updated at irregular intervals (e.g. COVID-19 cases were reported daily, whereas the socioeconomic features were reported no more than once a month). Thus, we created summary statistics for the socioeconomic features per county such as the mean percentage of smokers.
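The aggregation step described above can be sketched as follows. This is a minimal illustration in Python, not the R code used for the report; the row layout, the values, and the helper name `summarise_by_county` are assumptions made for the example.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical daily records: (county, day, cumulative_cases, percent_smokers).
# The values are made up for illustration; they are not from the data set.
daily_rows = [
    ("A", 1, 10, 18.0),
    ("A", 2, 15, 18.0),
    ("A", 3, 21, 19.0),
    ("B", 1, 3, 14.0),
    ("B", 2, 4, 14.5),
]


def summarise_by_county(rows):
    """Collapse daily rows into one summary row per county."""
    grouped = defaultdict(list)
    for county, _day, cases, smokers in rows:
        grouped[county].append((cases, smokers))

    summary = {}
    for county, values in grouped.items():
        cases = [c for c, _ in values]
        smokers = [s for _, s in values]
        summary[county] = {
            # Cases are cumulative, so the maximum is the latest total.
            "max_cases": max(cases),
            # Socioeconomic features are averaged over the reporting period.
            "mean_percent_smokers": mean(smokers),
        }
    return summary


county_summary = summarise_by_county(daily_rows)
```

Taking the maximum of a cumulative series and the mean of a slowly changing feature gives one row per county, which is the shape a cross-sectional regression needs.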

The processed data contains 1621 observations and 18 features; each row holds the cumulative COVID-19 cases and the aggregated socioeconomic features for one county. All missing values were removed during data wrangling.

Features table
Feature	Explanation
county	name of county
state	name of state
max_cases	maximum number of COVID-19 cases
avg_growth_rate	average COVID-19 case growth rate per day
max_growth_rate	maximum COVID-19 case growth rate per day
total_population	total population
num_deaths	total number of deaths
percent_smokers	percentage of smokers in the total population
percent_vaccinated	percentage vaccinated in the total population
income_ratio	income index
population_density_per_sqmi	population density per square mile
percent_fair_or_poor_health	percentage of the total population in fair or poor health
percent_unemployed_CHR	unemployment rate as measured by CHR
violent_crime_rate	violent crime rate
chlamydia_rate	chlamydia rate
teen_birth_rate	teen birth rate
total_cases	cumulative number of COVID-19 cases
deaths_per_100k	number of COVID-19 deaths per 100,000 population
cases_per_100k	number of COVID-19 cases per 100,000 population
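
The per-100,000 columns rescale a raw count by population. A minimal Python sketch is shown here, assuming the rate is computed from a count and `total_population`; the function name `per_100k` and the example numbers are hypothetical.

```python
def per_100k(count, population):
    """Scale a raw count to a rate per 100,000 residents."""
    if population <= 0:
        raise ValueError("population must be positive")
    # Multiply before dividing to limit floating-point rounding.
    return count * 100_000 / population


# Hypothetical county: 1,250 cumulative cases among 50,000 residents.
rate = per_100k(1_250, 50_000)
```

Expressing cases as a rate makes counties of very different sizes comparable, which is why the rate is used as the response instead of the raw count.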

Analysis

A multiple linear regression model with interaction terms was used to quantify the association of the socioeconomic features with COVID-19 prevalence. The R programming language (R Core Team 2019) and the following R packages were used to perform the analysis: broom (Robinson, Hayes, and Couch 2021), docopt (de Jonge 2018), knitr (Xie 2014), tidyverse (Wickham 2017), testthat (Wickham 2011), and here (Müller 2020). The code used to perform the analysis and create this report can be found here.

Because missing values were removed during data wrangling, the processed data set contains none and no imputation was required. All numeric variables were scaled to have a mean of zero and a standard deviation of 1. Interaction terms were included in the model to account for non-additive relationships, where the association between one feature and COVID-19 prevalence depends on the value of another feature.
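
The scaling and interaction steps can be illustrated with a short Python sketch. It is not the authors' R code: the helper names, the toy values, and the choice to form interactions as products of already-scaled columns are all assumptions. The final line also assumes nine numeric predictors with every pairwise interaction, which is one way to arrive at the 45 terms reported in the Results.

```python
from itertools import combinations
from statistics import mean, stdev


def standardise(values):
    """Centre to mean 0 and scale to sample standard deviation 1."""
    m = mean(values)
    s = stdev(values)
    return [(v - m) / s for v in values]


def add_pairwise_interactions(columns):
    """Return the main-effect columns plus one product column per pair."""
    design = dict(columns)
    for a, b in combinations(columns, 2):
        design[f"{a}:{b}"] = [x * y for x, y in zip(columns[a], columns[b])]
    return design


# Toy values for three of the report's features (not real data).
raw = {
    "percent_smokers": [12.0, 15.0, 18.0, 21.0],
    "income_ratio": [3.9, 4.4, 4.1, 5.0],
    "percent_unemployed_CHR": [2.5, 3.1, 4.2, 4.8],
}
scaled = {name: standardise(col) for name, col in raw.items()}
design = add_pairwise_interactions(scaled)

# With p main effects there are p + p * (p - 1) / 2 terms in total.
# Assumption: p = 9 numeric predictors, giving 9 + 36 terms.
n_terms_nine = 9 + len(list(combinations(range(9), 2)))
```

Standardising first puts all coefficients on a comparable scale, so the size of an estimate reflects the change in cases per 100,000 for a one standard deviation change in the feature.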

Results

Exploratory Data Analysis (EDA)

Exploratory data analysis was first carried out to examine the distributions of the data and to get early hints of how certain socioeconomic features might be associated with COVID-19 prevalence. First, we create a summary table to check COVID-19 prevalence for each county.

COVID-19 prevalence for every county

Table 5. The 6 counties with the highest cumulative COVID-19 cases per 100,000.
county	cases_per_100k	percent_smokers	income_ratio	percent_unemployed_CHR
Franklin	56251.54	18.075	4.439	3.971
Brown	42577.30	16.292	4.050	3.377
Marion	40167.81	19.135	4.664	4.268
Union	39026.50	17.805	4.674	4.327
Clark	36860.96	18.495	4.168	4.171
Wayne	33275.47	19.610	4.610	4.823

Table 6. The 6 counties with the lowest cumulative COVID-19 cases per 100,000.
	county	cases_per_100k	percent_smokers	income_ratio	percent_unemployed_CHR
1616	Sagadahoc	483.862	13.591	3.946	2.711
1617	Rutland	445.679	14.580	4.423	3.131
1618	Windsor	406.126	13.396	4.351	2.337
1619	Piscataquis	334.429	18.901	4.650	4.165
1620	Aroostook	249.262	18.593	5.049	4.813
1621	Kauai	170.341	12.738	4.417	2.512

Distributions of numeric features

Density plots for all numeric variables are also created to check their distributions. There is a positive skew for many of the variables.

Figure 1. Density plots of numeric features

Most of the feature distributions are right-skewed. The density of percent_smokers has two peaks, while population_density_per_sqmi has a small cluster around 1000 and is otherwise spread evenly across its range.

Relationships between cumulative COVID-19 cases per 100,000 of each county and socioeconomic features

Plots of COVID-19 cases per 100,000 against each socioeconomic feature are created, with each county as one observation. Individually, the linear relationships are not strong. However, this could be because each feature is observed in isolation: there might be interactions between these features that have a linear relationship with COVID-19 prevalence.
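
One way to put a number on how weak a single linear relationship is would be the Pearson correlation coefficient. The report does not state that correlations were computed, so the Python sketch below is only an illustration; the function name and the toy data are assumptions.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Toy data (not from the data set): a perfectly linear pair and a noisy pair.
feature = [1.0, 2.0, 3.0, 4.0, 5.0]
linear_response = [2.0, 4.0, 6.0, 8.0, 10.0]
noisy_response = [3.0, 1.0, 4.0, 1.0, 5.0]

r_linear = pearson_r(feature, linear_response)
r_noisy = pearson_r(feature, noisy_response)
```

A value near 0 for a single feature does not rule out an association that only appears once other features, or interactions with them, are included in the model.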

Figure 2. Plots of COVID-19 cases per 100,000 against features

Multiple linear regression model

The model contains 45 features, including all the interaction terms. A random subset of 10 features is shown with their coefficient estimates, 95% confidence intervals, p-values, and whether they are significant at the 0.05 level.

Table 9. Coefficients of some features of the multiple linear regression model.
term	estimate	conf.low	conf.high	p.value	is_sig
percent_unemployed_CHR	-1035.480119	-1360.17867	-710.78156	0.0000000	TRUE
percent_fair_or_poor_health:percent_unemployed_CHR	394.047408	42.66174	745.43308	0.0279792	TRUE
percent_smokers:percent_fair_or_poor_health	-613.882978	-1068.95261	-158.81334	0.0082261	TRUE
population_density_per_sqmi:chlamydia_rate	-240.525299	-630.69991	149.64932	0.2267829	FALSE
violent_crime_rate	211.739489	-118.18567	541.66465	0.2082769	FALSE
percent_vaccinated:percent_unemployed_CHR	-236.785678	-492.00469	18.43333	0.0689788	FALSE
percent_vaccinated:percent_fair_or_poor_health	-299.976440	-762.92609	162.97321	0.2039269	FALSE
population_density_per_sqmi:percent_unemployed_CHR	470.454185	-85.80346	1026.71183	0.0973327	FALSE
income_ratio:percent_unemployed_CHR	-8.633577	-313.66957	296.40241	0.9557341	FALSE
percent_smokers:population_density_per_sqmi	-368.353360	-1158.78317	422.07645	0.3608157	FALSE
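
The is_sig column and the confidence intervals carry the same information: a term with a p-value below 0.05 has a 95% confidence interval that excludes zero. The Python sketch below checks this on six rows copied from the table above; the variable and function names are illustrative.

```python
# (term, conf.low, conf.high, p.value) copied from the coefficient table.
rows = [
    ("percent_unemployed_CHR", -1360.17867, -710.78156, 0.0000000),
    ("percent_fair_or_poor_health:percent_unemployed_CHR",
     42.66174, 745.43308, 0.0279792),
    ("percent_smokers:percent_fair_or_poor_health",
     -1068.95261, -158.81334, 0.0082261),
    ("population_density_per_sqmi:chlamydia_rate",
     -630.69991, 149.64932, 0.2267829),
    ("violent_crime_rate", -118.18567, 541.66465, 0.2082769),
    ("percent_vaccinated:percent_unemployed_CHR",
     -492.00469, 18.43333, 0.0689788),
]

ALPHA = 0.05


def interval_excludes_zero(low, high):
    """True when the whole interval lies on one side of zero."""
    return low > 0 or high < 0


# For each term: (significant by p-value, significant by interval).
flags = {
    term: (p < ALPHA, interval_excludes_zero(low, high))
    for term, low, high, p in rows
}
agree = all(by_p == by_ci for by_p, by_ci in flags.values())
significant_terms = [term for term, (by_p, _) in flags.items() if by_p]
```

For these rows the two criteria agree, and the three significant terms are the ones marked TRUE in the table.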

Coefficients of significant features of the multiple linear regression model with 95% confidence intervals

The coefficients along with their 95% confidence intervals are plotted as error bars for features that were significant at the 0.05 level.

Figure 4. Coefficients of significant features of MLR with 95% confidence intervals.

Discussion

The multiple linear regression results reveal that 7 features and the intercept term are statistically significant at the 0.05 level. Some of the significant features, such as the percentage of smokers, have a positive association with cases, which is plausible given that smoking is a known risk factor for poor lung health. Other significant features, such as the teen birth rate, are unexpected. Surprisingly, there is also a strong negative association between percent_unemployed_CHR and COVID-19 cases. Some interaction terms are also significant, such as the interaction between violent crime rate and chlamydia rate. One possible explanation is that these features act as proxies for broader socioeconomic problems that are difficult to measure and quantify directly.

To further improve our model in the future, techniques such as PCA, feature engineering, and feature selection could be used to choose features in a more robust way. Non-linear models such as a random forest could also be fitted and compared to see whether the feature importances are similar.

References

Chaudhry, Rabail, George Dranitsaris, Talha Mubashir, Justyna Bartoszko, and Sheila Riazi. 2020. “A Country Level Analysis Measuring the Impact of Government Actions, Country Preparedness and Socioeconomic Factors on COVID-19 Mortality and Related Health Outcomes.” EClinicalMedicine 25: 100464.

de Jonge, Edwin. 2018. Docopt: Command-Line Interface Specification Language. https://CRAN.R-project.org/package=docopt.

Müller, Kirill. 2020. Here: A Simpler Way to Find Your Files. https://CRAN.R-project.org/package=here.

R Core Team. 2019. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.

Robinson, David, Alex Hayes, and Simon Couch. 2021. Broom: Convert Statistical Objects into Tidy Tibbles. https://CRAN.R-project.org/package=broom.

Wickham, Hadley. 2011. “Testthat: Get Started with Testing.” The R Journal 3: 5–10. https://journal.r-project.org/archive/2011-1/RJournal_2011-1_Wickham.pdf.

———. 2017. Tidyverse: Easily Install and Load the ’Tidyverse’. https://CRAN.R-project.org/package=tidyverse.

Xie, Yihui. 2014. “Knitr: A Comprehensive Tool for Reproducible Research in R.” In Implementing Reproducible Computational Research, edited by Victoria Stodden, Friedrich Leisch, and Roger D. Peng. Chapman and Hall/CRC. http://www.crcpress.com/product/isbn/9781466561595.

Report for US social determinants of health by county dataset

Joshua Sia, Morgan Rosenberg, Sufang Tan, Yinan Guo (Group 25)

2021/11/27