Here, we attempt to build a multiple linear regression model which is used to quantify the influence of socioeconomic factors on the COVID-19 prevalence (measured by cases per 100,000 population) among all US counties. Factors such as percentage of smokers, income ratio, population density, percent unemployed, etc. are explored. Our final regression model suggests that the percentage of smokers, teenage birth rates, unemployment rate, and a few interaction terms are significantly associated with COVID-19 prevalence at the 0.05 level. However, the original data set contained over 200 features and a subset of these features were chosen arbitrarily which means that there is still room to explore other socioeconomic features that are significantly associated with COVID-19 prevalence.
COVID-19 is a serious pandemic that has introduced a wide variety of challenges since 2019. Increasing COVID-19 caseloads were associated with countries with higher obesity, median population age (Chaudhry et al. 2020). By analysing the association of certain socioeconomic factors with COVID-19 prevalence, we hope to shed some light onto the societal features that may be associated with a high number of COVID-19 cases. Identifying the socioeconomic factors could also help policymakers and leaders make more informed decisions in combating COVID-19.
The original data set used in this project is of US social determinants of health by county created by Dr. John Davis at Indiana University, the United States. Each row in the original data set represents a day for a county with the cumulative number of COVID-19 cases, and other socioeconomic features of the county. There are over 790,000 rows and over 200 features in the data set. We identified a subset of these features which we were interested in and also added a few “wildcard” features such as the teen birth rate and chlamydia rate which might be related to broader social determinants of public health. In the future, additional features could be chosen as they become of interest to the team, or are requested by the community.
The data set reports time series data per county for the cumulative COVID-19 cases and different socioeconomic features. However, due to limits in measurements and reporting, COVID-19 cases and socioeconomic features were updated at irregular intervals (e.g. COVID-19 cases were reported daily, whereas the socioeconomic features were reported no more than once a month). Thus, we created summary statistics for the socioeconomic features per county such as the mean percentage of smokers.
The processed data contains 1621 observations and 18 features where each row corresponds to the cumulative COVID-19 cases, and aggregated socioeconomic features for each county. All missing values were removed during data wrangling.
Features | Explanation |
---|---|
county | name of county |
state | name of state |
max_cases | maximum number of Covid-19 case |
avg_growth_rate | average Covid-19 case growth rate per day |
max_growth_rate | maximum Covid-19 case growth rate per day |
total_population | total population |
num_deaths | total number of deaths |
percent_smokers | smokers percentage among total population |
percent_vaccinated | vaccinated percentage among total population |
income_ratio | income index |
population_density_per_sqmi | population density measured by per square meter |
percent_fair_or_poor_health | percentage of fair or poor health among total population |
percent_unemployed_CHR | unemployment rate measured by CHR |
violent_crime_rate | violent crime rate |
chlamydia_rate | chlamydia rate |
teen_birth_rate | teenager birth rate |
total_cases | accumulated number of Covid-19 case |
deaths_per_100k | number of Covid-19 deaths per 100k population |
cases_per_100k | number of Covid-19 cases per 100k population |
A multiple linear regression model with interaction terms was used to quantify the association of the socioeconomic features with COVID-19 prevalence. The R programming language (R Core Team 2019) and the following R packages were used to perform the analysis: broom (Robinson, Hayes, and Couch 2021), docopt (de Jonge 2018), knitr (Xie 2014), tidyverse (Wickham 2017), testhat(Wickham 2011), and here (Müller 2020). The code used to perform the analysis and create this report can be found here.
In terms of feature transformation, there were no missing values in the processed data set and thus, imputation was not required. Furthermore, for all the numeric variables, scaling was performed such that the feature has a mean of zero, and a standard deviation of 1. Interaction terms were included in the model to account for non-linear relationships between COVID-19 prevalence and the features.
Exploratory data analysis was first carried out to determine the distributions of data, as well as to get early hints into how certain socioeconomic features might be associated with COVID-19 prevalence. First, we create a summary table to check COVID-19 prevalence for each county.
county | cases_per_100k | percent_smokers | income_ratio | percent_unemployed_CHR |
---|---|---|---|---|
Franklin | 56251.54 | 18.075 | 4.439 | 3.971 |
Brown | 42577.30 | 16.292 | 4.050 | 3.377 |
Marion | 40167.81 | 19.135 | 4.664 | 4.268 |
Union | 39026.50 | 17.805 | 4.674 | 4.327 |
Clark | 36860.96 | 18.495 | 4.168 | 4.171 |
Wayne | 33275.47 | 19.610 | 4.610 | 4.823 |
county | cases_per_100k | percent_smokers | income_ratio | percent_unemployed_CHR | |
---|---|---|---|---|---|
1616 | Sagadahoc | 483.862 | 13.591 | 3.946 | 2.711 |
1617 | Rutland | 445.679 | 14.580 | 4.423 | 3.131 |
1618 | Windsor | 406.126 | 13.396 | 4.351 | 2.337 |
1619 | Piscataquis | 334.429 | 18.901 | 4.650 | 4.165 |
1620 | Aroostook | 249.262 | 18.593 | 5.049 | 4.813 |
1621 | Kauai | 170.341 | 12.738 | 4.417 | 2.512 |
Density plots for all numeric variables are also created to check their distributions. The is a positive skew for many of the variables.
Most of the distribution of features are right-skewed. The percent_smokers has two peaks in the density plot and population_density_per_sqmi has a small cluster around 1000, and evenly distributed along all possible values.
Plots to demonstrate the relationship between COVID-19 cases per 100,000 and socioeconomic features are created for each county. The linear relationships are not strong individually, however, this could be because each feature is observed in isolation. There might be interactions between these features which can have a linear relationship with COVID-19 prevalence.
There are 45 features including all the interaction terms. A subset of 10 features are selected randomly and their corresponding coefficient estimates, p-values and whether they are significant at the 0.05 level are shown.
term | estimate | conf.low | conf.high | p.value | is_sig |
---|---|---|---|---|---|
percent_unemployed_CHR | -1035.480119 | -1360.17867 | -710.78156 | 0.0000000 | TRUE |
percent_fair_or_poor_health:percent_unemployed_CHR | 394.047408 | 42.66174 | 745.43308 | 0.0279792 | TRUE |
percent_smokers:percent_fair_or_poor_health | -613.882978 | -1068.95261 | -158.81334 | 0.0082261 | TRUE |
population_density_per_sqmi:chlamydia_rate | -240.525299 | -630.69991 | 149.64932 | 0.2267829 | FALSE |
violent_crime_rate | 211.739489 | -118.18567 | 541.66465 | 0.2082769 | FALSE |
percent_vaccinated:percent_unemployed_CHR | -236.785678 | -492.00469 | 18.43333 | 0.0689788 | FALSE |
percent_vaccinated:percent_fair_or_poor_health | -299.976440 | -762.92609 | 162.97321 | 0.2039269 | FALSE |
population_density_per_sqmi:percent_unemployed_CHR | 470.454185 | -85.80346 | 1026.71183 | 0.0973327 | FALSE |
income_ratio:percent_unemployed_CHR | -8.633577 | -313.66957 | 296.40241 | 0.9557341 | FALSE |
percent_smokers:population_density_per_sqmi | -368.353360 | -1158.78317 | 422.07645 | 0.3608157 | FALSE |
The coefficients along with their 95% confidence intervals are plotted as error bars for features that were significant at the 0.05 level.
The multiple linear regression results reveal that 7 features and the intercept term are statistically significant at the 5% significance level. Some of the significant features such as the percentage of smokers have a positive association with cases, which is reasonable since it’s common sense that smoking is a disadvantage factor associated with lung health; however, there are also some features that are unexpected, such as the teen birth rate. Moreover, it’s surprisingly seen a strong negative association between percent_unemployed_CHR and Covid-19 cases. Interestingly, features also seem to interact with one another to become significant, such as the interaction between violent crime rate and chlamydia rate. One possible explanation for significance is that these features hint at larger socioeconomic problems that are difficult to measure and quantify.
To further improve our model in the future, techniques such as PCA, feature engineering, and feature selection can be implemented to select features in a more robust method. Other non-linear models such as a random forest can also be used and compared to see if the feature importances are similar or not.