Lecture 6 - Observational Data: Stratifying and Modelling#

Today’s Learning Objectives#

  1. Define a set of variables that control for confounding.

  2. Analyze stratified observational data.

  3. Interpret regression models fitted to observational data.

  4. Supply appropriate caveats for causal claims from fitting regression models to observational data.

  5. Comment on the differing goals of regression for prediction versus regression for explanation.

Loading R Packages#

options(repr.matrix.max.rows = 8, repr.matrix.max.cols = 10)
library(foreign)
library(tidyverse)
library(broom)
library(scales)

Where are we at?#

We have been devoting most of our time to randomized experiments (i.e., A/B or A/B/n testing). This is the ideal setting where we actually have control over the data collection and treatment randomization. Moreover, this class of statistical framework certainly allows us to draw causal conclusions on a given outcome of interest.

Recall that if we can randomize experimental units to the two groups of conditions being compared (e.g., control treatment A and experimental treatment B), then data analysis and causal interpretation can be clear. Hence, we can be confident in our A/B testing decisions. Furthermore, this treatment randomization setting will be crucial to control for confounding variables at a high level.

For example, as long as we have an adequate overall sample size (via Power Analysis!) and randomize properly, we can gather statistical evidence to state that website version B directly causes better sales than website version A. Consequently, we will be convinced that a permanent switch to website version B will be profitable.

Why are we allowed to make this class of statements? Because treatment randomization balances out all confounding variables, so that the two groups formed differ only in which website version they see.

For example, suppose all website visitors could hypothetically be labelled as either “fussy” or “easy-to-please.” Under treatment randomization, it is tremendously unlikely to end up with a much higher proportion of fussy people seeing website version A than version B.

Consequently, excuses like “A did not win because it was faced with a harder-to-please crowd” will not stand up.

1. Overview of Observational Studies#

We will shift gears to observational studies. Unlike in a randomized experiment, we can only gather data from the population of interest via a proper and representative sample. In an observational study, the researcher WILL NOT have control over the variable of interest \(X\).

1.1. Without randomization, life is much harder!#

For instance, if people self-select whether to see the A or B website version (such as in one of our examples in Lecture 2 - Confounding and Randomized versus Non-randomized Studies, where we needed to account for gender and age in our TikTok® Simulation of an observational study), you would really want to measure their fussiness level and use it when analyzing the data. This practice will allow you to obtain accurate and precise estimates of the effect of your variable \(X\) on the response \(Y\).

In general, the observational study strategy includes:

  1. To the extent possible, recording potential confounding variables as part of the data collection process.

  2. Using these potential confounding variables as part of the data analysis.

  3. Tempering the causal strength of claims in light of the inherent challenges in (1) and (2).

Furthermore, note that (1) requires careful planning of the data collection.

Exercise 20

Answer TRUE or FALSE:

Identifying potential confounding variables is not an integral part of careful data collection planning in an observational study.

A. TRUE

B. FALSE

1.2. Where are observational studies in a Data Science pipeline?#

Observational studies are heavily used in pharmaco-epidemiology where we can find massive healthcare databases for population health research. For instance, insurance companies and healthcare systems collect this class of data in the US.

On the other hand, countries like Canada use their national healthcare system to gather this data. You can check www.popdata.bc.ca for some rough flavour.

In addition to statistical methodology issues, this sort of work heavily involves:

  • Database and data wrangling issues (defining and extracting populations/samples/variables).

  • Privacy, ethics, and security (you already addressed these matters in DSCI 541). For a sense of how high the stakes are on this, you can google “BC health researchers fired.”

Therefore, we generally have an outcome \(Y\) (which could be continuous OR discrete) subject to a variable of interest \(X\) and many confounders \(C_j\) which will vary in their nature: continuous OR discrete.

For instance, in a pharmaco-epidemiology problem:

  • \(Y\) is a binary indicator of bad disease state two years after initial diagnosis.

  • \(X\) is a binary indicator of initiated drug A versus drug B within three months of initial diagnosis.

  • \(C_j\) is all sorts of demographic variables and health-status variables.

We aspire to use these observational data to determine what we would see if we could do a randomized trial with this \(X\) and \(Y\).

Warning

Merely addressing whether \(X\) is associated with \(Y\) IS NOT HELPFUL ANYMORE. In the previous pharmaco-epidemiology problem, the relevant policy question is: would fewer patients reach the bad disease state in a world where all were given drug B compared to a world where all were given drug A?

2. The Western Collaborative Group Study Data#

As in the case of A/B testing data, pharmaco-epidemiology data is hard to come by! Hence, we will work with a more traditional biostatistics example from Vittinghoff et al. (2012). This dataset is called the Western Collaborative Group Study (WCGS) data.

The WCGS data is a prospective cohort study that recruited middle-aged men between the ages of 39 and 59 who were employees of ten Californian companies (this is the population for which we will draw our conclusions!). The data was collected on 3154 subjects between 1960 and 1961.

Important

A prospective cohort study involves following \(n\) subjects over time. These subjects can be grouped by certain confounding variables; nonetheless, the groups will also vary in a given variable of interest (i.e., the \(X\) as a factor-type variable with certain levels of interest). We will assess whether changes from one level of \(X\) to another cause changes in a particular outcome \(Y\).

2.1. Data Wrangling#

The data wcgs.dta is in Stata format. We will read it with the function read.dta() from foreign and convert it to a tibble. We will eliminate the rows with missing data via na.omit() (missing data is not within the scope of today’s lecture!). We end up with \(n = 3,101\) subjects.

wcgs <- as_tibble(read.dta("../data/wcgs.dta"))
wcgs <- na.omit(wcgs)
wcgs
A tibble: 3101 × 22 (8 of 3101 rows and 10 of 22 columns shown)

| age | arcus | behpat | bmi | chd69 | typchd69 | uni | weight | wghtcat | agec |
|---|---|---|---|---|---|---|---|---|---|
| `<int>` | `<int>` | `<fct>` | `<dbl>` | `<fct>` | `<int>` | `<dbl>` | `<int>` | `<fct>` | `<fct>` |
| 50 | 1 | A1 | 31.32101 | No | 0 | 0.4860738 | 200 | 170-200 | 46-50 |
| 51 | 0 | A1 | 25.32858 | No | 0 | 0.1859543 | 192 | 170-200 | 51-55 |
| 59 | 1 | A1 | 28.69388 | No | 0 | 0.7277991 | 200 | 170-200 | 56-60 |
| 51 | 1 | A1 | 22.14871 | No | 0 | 0.6244636 | 150 | 140-170 | 51-55 |
| ⋮ |  |  |  |  |  |  |  |  |  |
| 41 | 0 | B4 | 25.03378 | No | 0 | 0.27816987 | 195 | 170-200 | 41-45 |
| 53 | 0 | B4 | 25.79944 | No | 0 | 0.95139700 | 185 | 170-200 | 51-55 |
| 54 | 0 | B4 | 23.49076 | No | 0 | 0.57082593 | 150 | 140-170 | 51-55 |
| 48 | 0 | B4 | 30.26990 | No | 0 | 0.08677829 | 205 | > 200 | 46-50 |

Our raw data wcgs has 22 demographic and health-related variables in total. However, our statistical modelling will focus on the following ones:

  • dibpat: Dichotomous behaviour pattern, a factor-type variable with two levels (Type A and Type B).

  • age: Subject’s age in years (a count-type variable).

  • bmi: Subject’s body mass index (BMI) in \(\text{kg}/\text{m}^2\) (a continuous-type variable).

  • chol: Subject’s cholesterol levels in \(\text{mg}/100 \text{ ml}\) (a continuous-type variable).

  • smoke: Whether the subject smokes or not, a factor-type variable with two levels (Yes and No).

  • chd69: Whether the subject experienced a coronary heart disease (CHD) event, a factor-type variable with two levels (Yes and No).

These variables are stored in training_data.

training_data <- wcgs %>%
  select(dibpat, age, bmi, chol, smoke, chd69)
training_data
A tibble: 3101 × 6 (8 of 3101 rows shown)

| dibpat | age | bmi | chol | smoke | chd69 |
|---|---|---|---|---|---|
| `<fct>` | `<int>` | `<dbl>` | `<int>` | `<fct>` | `<fct>` |
| Type A | 50 | 31.32101 | 249 | Yes | No |
| Type A | 51 | 25.32858 | 194 | Yes | No |
| Type A | 59 | 28.69388 | 258 | No | No |
| Type A | 51 | 22.14871 | 173 | No | No |
| ⋮ |  |  |  |  |  |
| Type B | 41 | 25.03378 | 180 | No | No |
| Type B | 53 | 25.79944 | 167 | No | No |
| Type B | 54 | 23.49076 | 242 | Yes | No |
| Type B | 48 | 30.26990 | 288 | No | No |

2.2. Main Statistical Inquiry#

We are interested in the following: does a Type A behaviour pattern (dibpat) LEAD to a CHD event (chd69)? From Brand et al. (1976), the levels of dibpat are defined as follows:

Methods for the assessment of the dichotomous behavior pattern by a structured psychological interview in the WCGS have been described elsewhere. Type A behavior is characterized by enhanced aggressiveness and competitive drive, preoccupation with deadlines, and chronic impatience and sense of time urgency, in contrast with the more relaxed and less hurried Type B behavior pattern.

Hence, we have an \(X\) variable, a \(Y\) variable, and different potential \(C_j\) variables. Which variables are we focusing on?

  • \(Y\) is a binary indicator of a coronary heart disease event (chd69).

  • \(X\) is a binary indicator of Type A behaviour (dibpat).

2.3. Exploratory Data Analysis#

Before proceeding with the statistical analysis, let us do a quick exploratory data analysis (EDA). Since we have two binary variables of interest, stacked bar charts by proportions are a suitable choice. The plot shows a slight increase in CHD in Type A behaviour (at least graphically!). Nevertheless, as usual, we need to dig deeper into the data via some suitable regression model!

options(repr.plot.height = 8, repr.plot.width = 14)

wcgs_prop_summary <- as.data.frame(xtabs(
  ~ dibpat + chd69,
  training_data
) / rowSums(xtabs(
  ~ dibpat + chd69,
  training_data
)), responseName = "prop")

wcgs_data_stacked_bars <- ggplot(wcgs_prop_summary, aes(x = dibpat, y = prop, fill = chd69)) +
  geom_bar(stat = "identity", width = 0.7, colour = "black", linewidth = 0.1) +
  geom_text(aes(label = ifelse(prop >= 0.05, paste0(sprintf("%.0f", prop*100),"%"),"")),
            position = position_stack(vjust = 0.5), colour = "firebrick3", fontface = "bold", size = 6) +
  scale_y_continuous(labels = percent_format()) +
  labs(y = "Percentage", x = "Dichotomous Behavior Pattern", fill = "") +
  ggtitle("Stacked Bar Charts")  +
  theme(plot.title = element_text(size = 24, face = "bold"),
        axis.text.x = element_text(size = 17, angle = 0),
        axis.text.y = element_text(size = 17, angle = 0),
        axis.title = element_text(size = 21),
        legend.text = element_text(size = 17, margin = margin(r = 1, unit = "cm")), 
        legend.title = element_text(size = 21, face = "bold")) +
  guides(fill = guide_legend(title = "Coronary Heart Disease Event")) + scale_fill_brewer(palette = "Blues")
wcgs_data_stacked_bars
../_images/67eb50b2b9fd59c4e26fe8bd04747f68e2418935dad187cad0592e3bdb48b86b.png

3. Key Statistical Concepts#

Let us mathematically define \(Y\) and \(X\) as an initial setup. Then, for the \(i\)th subject, the response chd69 is defined as:

\[\begin{split} Y_i = \begin{cases} 1 \; \; \; \; \mbox{if the $i$th subject experienced a CHD event},\\ 0 \; \; \; \; \mbox{otherwise.} \end{cases} \end{split}\]

The variable dibpat is defined as:

\[\begin{split} X_i = \begin{cases} 1 \; \; \; \; \mbox{if the $i$th subject has a Type A behaviour pattern},\\ 0 \; \; \; \; \mbox{otherwise.} \end{cases} \end{split}\]

Once we have defined these variables of interest, let us continue with some key statistical concepts.

3.1. Log-odds#

Recall that the “1” category in \(Y_i\) is referred to as a success. Note each \(Y_i\) is a Bernoulli trial whose probability of success is \(\pi_i\), i.e., \(Y_i \sim \text{Bernoulli}(\pi_i)\). We are already familiar with the log-odds:

\[ \log\left( \frac{\pi_i}{1 - \pi_i} \right). \]

This is the natural logarithm of the ratio of the probability of the event to the probability of the non-event (i.e., the odds!) in \(Y_i\). Moreover, in Binary Logistic regression, our regression coefficients can be interpreted in terms of the log-odds.

Important

The log-odds is to binary variables what correlation is to continuous variables.

3.2. Estimator of the Log-Odds for One Group#

In a simple case with a sample size \(n\), if we just have one variable (e.g., CHD or not), we can estimate the log-odds by just taking the logarithm of the ratio of the number of patients with CHD to those without CHD:

\[ \text{log-odds} = \log\left( \frac{n_{\text{yes}}/n}{n_{\text{no}}/n} \right) = \log\left( \frac{n_{\text{yes}}}{n_{\text{no}}} \right). \]
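For example, a minimal sketch of this estimator applied to the chd69 counts in training_data (assuming the wrangled tibble from Section 2.1 is in memory) is:

chd_counts <- table(training_data$chd69)  # counts of "No" and "Yes" CHD events
log(chd_counts["Yes"] / chd_counts["No"])  # estimated log-odds of a CHD event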

3.3. Estimator of the Log-Odds Ratio for Two Groups#

Now, let us add the variable \(X\) to our statistical metrics besides the outcome \(Y\). Suppose both \(X\) and \(Y\) are binary variables. Then, the absolute frequencies by each combination of factor levels can be put in a contingency table as follows:

|  | \(Y = 1\) | \(Y = 0\) |
|---|---|---|
| \(X = 1\) | \(n_{X = 1, Y = 1}\) | \(n_{X = 1, Y = 0}\) |
| \(X = 0\) | \(n_{X = 0, Y = 1}\) | \(n_{X = 0, Y = 0}\) |

Attention

A contingency table (i.e., cross-tabulation) is merely a way to summarize the absolute frequencies between two categorical variables.

The odds ratio (OR) measures the association (or, under suitable assumptions, the causal effect) between the binary exposure variable \(X\) and the binary outcome \(Y\). It compares the odds that the outcome \(Y\) happens when the exposure \(X\) is present to the odds that \(Y\) happens when the exposure \(X\) is absent. It can be represented as follows:

\[\text{OR} = \frac{\frac{n_{X = 1, Y = 1}/ n}{n_{X = 1,Y = 0} / n}}{\frac{n_{X = 0, Y = 1}/ n}{n_{X = 0,Y = 0} / n}} = \frac{\frac{n_{X = 1, Y = 1}}{n_{X = 1,Y = 0}}}{\frac{n_{X = 0, Y = 1}}{n_{X = 0,Y = 0}}} = \frac{n_{X = 1, Y = 1} \times n_{X = 0,Y = 0}}{n_{X = 1,Y = 0} \times n_{X = 0, Y = 1}}.\]

The OR can be interpreted as follows:

  • \(\text{OR} = 1\) indicates the binary exposure \(X\) DOES NOT AFFECT the odds of the binary outcome \(Y\).

  • \(\text{OR} > 1\) indicates the binary exposure \(X\) IS ASSOCIATED WITH (OR CAUSES!) higher odds of the binary outcome \(Y\).

  • \(\text{OR} < 1\) indicates the binary exposure \(X\) IS ASSOCIATED WITH (OR CAUSES!) lower odds of the binary outcome \(Y\).

Analogously to the estimator of the log-odds for one group, we can also have an estimator of the log-OR as follows:

\[\hat{\text{log-OR}} = \log \left( \frac{\frac{n_{X = 1, Y = 1}}{n_{X = 1,Y = 0}}}{\frac{n_{X = 0, Y = 1}}{n_{X = 0,Y = 0}}} \right) = \log \left( \frac{n_{X = 1, Y = 1}}{n_{X = 1,Y = 0}} \right) - \log \left( \frac{n_{X = 0, Y = 1}}{n_{X = 0,Y = 0}} \right)\]
\[\hat{\text{log-OR}} = \log(n_{X = 1, Y = 1}) - \log(n_{X = 1,Y = 0}) - \big[ \log(n_{X = 0, Y = 1}) - \log(n_{X = 0,Y = 0}) \big].\]

This estimator is approximately Normal with a large enough sample size \(n\). Its standard error (SE) is:

\[\text{SE} = \sqrt{\frac{1}{n_{X = 1, Y = 1}} + \frac{1}{n_{X = 1,Y = 0}} + \frac{1}{n_{X = 0, Y = 1}} + \frac{1}{n_{X = 0, Y =0}}}.\]

We can set up the following R functions, each taking a contingency table as input, following the previous formulas. The last one, logOR_z, computes the test statistic associated with the hypotheses:

\[H_0\text{: log-OR} = 0\]
\[H_a\text{: log-OR} \neq 0\]

Rejecting \(H_0\) at significance level \(\alpha\) gives statistical evidence that the binary exposure \(X\) affects the odds of the binary outcome \(Y\); failing to reject \(H_0\) only means we lack evidence of such an effect.

logOR_est <- function(table) {
  # Estimated log-OR from a 2 x 2 contingency table whose rows are X (reference level first)
  # and whose columns are Y (reference level first), so that table[2, 2] = n_{X = 1, Y = 1}, etc.
  log(table[2, 2]) - log(table[2, 1]) - (log(table[1, 2]) - log(table[1, 1]))
}

logOR_se <- function(table) {
  # Standard error: square root of the sum of the reciprocals of the four cell counts
  sqrt(sum(1 / table))
}

logOR_z <- function(table) {
  # Test statistic for H_0: log-OR = 0
  logOR_est(table) / logOR_se(table)
}
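If we also want a two-sided \(p\)-value for this test, a small helper (a sketch, not part of the lecture code) can convert the approximately standard Normal test statistic into one:

logOR_pvalue <- function(table) {
  # Two-sided p-value from the approximately standard Normal test statistic
  2 * pnorm(abs(logOR_z(table)), lower.tail = FALSE)
}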

3.4. Applying the log-OR to the WCGS Case#

Let us apply the concept of the log-OR to our training data for our two target variables: the exposure \(X\) called behaviour pattern (dibpat) and the outcome \(Y\) called CHD event (chd69). Function table() computes the corresponding contingency tables.

cont_table <- table(training_data$dibpat, training_data$chd69)
cont_table
        
           No  Yes
  Type B 1459   78
  Type A 1391  173

Then, we use the function logOR_est().

round(logOR_est(cont_table), 3)
0.844

We can obtain the interpretation of the OR by exponentiating the log-OR.

An \(OR > 1\) indicates the Type A behaviour pattern IS ASSOCIATED WITH (OR POSSIBLY CAUSES!) higher odds of having a CHD event.

round(exp(logOR_est(cont_table)), 3)
2.326

We will also obtain the corresponding SE and test statistic associated with the estimated log-OR.

round(logOR_se(cont_table), 3)
0.141
round(logOR_z(cont_table), 3)
5.969
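Since the estimated log-OR is approximately Normal, we can also sketch a 95% confidence interval for it and, by exponentiating its endpoints, one for the OR (a minimal illustration using the objects above):

# 95% CI for the log-OR: estimate +/- z_{0.975} * SE
logOR_ci <- logOR_est(cont_table) + qnorm(0.975) * c(-1, 1) * logOR_se(cont_table)
round(logOR_ci, 3)

# Exponentiate the endpoints to obtain a 95% CI for the OR
round(exp(logOR_ci), 3)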

Finally, it can be proved that the previous estimated log-OR, SE, and test statistic can be obtained with a Binomial Logistic regression model with a single binary regressor \(X\). We fit this model and call it initial_bin_log.

initial_bin_log <- glm(chd69 ~ dibpat, data = training_data, family = binomial)
tidy(initial_bin_log, conf.int = TRUE) %>% mutate_if(is.numeric, round, 3)
A tibble: 2 × 7

| term | estimate | std.error | statistic | p.value | conf.low | conf.high |
|---|---|---|---|---|---|---|
| `<chr>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| (Intercept) | -2.929 | 0.116 | -25.203 | 0 | -3.165 | -2.708 |
| dibpatType A | 0.844 | 0.141 | 5.970 | 0 | 0.571 | 1.126 |

Exercise 21

Can we infer causality between dibpat and chd69 with this initial_bin_log?

A. Yes.

B. No.

4. Stratified Analysis for Causality#

Stratification involves slicing our data based on different confounding variables. This will yield a stratified analysis for causality.

Attention

Often, one sees the term stratification as a short-hand for stratified sampling (i.e., collecting subsamples within each stratum). However, this will not be the case in this analysis.

4.1. Stratum-Specific Inference with a Single Confounder#

In a stratified analysis for causality, we can make bins with a given continuous confounder. Then, we will stratify our sampled subjects by each confounder bin according to their observed value. In an observational study, this is the PROXY to blocking in an experiment.

A fair stratifying practice with a continuous confounder is to build quartile strata, i.e., split the ordered data into four parts (as we do in boxplots!). We will do this for age in training_data.

training_data <- training_data %>%
  mutate(age_bins = cut(age, breaks = c(min(age), quantile(age, (1:3) / 4), max(age)), include.lowest = TRUE))
levels(training_data$age_bins)
  1. '[39,42]'
  2. '(42,45]'
  3. '(45,50]'
  4. '(50,59]'
head(training_data)
A tibble: 6 × 7

| dibpat | age | bmi | chol | smoke | chd69 | age_bins |
|---|---|---|---|---|---|---|
| `<fct>` | `<int>` | `<dbl>` | `<int>` | `<fct>` | `<fct>` | `<fct>` |
| Type A | 50 | 31.32101 | 249 | Yes | No | (45,50] |
| Type A | 51 | 25.32858 | 194 | Yes | No | (50,59] |
| Type A | 59 | 28.69388 | 258 | No | No | (50,59] |
| Type A | 51 | 22.14871 | 173 | No | No | (50,59] |
| Type A | 44 | 22.31303 | 214 | No | No | (42,45] |
| Type A | 47 | 27.11768 | 206 | Yes | No | (45,50] |

Then, as we did previously with the overall contingency table, we will build four stratum-specific contingency tables of the \(X\) dibpat versus the \(Y\) chd69, one per age stratum.

attach(training_data)
table(dibpat, chd69, age_bins)
, , age_bins = [39,42]

        chd69
dibpat    No Yes
  Type B 504  17
  Type A 432  30

, , age_bins = (42,45]

        chd69
dibpat    No Yes
  Type B 326  15
  Type A 262  21

, , age_bins = (45,50]

        chd69
dibpat    No Yes
  Type B 340  21
  Type A 332  49

, , age_bins = (50,59]

        chd69
dibpat    No Yes
  Type B 289  25
  Type A 365  73

Then, we obtain the estimated log-OR by age stratum.

log_OR_age <- matrix(round(apply(table(dibpat, chd69, age_bins), 3, logOR_est), 3), ncol = 4)
colnames(log_OR_age) <- levels(training_data$age_bins)
log_OR_age
A matrix: 1 × 4 of type dbl

| [39,42] | (42,45] | (45,50] | (50,59] |
|---|---|---|---|
| 0.722 | 0.555 | 0.871 | 0.838 |

If we exponentiate the previous estimates, we obtain the corresponding ORs by age stratum. Note that all ORs are larger than 1.

OR_age <- matrix(round(exp(apply(table(dibpat, chd69, age_bins), 3, logOR_est)), 3), ncol = 4)
colnames(OR_age) <- levels(training_data$age_bins)
OR_age
A matrix: 1 × 4 of type dbl

| [39,42] | (42,45] | (45,50] | (50,59] |
|---|---|---|---|
| 2.059 | 1.742 | 2.39 | 2.312 |

We also obtain the SEs and test statistics by age stratum.

sd_log_OR_age <- matrix(round(apply(table(dibpat, chd69, age_bins), 3, logOR_se), 3), ncol = 4)
colnames(sd_log_OR_age) <- levels(training_data$age_bins)
sd_log_OR_age
A matrix: 1 × 4 of type dbl

| [39,42] | (42,45] | (45,50] | (50,59] |
|---|---|---|---|
| 0.311 | 0.348 | 0.272 | 0.245 |

With \(\alpha = 0.05\) and a two-sided test

\[H_0\text{: log-OR} = 0\]
\[H_a\text{: log-OR} \neq 0,\]

our quantile cutoff value is \(z_{1 - \alpha/2} = 1.96\). Therefore, we have statistical evidence to reject \(H_0\) (recall \(H_0\) states that the binary exposure \(X\) DOES NOT AFFECT the odds of the binary outcome \(Y\)) for all age strata except (42,45].

test_log_OR_age <- matrix(round(apply(table(dibpat, chd69, age_bins), 3, logOR_z), 3), ncol = 4)
colnames(test_log_OR_age) <- levels(training_data$age_bins)
test_log_OR_age
A matrix: 1 × 4 of type dbl

| [39,42] | (42,45] | (45,50] | (50,59] |
|---|---|---|---|
| 2.325 | 1.594 | 3.203 | 3.424 |

What did we just do in this whole analysis?

Assuming age is the only possible confounder in this population, we take away its potential to bias our estimates of the effect of \(X\) on \(Y\). Ideally, this stratification technique will yield more accurate estimates in the form of log-ORs.

Note you can make different age strata if necessary. This will depend on what is specifically of interest for your main statistical inquiry.

4.2. Stratum-Specific Causal Inference with Multiple Confounders#

Now that we have done this stratified analysis by age, it is also possible to expand it to multiple confounders \(C_j\). Hence, let us incorporate bmi as another confounder in the form of four strata by quartile.

detach(training_data)
training_data <- training_data %>%
  mutate(bmi_bins = cut(bmi, breaks = c(min(bmi), quantile(bmi, (1:3) / 4), max(bmi)), include.lowest = TRUE))
levels(training_data$bmi_bins)
  1. '[11.2,23]'
  2. '(23,24.4]'
  3. '(24.4,25.8]'
  4. '(25.8,38.9]'

If we stratify by age group (4 levels), smoking status (2 levels), and BMI quartile (4 levels), we will end up with the following number of possible combinations of stratum levels:

\[4 \times 2 \times 4 = 32\]
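We can quickly verify this count in R (a small sketch using the factor levels just created):

# Number of possible strata: age bins x smoking status x BMI bins
nlevels(training_data$age_bins) * nlevels(training_data$smoke) * nlevels(training_data$bmi_bins)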
head(training_data)
A tibble: 6 × 8

| dibpat | age | bmi | chol | smoke | chd69 | age_bins | bmi_bins |
|---|---|---|---|---|---|---|---|
| `<fct>` | `<int>` | `<dbl>` | `<int>` | `<fct>` | `<fct>` | `<fct>` | `<fct>` |
| Type A | 50 | 31.32101 | 249 | Yes | No | (45,50] | (25.8,38.9] |
| Type A | 51 | 25.32858 | 194 | Yes | No | (50,59] | (24.4,25.8] |
| Type A | 59 | 28.69388 | 258 | No | No | (50,59] | (25.8,38.9] |
| Type A | 51 | 22.14871 | 173 | No | No | (50,59] | [11.2,23] |
| Type A | 44 | 22.31303 | 214 | No | No | (42,45] | [11.2,23] |
| Type A | 47 | 27.11768 | 206 | Yes | No | (45,50] | (25.8,38.9] |

Then, we build 32 stratum-specific contingency tables of the \(X\) dibpat versus the \(Y\) chd69 (one per stratum).

attach(training_data)

mult_cont_tables <- table(dibpat, chd69, age_bins, smoke, bmi_bins)

The object mult_cont_tables is multidimensional as follows:

dim(mult_cont_tables)
  1. 2
  2. 2
  3. 4
  4. 2
  5. 4

We can access a given contingency table with the syntax below (for instance, for stratum (45,50] in age, Yes in smoke, and (25.8,38.9] in bmi).

mult_cont_tables[ , , 3, 2, 4]
        chd69
dibpat   No Yes
  Type B 30   4
  Type A 36   5

Note that it is also possible to check how many subjects we have in each stratum.

# Subjects in each one of the strata
n_strata <- apply(mult_cont_tables, 3:5, sum)
n_strata
  1. 119
  2. 79
  3. 69
  4. 76
  5. 147
  6. 84
  7. 121
  8. 115
  9. 136
  10. 64
  11. 96
  12. 93
  13. 107
  14. 73
  15. 81
  16. 99
  17. 136
  18. 97
  19. 107
  20. 99
  21. 115
  22. 72
  23. 83
  24. 69
  25. 131
  26. 93
  27. 110
  28. 124
  29. 92
  30. 62
  31. 75
  32. 77

For each of these 32 strata, we can obtain the corresponding log-OR estimates, SEs, and test statistics. Note that some strata show NaN, Inf, or -Inf; this happens because their contingency tables have cells with zero counts, making the computations impossible.

Attention

We can already see we are running into issues. The more variables we stratify on, the smaller the subsample size within each stratum. Therefore, we might not have enough sampled subjects per stratum!
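As a quick diagnostic sketch (not part of the original analysis), we can flag which of the 32 strata contain at least one empty cell; these are exactly the strata whose log-OR estimates come out as NaN, Inf, or -Inf below:

# TRUE for strata whose 2 x 2 table contains at least one zero count
empty_cell_strata <- apply(mult_cont_tables, 3:5, function(tab) any(tab == 0))
sum(empty_cell_strata)  # how many strata are affected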

est_strata_or <- apply(mult_cont_tables, 3:5, logOR_est)
round(est_strata_or, 3)
  1. Inf
  2. NaN
  3. 1.634
  4. 0.611
  5. -0.294
  6. 0.583
  7. 0.182
  8. 0.4
  9. 1.099
  10. NaN
  11. 0.321
  12. 0.635
  13. 0.775
  14. 0.405
  15. 2.253
  16. 0.361
  17. 0.972
  18. -Inf
  19. -0.405
  20. 0.981
  21. 1.643
  22. 0.405
  23. 0.889
  24. 0.734
  25. 0.209
  26. 1.245
  27. 2.099
  28. 1.887
  29. 0.74
  30. 0.511
  31. 0.041
  32. 0.934
se_strata <- apply(mult_cont_tables, 3:5, logOR_se)
round(se_strata, 3)
  1. Inf
  2. Inf
  3. 1.183
  4. 1.247
  5. 0.75
  6. 0.94
  7. 0.786
  8. 0.577
  9. 1.168
  10. Inf
  11. 0.739
  12. 0.865
  13. 0.758
  14. 1.248
  15. 1.072
  16. 0.554
  17. 0.884
  18. Inf
  19. 0.934
  20. 0.71
  21. 1.097
  22. 0.694
  23. 0.627
  24. 0.857
  25. 0.836
  26. 0.865
  27. 1.079
  28. 0.771
  29. 0.892
  30. 0.779
  31. 0.715
  32. 0.632
# Which strata have no empty cells?
no_empty_strata <- (as.vector(se_strata) < Inf)
sum(no_empty_strata)
28

For those 28 strata where computations were possible, we will obtain their corresponding 95% confidence intervals (CIs). Plotting these CIs will identify how many log-OR estimates were statistically significant out of those 28.

CI_strata <- matrix(0.0, length(se_strata), 2)
sig_strata <- rep(FALSE, length(se_strata))

z_alpha <- qnorm(0.975)

# Compute 95% CIs
for (i in 1:length(no_empty_strata)) {
  if (no_empty_strata[i]) {
    CI_strata[i, ] <- est_strata_or[i] + z_alpha * c(-1, 1) * se_strata[i]
    sig_strata[i] <- ((CI_strata[i, 1] > 0) || (CI_strata[i, 2] < 0))
  }
}
CI_strata
A matrix: 32 × 2 of type dbl (first 4 and last 4 rows shown)

 0.0000000  0.000000
 0.0000000  0.000000
-0.6844251  3.952686
-1.8330703  3.054888
⋮
-1.0095793  2.488914
-1.0166987  2.038350
-1.3603961  1.442040
-0.3042743  2.172893

The plot below shows the 95% CIs for the 28 estimated log-ORs (i.e., those strata whose contingency tables had absolute frequencies different from zero across all their cells). We arrange the CIs from left to right in increasing order by stratum size (i.e., number of subjects by stratum). The dots correspond to the estimated log-ORs. We colour the CIs by significance status (i.e., whether the interval includes zero or not).

strata_log_OR_95_CI <- data.frame(
  logORest = est_strata_or[no_empty_strata],
  no_empty_strata = n_strata[no_empty_strata],
  CI = CI_strata[no_empty_strata, ], Significant = sig_strata[no_empty_strata]
) %>%
  ggplot(aes(x = no_empty_strata, y = logORest, colour = Significant)) +
  geom_pointrange(aes(ymin = CI.1, ymax = CI.2)) +
  ylab("Estimated log-OR") +
  xlab("Stratum Size") +
  geom_hline(yintercept = 0, linetype = "dotted", linewidth = 1.3) +
  theme(
    plot.title = element_text(size = 24, face = "bold"),
    axis.text = element_text(size = 17),
    axis.title = element_text(size = 21),
    legend.position = "right",
    legend.title = element_text(size = 21, face = "bold", margin = margin(r = 1, unit = "cm")), 
    legend.text = element_text(size = 17, margin = margin(r = 1, unit = "cm")),
    legend.key.size = unit(2, 'cm')
  ) +
  ggtitle("95% Confidence Intervals for Estimates of Log-OR by Stratum") +
  scale_color_brewer(palette = "Dark2")
strata_log_OR_95_CI
../_images/0b36081a716bfc8db8e95943f991fcfbf7eac129a012f2b0550b3306582e2690.png

What is the main takeaway from this CI plot?

Only two strata had significant results in terms of the estimated log-OR! Moreover, we can see some other strata were close to being significant (but not yet!).
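Rather than eyeballing the plot, a one-line sketch using the significance flags computed above confirms this count:

sum(sig_strata)  # number of strata whose 95% CI for the log-OR excludes zero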

Exercise 22

Is there anything invalid from a statistical standpoint? Or, if it is not invalid, is there anything that we might want to do differently?

4.3. Causal Inference via an Overall Binomial Logistic Regression#

The overall Binomial Logistic regression is a more clever approach. It will include the binary variables \(X\) and \(Y\) and the corresponding confounders \(C_j\) AS STRATA. This is our proxy for blocking in a randomized experiment! Note all regressors are modelled as standalone variables (i.e., with no interaction terms).

bin_log_model <-  glm(chd69 ~ dibpat + age_bins + bmi_bins + smoke, family = "binomial")
tidy(bin_log_model, conf.int = TRUE) %>% 
  mutate(exp.estimate = exp(estimate)) %>%
  mutate_if(is.numeric, round, 3)
A tibble: 9 × 8

| term | estimate | std.error | statistic | p.value | conf.low | conf.high | exp.estimate |
|---|---|---|---|---|---|---|---|
| `<chr>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| (Intercept) | -4.241 | 0.246 | -17.215 | 0.000 | -4.739 | -3.772 | 0.014 |
| dibpatType A | 0.730 | 0.144 | 5.084 | 0.000 | 0.452 | 1.015 | 2.075 |
| age_bins(42,45] | 0.190 | 0.230 | 0.826 | 0.409 | -0.266 | 0.638 | 1.209 |
| age_bins(45,50] | 0.687 | 0.197 | 3.483 | 0.000 | 0.304 | 1.079 | 1.989 |
| ⋮ |  |  |  |  |  |  |  |
| bmi_bins(23,24.4] | 0.413 | 0.208 | 1.981 | 0.048 | 0.006 | 0.826 | 1.511 |
| bmi_bins(24.4,25.8] | 0.569 | 0.205 | 2.778 | 0.005 | 0.171 | 0.975 | 1.766 |
| bmi_bins(25.8,38.9] | 0.796 | 0.199 | 4.008 | 0.000 | 0.412 | 1.192 | 2.217 |
| smokeYes | 0.704 | 0.139 | 5.050 | 0.000 | 0.433 | 0.979 | 2.021 |

Using the same log-OR plot with the 95% CIs, we illustrate the 95% CI for the regression coefficient estimate dibpatType A (as dashed horizontal blue lines). The estimate corresponds to the solid horizontal blue line (it is significant!). Comparing this blue 95% CI with the stratum-specific 95% CIs, we see that the overall Binomial Logistic regression model gives a more precise estimate of the effect of \(X\).

strata_log_OR_95_CI + geom_hline(yintercept = bin_log_model$coeff[2], colour = "blue") +
  geom_hline(yintercept = confint(bin_log_model)[2, ], linetype = "dashed", colour = "blue", linewidth = 1)
Waiting for profiling to be done...
../_images/0e5e5d6d297ede01a4926a6e40bcb5c613b16fe3291f6d5b7b6ed971f9444cc6.png

If we REALLY want to put a causal spin on our estimate for the \(X\) coefficient, what must we assume?

By controlling for possible confounders such as age, BMI, and smoking status in our regression model, we are assuming that no matter what age, smoking status, or BMI the subjects in this population of interest have, the effect of \(X\) on \(Y\) is the same for all of them:

Being a middle-aged man between the ages of 39 and 59 (and an employee of one of these Californian companies) with a Type A behaviour pattern (when compared to Type B) LEADS to odds of experiencing a CHD event that are 2.075 times as large.
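As a short sketch of where the 2.075 figure comes from, we can exponentiate the dibpatType A coefficient (and, if desired, the endpoints of its 95% CI, reported on the log-odds scale in the table above) directly from bin_log_model:

# Adjusted OR for Type A versus Type B behaviour
exp(coef(bin_log_model)["dibpatType A"])

# 95% CI for this OR (exponentiated endpoints of the coefficient's profile CI)
exp(confint(bin_log_model)["dibpatType A", ])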

However, unlike a stratum-specific inference with multiple confounders, this class of causal conclusions must take various assumptions into account. Hence, bearing in mind that we fit

glm(chd69 ~ dibpat + age_bins + bmi_bins + smoke, family = "binomial")

we will explore these assumptions.

4.4. Assumptions for the Causal Overall Binomial Logistic Regression#

The big idea of our bin_log_model is that this model is enough to assess causality of \(X\) on \(Y\). Nevertheless, this idea involves the following strong assumptions:

  1. There is a simple/smooth structure in how the log-odds of \(Y\) varies across the strata defined by the confounders (i.e., the confounder effects combine additively, with no interactions among the confounders).

  2. The strength of the \((X, Y)\) association within each of the 32 strata defined by the confounders is the same (i.e., we DO NOT HAVE any double interactions between \(X\) and each confounder).

  3. We have included all relevant confounding variables in our regression model; hence, there are only three confounders to consider: age, smoking status, and BMI.

4.4.1. Stress test of the FIRST assumption (simple structure)#

We will use model selection to assess this assumption. Since our bin_log_model is a maximum likelihood-based approach, we can then run a likelihood ratio test (LRT) to assess goodness of fit between two models, \(\text{model 1}\) and \(\text{model 2}\):

  • \(\text{model 1}\): It is a simpler model. This model is nested in \(\text{model 2}\).

  • \(\text{model 2}\): It is a more complex model.

The hypotheses are:

\[\begin{split}\begin{gather*} H_0 \text{: model 1 fits the data better than model 2} \\ H_a \text{: otherwise} \end{gather*}\end{split}\]

For this assumption, the complex model will be one with all possible interactions between the confounders up to the triple one:

triple_bin_log_model <- glm(chd69 ~ dibpat + age_bins * bmi_bins * smoke, family = "binomial")
tidy(triple_bin_log_model) %>% mutate_if(is.numeric, round, 3)
A tibble: 33 × 5 (first 4 and last 4 rows shown)

| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| `<chr>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| (Intercept) | -4.447 | 0.719 | -6.183 | 0.000 |
| dibpatType A | 0.732 | 0.144 | 5.069 | 0.000 |
| age_bins(42,45] | -13.404 | 440.895 | -0.030 | 0.976 |
| age_bins(45,50] | 1.315 | 0.882 | 1.491 | 0.136 |
| ⋮ |  |  |  |  |
| age_bins(50,59]:bmi_bins(24.4,25.8]:smokeYes | -0.384 | 1.281 | -0.300 | 0.764 |
| age_bins(42,45]:bmi_bins(25.8,38.9]:smokeYes | -13.273 | 440.896 | -0.030 | 0.976 |
| age_bins(45,50]:bmi_bins(25.8,38.9]:smokeYes | 1.387 | 1.292 | 1.074 | 0.283 |
| age_bins(50,59]:bmi_bins(25.8,38.9]:smokeYes | -0.109 | 1.250 | -0.088 | 0.930 |

Note the estimate for dibpatType A is very similar to the one from bin_log_model (both estimates and their standard errors are practically equal!):

tidy(bin_log_model) %>% mutate_if(is.numeric, round, 3)
A tibble: 9 × 5

| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| `<chr>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| (Intercept) | -4.241 | 0.246 | -17.215 | 0.000 |
| dibpatType A | 0.730 | 0.144 | 5.084 | 0.000 |
| age_bins(42,45] | 0.190 | 0.230 | 0.826 | 0.409 |
| age_bins(45,50] | 0.687 | 0.197 | 3.483 | 0.000 |
| ⋮ |  |  |  |  |
| bmi_bins(23,24.4] | 0.413 | 0.208 | 1.981 | 0.048 |
| bmi_bins(24.4,25.8] | 0.569 | 0.205 | 2.778 | 0.005 |
| bmi_bins(25.8,38.9] | 0.796 | 0.199 | 4.008 | 0.000 |
| smokeYes | 0.704 | 0.139 | 5.050 | 0.000 |

Then, we use the function anova() with the parameter test = "LRT". Note that we will be storing different \(p\)-values in vector p_values.

anova(bin_log_model, triple_bin_log_model, test = "LRT")  %>% mutate_if(is.numeric, round, 3)
p_values <- c(anova(bin_log_model, triple_bin_log_model, test = "LRT")$`Pr(>Chi)`[2]) # Storing p-value
A anova: 2 × 5

|  | Resid. Df | Resid. Dev | Df | Deviance | Pr(>Chi) |
|---|---|---|---|---|---|
| 1 | 3092 | 1628.780 | NA | NA | NA |
| 2 | 3068 | 1599.836 | 24 | 28.944 | 0.222 |

Given the \(p\)-value above for the LRT, our simple model suffices.

4.4.2. Stress tests of the SECOND assumption (no interactions between \(X\) and the confounders)#

For the second assumption, we will fit three other models. Each of them will include the terms of our simple model bin_log_model plus one double interaction between \(X\) and a given confounder. Then, we run the corresponding LRTs via anova(). Given the \(p\)-values below, our simple model suffices.

int_bin_log_model <-  glm(chd69 ~ dibpat + smoke + age_bins + bmi_bins + dibpat:age_bins, family = "binomial")
anova(bin_log_model, int_bin_log_model, test = "LRT")  %>% mutate_if(is.numeric, round, 3)
p_values <- c(p_values, anova(bin_log_model, int_bin_log_model, test = "LRT")$`Pr(>Chi)`[2]) # Storing p-value
A anova: 2 × 5

|  | Resid. Df | Resid. Dev | Df | Deviance | Pr(>Chi) |
|---|---|---|---|---|---|
| 1 | 3092 | 1628.780 | NA | NA | NA |
| 2 | 3089 | 1627.929 | 3 | 0.852 | 0.837 |
int_bin_log_model_2 <-  glm(chd69 ~ dibpat + smoke + age_bins + bmi_bins + dibpat:smoke, family = "binomial")
anova(bin_log_model, int_bin_log_model_2, test = "LRT") %>% mutate_if(is.numeric, round, 3)
p_values <- c(p_values, anova(bin_log_model, int_bin_log_model_2, test = "LRT")$`Pr(>Chi)`[2]) # Storing p-value
A anova: 2 × 5

|  | Resid. Df | Resid. Dev | Df | Deviance | Pr(>Chi) |
|---|---|---|---|---|---|
| 1 | 3092 | 1628.780 | NA | NA | NA |
| 2 | 3091 | 1627.647 | 1 | 1.133 | 0.287 |
int_bin_log_model_3 <-  glm(chd69 ~ dibpat + smoke + age_bins + bmi_bins + dibpat:bmi_bins, family = "binomial")
anova(bin_log_model, int_bin_log_model_3, test = "LRT") %>% mutate_if(is.numeric, round, 3)
p_values <- c(p_values, anova(bin_log_model, int_bin_log_model_3, test = "LRT")$`Pr(>Chi)`[2]) # Storing p-value
A anova: 2 × 5

|  | Resid. Df | Resid. Dev | Df | Deviance | Pr(>Chi) |
|---|---|---|---|---|---|
| 1 | 3092 | 1628.780 | NA | NA | NA |
| 2 | 3089 | 1627.111 | 3 | 1.67 | 0.644 |

4.4.3. Stress test of the THIRD assumption (no additional confounders are necessary)#

Suppose we want to use the subject’s cholesterol level (chol) as an additional confounder besides the three already included in the simple bin_log_model. We will create another set of quartile strata for chol and fit another model called chol_bin_log_model. Then, we run the corresponding LRT.

detach(training_data)

training_data <- training_data %>%
  mutate(chol_bins = cut(chol, breaks = c(min(chol), quantile(chol, (1:3) / 4), max(chol)), include.lowest = TRUE))
levels(training_data$chol_bins)
  1. '[103,197]'
  2. '(197,223]'
  3. '(223,253]'
  4. '(253,645]'
attach(training_data)

chol_bin_log_model <- glm(chd69 ~ dibpat + age_bins + smoke + bmi_bins + chol_bins, family = "binomial")
anova(bin_log_model, chol_bin_log_model, test = "LRT") %>% mutate_if(is.numeric, round, 3)
p_values <- c(p_values, anova(bin_log_model, chol_bin_log_model, test = "LRT")$`Pr(>Chi)`[2]) # Storing p-value
A anova: 2 × 5

|  | Resid. Df | Resid. Dev | Df | Deviance | Pr(>Chi) |
|---|---|---|---|---|---|
| 1 | 3092 | 1628.780 | NA | NA | NA |
| 2 | 3089 | 1580.257 | 3 | 48.523 | 0 |

Note that in this case, we reject \(H_0\) in our LRT. Hence we need to consider chol in our logistic regression model! Therefore, bin_log_model DOES NOT fulfil this assumption.

Testing the third assumption is challenging when assessing causation in observational studies involving many different confounders, such as the WCGS case. Hence, we can think about a practical solution: automating the model building process and selection (confounder by confounder) while testing the three assumptions.
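A hedged sketch of such an automation is shown below: loop over a vector of candidate (binned) confounders, fit the augmented model, and store the LRT \(p\)-value for later adjustment. The object names and the candidate list are illustrative only; chol_bins is the one concrete candidate built in this lecture, and any other candidate would first need to be binned from wcgs.

# Illustrative sketch: LRT p-value for adding one candidate confounder at a time
candidate_confounders <- c("chol_bins")  # hypothetical list; extend with other binned confounders

lrt_p_values <- sapply(candidate_confounders, function(conf) {
  # Augment the current model with the candidate confounder as strata
  augmented_model <- update(bin_log_model, as.formula(paste(". ~ . +", conf)))
  # p-value of the LRT comparing the current model against the augmented one
  anova(bin_log_model, augmented_model, test = "LRT")$`Pr(>Chi)`[2]
})

# These p-values could then be adjusted for multiple testing, e.g., via p.adjust()
lrt_p_values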

Attention

The above idea is a crucial part of your work as a Data Scientist when doing model building and selection in observational studies to detect causation. Moreover, this process involves multiple testing! Hence, you need to use the concepts of Lecture 1 - Multiple Comparisons. This is the reason we stored all the previous p_values.

adj_p_values <- matrix(data = p.adjust(p_values, method = "bonferroni"), nrow = 1, ncol = length(p_values))
adj_p_values
A matrix: 1 × 5 of type dbl
1  1  1  1  8.239119e-10

The last adjusted \(p\)-value (from left to right) corresponds to the LRT of bin_log_model versus chol_bin_log_model. We would need to proceed with chol_bin_log_model and restart the model checking from the first assumption. Testing the third assumption would then involve an additional confounder from wcgs.

5. Wrapping Up#

We have the following parting thoughts for today:

  • We have made a connection between stratification and regression.

  • In some contexts, we build regression models with prediction as the main goal.

  • In others (like today!), we build regression models with explanation as the main goal.

  • For causal interpretations, various assumptions must be met. These assumptions are pretty strong!

  • Some of them are testable/checkable (typically having to do with appropriate model specification issues like smoothness, lack of interactions between the variable of interest and confounders, etc.).

  • We almost always have at least one untestable assumption, e.g., no unobserved confounders.