Lab 1: Binary Logistic Regression

Setup

We will need to load the following packages before proceeding.
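
The original package list is not reproduced in this text. As a minimal sketch, assuming the lab relies on carData for the data, dplyr and ggplot2 for wrangling and plotting, and broom for model summaries (the last three are assumptions, not confirmed by the lab), the setup could look like this:

```r
# Assumed package set; the lab's original list is not shown in this text.
library(carData) # provides the TitanicSurvival data frame
library(dplyr)   # data wrangling verbs such as select() and mutate()
library(ggplot2) # plotting with geom_point(), geom_smooth(), etc.
library(broom)   # tidy() summaries of fitted models
```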

The Titanic Dataset

Let us start with a review of Lecture 8 from DSCI 561 on Binary Logistic regression. This warm-up exercise will help us refresh what we learned about this regression model in the previous block while implementing the Data Science-based steps from this course’s Lecture 1.

That said, let us use historical data and this generalized linear model (GLM) to address inferential inquiries. Specifically, we will use the Titanic dataset available via the package carData. According to the corresponding documentation, the data frame TitanicSurvival is described as follows:

Information on the survival status, sex, age, and passenger class of 1309 passengers in the Titanic disaster of 1912.

Note

passengerClass is a factor-type ordinal variable (i.e., its levels 1st, 2nd, and 3rd follow a hierarchy), which would require the proper use of contrasts in Regression Analysis if we used it as a regressor. We will not address this topic until DSCI 554; thus, we will not use this variable in the meantime.

Main Statistical Inquiries

Suppose you are a Ship Historian turned Data Scientist. Statistically speaking, you are interested in determining the following:

  • Was sex a significant factor in surviving the ship sinking? Can we quantify this association? If so, by how much?
  • Was age a significant factor in surviving the ship sinking? Can we quantify this association? If so, by how much?

Heads-up: Note these are inferential inquiries, not predictive.

Data Collection and Wrangling

Given the previous inquiries, we will specifically work on the following columns:

  • survived: Whether the passenger survived (yes) or not (no).
  • sex: The sex of the passenger: female or male.
  • age: The age of the passenger in years.

Important

Note this is a dataset from 1912. Hence, the variable sex (i.e., gender in the context of this case) was recorded as binary.

Thus, let us select the columns of interest:
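
A minimal sketch of this selection follows, assuming dplyr is used for wrangling and that the result overwrites TitanicSurvival in the workspace (both are assumptions; the lab's original code is not shown here). Note that age is not recorded for every passenger, so missing values remain in that column.

```r
library(carData)
library(dplyr)

# Keep only the response (survived) and the two regressors of interest.
# Assumption: the trimmed data frame keeps the name TitanicSurvival.
TitanicSurvival <- carData::TitanicSurvival %>%
  select(survived, sex, age)

# Quick look at the column types and first values.
glimpse(TitanicSurvival)
```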

Q1.1. Exploratory Data Analysis

Throughout our inferential courses in MDS, we will emphasize the importance of conducting a proper exploratory data analysis (EDA). Based on our main statistical inquiries, let us answer the following:

Q1.1.1. What is our response of interest?

Q1.1.2. What is the response’s nature?

Q1.1.3. What are our explanatory variables of interest?

Q1.1.4. What is the nature of the explanatory variables?

Answers

Available for MDS students.

Now, let us code suitable plots of age on the \(y\)-axis for each level of survived on the \(x\)-axis, faceted by sex.

Q1.1.5. What do you observe about the relationship of age and sex with survived?
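
One possible choice of plot is a set of side-by-side boxplots. The sketch below assumes ggplot2 and dplyr, drops passengers whose age was not recorded, and uses the hypothetical object names titanic_eda and age_boxplots; other displays (e.g., violin plots) would serve equally well.

```r
library(carData)
library(dplyr)
library(ggplot2)

# Assumption: passengers with unrecorded age are dropped before plotting.
titanic_eda <- carData::TitanicSurvival %>%
  select(survived, sex, age) %>%
  filter(!is.na(age))

# Age distribution by survival status, one panel per sex.
age_boxplots <- ggplot(titanic_eda, aes(x = survived, y = age)) +
  geom_boxplot() +
  facet_wrap(~sex) +
  labs(x = "Survived", y = "Age (years)")

age_boxplots
```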

Answer

Available for MDS students.

Q1.2. Data Modelling Framework

Before proceeding with the proper data modelling framework, suppose we err on the naive side and treat the categories of survived as probabilities (i.e., we recode no as 0 and yes as 1).

Let us use geom_point() to plot these “survivorship probabilities” from survived on the \(y\)-axis versus age on the \(x\)-axis, and use geom_smooth() to draw the estimated line of an Ordinary Least-squares (OLS) regression model on top.
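
A minimal sketch of this naive plot follows. It assumes dplyr and ggplot2, drops passengers with unrecorded age, and uses the hypothetical name titanic_numeric for the recoded data; only Titanic_scatterplot is a name taken from the lab.

```r
library(carData)
library(dplyr)
library(ggplot2)

# Naive recoding: "no" becomes 0 and "yes" becomes 1.
# Assumption: passengers with unrecorded age are dropped.
titanic_numeric <- carData::TitanicSurvival %>%
  select(survived, sex, age) %>%
  filter(!is.na(age)) %>%
  mutate(survived = if_else(survived == "yes", 1, 0))

# Scatterplot of the 0/1 response versus age with the OLS fitted line on top.
Titanic_scatterplot <- ggplot(titanic_numeric, aes(x = age, y = survived)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  labs(x = "Age (years)", y = "Survived (1 = yes, 0 = no)")

Titanic_scatterplot
```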

Q1.2.1. Based on the results in Titanic_scatterplot, would this be a good model to use? Why or why not?

Answer

Available for MDS students.

Given the previous question, let us set up a more adequate data modelling framework. Firstly, for \(i = 1, \dots, n\), let us mathematically define the response of interest in our training set of size \(n\):

\[ Y_i = \begin{cases} 1 \; \; \; \; \text{if the passenger survived},\\ 0 \; \; \; \; \mbox{otherwise.} \end{cases} \]

Furthermore, the \(i\)th response is assumed to be distributed as:

\[Y_i \sim \text{Bernoulli}(p_i),\]

where \(p_i\) is the survival probability of the \(i\)th passenger.
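
As a brief reminder of what this assumption implies (these are standard Bernoulli facts, stated here for convenience), the probability mass function, mean, and variance of the \(i\)th response are:

\[ P(Y_i = y_i) = p_i^{y_i} (1 - p_i)^{1 - y_i} \quad \text{for } y_i \in \{0, 1\}, \qquad \mathbb{E}(Y_i) = p_i, \qquad \text{Var}(Y_i) = p_i (1 - p_i). \]

Hence, modelling the mean of \(Y_i\) amounts to modelling a probability, which is restricted to the interval \([0, 1]\).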

Q1.2.2. Having defined the response (and its distributional assumption) and the explanatory variables in this case, what is the right modelling framework?

Answer

Available for MDS students.

Because OLS regression of survived versus age might be problematic, let us try the proposed model from Q1.2.2. First, we will overlay the fitted values of this proposed model on Titanic_scatterplot.

Q1.2.3. Graphically, what differences do you see between the OLS and your proposed regression model?
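
The sketch below rebuilds the scatterplot and overlays both fitted curves so they can be compared. It assumes the proposed model is the Binary Logistic regression reviewed at the start of this lab, fitted inside geom_smooth() through method = "glm" with a binomial family; the object names titanic_numeric and Titanic_both_fits are hypothetical.

```r
library(carData)
library(dplyr)
library(ggplot2)

# Recode the response as 0/1 and drop passengers with unrecorded age (assumption).
titanic_numeric <- carData::TitanicSurvival %>%
  select(survived, sex, age) %>%
  filter(!is.na(age)) %>%
  mutate(survived = if_else(survived == "yes", 1, 0))

# Same scatterplot, now with the OLS line and the logistic fitted curve.
Titanic_both_fits <- ggplot(titanic_numeric, aes(x = age, y = survived)) +
  geom_point() +
  geom_smooth(aes(colour = "OLS"),
    method = "lm", formula = y ~ x, se = FALSE
  ) +
  geom_smooth(aes(colour = "Binary Logistic"),
    method = "glm", formula = y ~ x, se = FALSE,
    method.args = list(family = binomial)
  ) +
  labs(x = "Age (years)", y = "Survived (1 = yes, 0 = no)", colour = "Model")

Titanic_both_fits
```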

Answer

Available for MDS students.

Q1.2.4. Having defined a more appropriate model in Q1.2.2, let us expand our set of regressors beyond the observed age, \(X_{i, \texttt{age}}\). Hence, we will include sex as follows:

\[ X_{i, \texttt{sex}} = \begin{cases} 1 \; \; \; \; \text{if the passenger is female},\\ 0 \; \; \; \; \mbox{otherwise.} \end{cases} \]

Important

Category male is the baseline level under this framework for \(X_{i, \texttt{sex}}\).

Now that we have defined the response of interest, our regressors, and a more adequate model, what is the training sample’s modelling equation? What is the main characteristic of its left-hand side?

Answer

Available for MDS students.

Q1.3. Estimation

It is time to estimate the aforementioned model of survived versus age and sex using the training set TitanicSurvival.

Important

Make sure that male is the baseline level in the regressor sex.

Answer
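
A minimal sketch of the estimation step follows, assuming base R's glm() with a binomial family and relevel() to set the baseline (the lab's original code is not shown here). By default, the levels of sex are ordered alphabetically, so female would be the baseline unless we change it. With a factor response, glm() treats the first level (no) as failure, so the model describes the probability of yes.

```r
library(carData)
library(dplyr)

# Select the columns of interest and make male the baseline level of sex.
TitanicSurvival <- carData::TitanicSurvival %>%
  select(survived, sex, age) %>%
  mutate(sex = relevel(sex, ref = "male"))

# Binary Logistic regression of survived on age and sex.
# Rows with missing age are dropped by glm()'s default NA handling.
titanic_model <- glm(survived ~ age + sex,
  data = TitanicSurvival,
  family = binomial
)

coef(titanic_model)
```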

Q1.4. Inference

Once we have fitted our regression model, it is time to assess whether age and sex are statistically significant factors in surviving the ship sinking (the first main inquiry!). Thus, we need to carry out the corresponding hypothesis tests using the output from titanic_model.

Answer
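
One way to obtain the test statistics and \(p\)-values is sketched below, assuming the broom package (an assumption; base R's summary() reports the same quantities). The model is refitted so the block runs on its own, and titanic_tests is a hypothetical name. For each regressor, the reported statistic is a Wald \(z\)-test of the null hypothesis that its coefficient equals zero.

```r
library(carData)
library(dplyr)
library(broom)

# Refit the model so this block is self-contained (male as baseline for sex).
TitanicSurvival <- carData::TitanicSurvival %>%
  select(survived, sex, age) %>%
  mutate(sex = relevel(sex, ref = "male"))

titanic_model <- glm(survived ~ age + sex,
  data = TitanicSurvival,
  family = binomial
)

# One row per coefficient: estimate, standard error, z statistic, and p-value.
titanic_tests <- tidy(titanic_model)
titanic_tests
```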

Now, state your conclusion with a significance level \(\alpha = 0.05\).

Answer

Available for MDS students.

Q1.5. Coefficient Interpretation

Using the fitted model stored in titanic_model, interpret the estimated coefficients associated with the significant explanatory variables (i.e., the second main inquiry). Use an adequate response scale.
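
As a hedged sketch of how to change the response scale, the block below exponentiates the estimated coefficients, which moves them from the log-odds scale to the odds scale. It assumes the broom package and its exponentiate and conf.int arguments to tidy(), refits the model so it runs on its own, and uses the hypothetical name titanic_odds. The interpretation itself is left to the reader.

```r
library(carData)
library(dplyr)
library(broom)

# Refit the model so this block is self-contained (male as baseline for sex).
TitanicSurvival <- carData::TitanicSurvival %>%
  select(survived, sex, age) %>%
  mutate(sex = relevel(sex, ref = "male"))

titanic_model <- glm(survived ~ age + sex,
  data = TitanicSurvival,
  family = binomial
)

# exponentiate = TRUE reports exp(estimate), i.e., multiplicative changes in odds;
# conf.int = TRUE adds confidence limits on that same scale.
titanic_odds <- tidy(titanic_model, exponentiate = TRUE, conf.int = TRUE)
titanic_odds
```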

Answer

Available for MDS students.