# Lecture 1 Depicting Uncertainty

*September 9, 2019*

Welcome to the course! The **syllabus** is on the README of the `_students` repo.

Today’s topics: probability, followed by distributions.

## 1.1 Lecture Learning Objectives

From today’s lecture, students are expected to be able to:

- Identify probability as a proportion that converges to the truth as you collect more data.
- Calculate probabilities using the inclusion-exclusion principle, the law of total probability, and probability distributions.
- Convert between and interpret odds and probability.
- Specify the usefulness of odds over probability.
- Be aware that probability has multiple interpretations/philosophies.
- Calculate and interpret mean, mode, entropy, variance, and standard deviation, from both a distribution and a sample.

(Hint: we make the quizzes based on lecture learning objectives)

## 1.2 Thinking about Probability

### 1.2.1 Defining Probability (5 min)

I like to play Mario Kart 8, a racing game with some “combat” involved using items. In the game, you are given an item at random whenever you get an “item box”.

Suppose you’re playing the game, and so far have gotten the following items in total:

Name | Count |
---|---|
Banana | 7 |
Bob-omb | 3 |
Coin | 37 |
Horn | 1 |
Shell | 2 |
**Total** | 50 |

Attribution: images from pngkey.

Questions that we’ll address:

- What’s the probability that your next item is a coin?
- How would you find the *actual* probability?
- From this, how might you define probability?

In general, the probability of an event \(A\) occurring is denoted \(P(A)\) and is defined as \[\frac{\text{Number of times event } A \text{ is observed}}{\text{Total number of events observed}}\] as the number of events goes to infinity.
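As a rough sketch of this definition, the snippet below first estimates the coin probability from the 50 observed items, then simulates many more item boxes to show the proportion settling down. The "true" probability of 0.75 passed to the simulation and the helper name `running_proportion` are assumptions made for illustration only.

```python
import random

# Observed counts from the table above.
counts = {"Banana": 7, "Bob-omb": 3, "Coin": 37, "Horn": 1, "Shell": 2}
total = sum(counts.values())

# Estimate based on the 50 items seen so far.
coin_estimate = counts["Coin"] / total


def running_proportion(true_p, n_draws, seed=0):
    """Simulate n_draws item boxes and return the proportion that were coins.

    true_p is a hypothetical true coin probability, used only for simulation.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_draws):
        hits += rng.random() < true_p  # True counts as 1, False as 0
    return hits / n_draws


print(coin_estimate)
for n in (50, 5_000, 500_000):
    print(n, running_proportion(0.75, n))
```

With more draws, the simulated proportion should sit closer and closer to the value used to generate the data.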

### 1.2.2 Calculating Probabilities using Logic

We’ll look at two laws for calculating probabilities of events. Suppose the table below shows the true probabilities of each item. Also, let’s add some properties to these items.

Name | Probability | Combat Type | Defeats blue shells |
---|---|---|---|
Banana | 0.12 | contact | no |
Bob-omb | 0.05 | explosion | no |
Coin | 0.75 | ineffective | no |
Horn | 0.03 | explosion | yes |
Shell | 0.05 | contact | no |

Disclaimer: I don’t think these are the true probabilities, but I’m pretty sure the coin probability is correct, as long as you’re in the lead.

#### 1.2.2.1 Law of Total Probability (5 min)

- According to this table, are there any other items possible? Why or why not?
- What’s the probability of getting something other than a coin? How did you arrive at that number?

Concept: When partitioning the *sample space* (= the set of all possibilities), the probabilities of each piece should add to one. That is, in this case, \[1 = P(\text{Banana}) + P(\text{Bob-omb}) + P(\text{Coin}) + P(\text{Horn}) + P(\text{Shell}).\]

A special case of this involves the *complement* of an event. This partitions the sample space into two – for example, getting a coin or not a coin. For a general event \(A\), the law becomes: \[1 = P(A) + P(\neg A),\] where \(\neg\) means the complement (read “not”).
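A small sketch of both statements, using the probabilities from the table above (the variable names are just for illustration): the five item probabilities sum to one, and the probability of "not a coin" can be found either from the complement or by adding up the other items.

```python
# Item probabilities from the table above.
probs = {"Banana": 0.12, "Bob-omb": 0.05, "Coin": 0.75, "Horn": 0.03, "Shell": 0.05}

# Law of total probability: the pieces of a partition add to one.
total = sum(probs.values())

# Complement rule: P(not Coin) = 1 - P(Coin).
p_not_coin = 1 - probs["Coin"]

# Same quantity, obtained by summing every item that is not a coin.
p_not_coin_direct = sum(p for item, p in probs.items() if item != "Coin")

print(round(total, 10), round(p_not_coin, 10), round(p_not_coin_direct, 10))
```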

#### 1.2.2.2 Inclusion-Exclusion (5 min)

Let’s answer these questions by counting:

1. What’s the probability of getting an item that has an explosion combat type?

2. What’s the probability of getting an item that is both an explosion item *and* defeats blue shells?

This is written \(P(\text{explosion} \cap \text{defeats blue shells})\), where \(\cap\) means “and”.

3. What’s the probability of getting an item that is an explosion item *or* an item that defeats blue shells?

This is written \(P(\text{explosion} \cup \text{defeats blue shells})\), where \(\cup\) means “or”.

In general, we can answer the third question with the *inclusion-exclusion* principle: for events \(A\) and \(B\), \[P(A \cup B) = P(A) + P(B) - P(A \cap B).\]

We can extend this to three events, too: \[P(A \cup B \cup C) = P(A) + P(B) + P(C) - P(A \cap B) - P(B \cap C) - P(A \cap C) + P(A \cap B \cap C).\]
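To check the two-event formula against the Mario Kart table, the sketch below computes each probability by summing over items, then compares the direct "or" calculation with inclusion-exclusion. The helper `prob` and the tuple layout are illustrative choices, not part of the course material.

```python
# Each item: (probability, combat type, defeats blue shells), from the table above.
items = {
    "Banana": (0.12, "contact", False),
    "Bob-omb": (0.05, "explosion", False),
    "Coin": (0.75, "ineffective", False),
    "Horn": (0.03, "explosion", True),
    "Shell": (0.05, "contact", False),
}


def prob(event):
    """Sum the probabilities of all items for which event(combat, defeats) is true."""
    return sum(p for p, combat, defeats in items.values() if event(combat, defeats))


p_explosion = prob(lambda combat, defeats: combat == "explosion")
p_defeats = prob(lambda combat, defeats: defeats)
p_both = prob(lambda combat, defeats: combat == "explosion" and defeats)

# "Or" computed directly, and via inclusion-exclusion.
p_either_direct = prob(lambda combat, defeats: combat == "explosion" or defeats)
p_either_formula = p_explosion + p_defeats - p_both

print(round(p_either_direct, 10), round(p_either_formula, 10))
```

Here the horn is the only item that defeats blue shells, and it is also an explosion item, so subtracting the intersection removes exactly the double-counted horn.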

### 1.2.3 Comparing Probabilities (8 min)

True or False (2-3 min):

Suppose Vincenzo often wins at a game of solitaire, but that Tom is twice as good as Vincenzo. This means that \(P(\text{Tom wins}) = 2 \times P(\text{Vincenzo wins})\).

Probability is quite useful for communicating the chance of an event happening in an absolute sense, but is less useful for comparing the chances of two events. Odds, on the other hand, are well suited to such comparisons. If \(p\) is the chance that Vincenzo wins at solitaire, his *odds of winning* are defined as \[\text{Odds} = \frac{p}{1-p}.\] This means that, if his odds are \(o\), then the probability of winning is \[\text{Probability} = \frac{o}{o+1}.\]

For example, if Vincenzo wins 80% of the time, his odds are \(0.8/0.2 = 4\). This is sometimes written as 4:1 odds – that is, *four wins for every loss*. If Tom is twice as good as Vincenzo, it’s *most useful* to say that this means Tom wins twice as many games before experiencing a loss (on average) – that is, 8:1 odds, or simply 8, and a probability of \(8/9=0.888\ldots\).
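The conversion in this example can be sketched in a few lines. The function names `odds_from_prob` and `prob_from_odds` are made up for this illustration.

```python
def odds_from_prob(p):
    """Convert a probability p (0 <= p < 1) to odds."""
    return p / (1 - p)


def prob_from_odds(o):
    """Convert odds o (o >= 0) back to a probability."""
    return o / (o + 1)


vincenzo_prob = 0.8
vincenzo_odds = odds_from_prob(vincenzo_prob)  # about 4, i.e. 4:1

# "Twice as good" interpreted as twice the odds.
tom_odds = 2 * vincenzo_odds                   # about 8, i.e. 8:1
tom_prob = prob_from_odds(tom_odds)            # about 8/9

# Doubling the probability instead would give a value above 1,
# which is not a valid probability.
naive_tom_prob = 2 * vincenzo_prob

print(vincenzo_odds, tom_odds, tom_prob, naive_tom_prob)
```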

### 1.2.4 Interpreting Probability (5 min)

Thought experiment:

- What’s the probability of seeing a 6 after rolling a die?
- I roll a die, and cover the outcome. What’s the probability of seeing a 6 after I uncover the face?

No philosophy is “wrong”! But why is this relevant in practice?

- It often doesn’t actually make sense to talk about *the* probability of an event, such as the probability that a patient has a particular disease. Instead, it’s a belief system that can be modified.
- It influences whether we choose a *Bayesian* or *Frequentist* analysis. More on this later in MDS.

## 1.3 Probability Distributions

So far, we’ve been discussing probabilities of single events. But it’s often useful to characterize the full “spectrum” of uncertainty associated with an outcome. The set of all outcomes and their corresponding probabilities is called a **probability distribution** (or, often, just **distribution**).

The outcome itself, which is uncertain, is called a **random variable**. (Note: technically, this definition only holds if the outcome is *numeric*, not categorical like our Mario Kart example, but we won’t concern ourselves with such details.)

When the outcomes are *discrete*, the distributions are called **probability mass functions** (or *pmf*’s for short).

### 1.3.1 Examples of Probability Distributions (3 min)

**Mario Kart Example**:

The distribution of items is given by the following table:

Name | Probability |
---|---|
Banana | 0.12 |
Bob-omb | 0.05 |
Coin | 0.75 |
Horn | 0.03 |
Shell | 0.05 |

The distribution of combat type is given by the following table:

Combat Type | Probability |
---|---|
contact | 0.17 |
explosion | 0.08 |
ineffective | 0.75 |

The distribution of defeating blue shells is given by the following table:

Defeats blue shells | Probability |
---|---|
no | 0.97 |
yes | 0.03 |
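The combat-type and blue-shell distributions follow from the item distribution by adding up the probabilities of the items that share each value. A minimal sketch of that bookkeeping, with illustrative variable names:

```python
from collections import defaultdict

# Each item: (probability, combat type, defeats blue shells), from the earlier table.
items = {
    "Banana": (0.12, "contact", False),
    "Bob-omb": (0.05, "explosion", False),
    "Coin": (0.75, "ineffective", False),
    "Horn": (0.03, "explosion", True),
    "Shell": (0.05, "contact", False),
}

combat_dist = defaultdict(float)
defeats_dist = defaultdict(float)

# Accumulate each item's probability into the outcome it belongs to.
for p, combat, defeats in items.values():
    combat_dist[combat] += p
    defeats_dist["yes" if defeats else "no"] += p

print({k: round(v, 2) for k, v in combat_dist.items()})
print({k: round(v, 2) for k, v in defeats_dist.items()})
```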

**Ship example (New)**:

Suppose a ship that arrives at the port of Vancouver will stay at port according to the following distribution:

Length of stay (days) | Probability |
---|---|
1 | 0.25 |
2 | 0.50 |
3 | 0.15 |
4 | 0.10 |

The fact that the outcome is *numeric* means that there are more ways we can talk about things, as we will see.

### 1.3.2 Measures of central tendency and uncertainty (3 min)

There are two concepts when communicating an uncertain outcome:

- **Central tendency**: a “typical” value of the outcome.
- **Uncertainty**: how “random” the outcome is.

There are many ways to *measure* these two concepts. They’re defined using a probability distribution, but just as probability can be defined as the limit of a fraction based on a sample, these measures often have a *sample version* (aka *empirical version*) from which they are derived.

As such, let’s call \(X\) the random outcome, and \(X_1, \ldots, X_n\) a set of \(n\) *observations* that form a *sample* (see the terminology page for alternative uses of the word *sample*).

#### 1.3.2.1 Mode and Entropy (5 min)

No matter what scale a distribution has, we can always calculate the mode and entropy. And, when the outcome is categorical (like the Mario Kart example), we are pretty much stuck with these as our choices.

The **mode** of a distribution is the outcome having highest probability.

- A measure of central tendency.
- The sample version is the observation you saw the most.
- Measured as an *outcome*, not as the probabilities.

The **entropy** of a distribution is defined as \[-\displaystyle \sum_x P(X=x)\log(P(X=x)).\]

- A measure of uncertainty.
- Probably the only measure that didn’t originate from a sample version (comes from information theory).
- Measured as a transformation of probabilities, not as the outcomes – so, hard to interpret on its own.
- Cannot be negative; zero-entropy means no randomness.
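As a sketch, here are the mode and entropy of the Mario Kart item distribution. The entropy formula above does not specify a base for the logarithm; the natural log is assumed here, and the function names are illustrative.

```python
import math

# Item distribution from the Mario Kart example.
item_probs = {"Banana": 0.12, "Bob-omb": 0.05, "Coin": 0.75, "Horn": 0.03, "Shell": 0.05}


def mode(dist):
    """Return the outcome with the highest probability."""
    return max(dist, key=dist.get)


def entropy(dist):
    """Entropy using the natural log (an assumption); zero-probability outcomes are skipped."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0)


print(mode(item_probs))     # an outcome, not a probability
print(entropy(item_probs))  # a number computed from the probabilities only

# A distribution with a single certain outcome has zero entropy.
print(entropy({"Coin": 1.0}))
```

Relabelling the items would change the mode but leave the entropy untouched, since entropy only looks at the probabilities.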

#### 1.3.2.2 Mean and Variance (10 min)

When our outcome is numeric, we can take advantage of the numeric property and calculate the *mean* and *variance*:

The **mean** (aka expected value, or expectation) is defined as \[\displaystyle \sum_x x\cdot P(X=x).\]

- A measure of central tendency, denoted \(E(X)\).
- Its sample version is \(\bar{X} = \frac{1}{n} \sum_{i=1}^n X_i,\) which gets closer and closer to the true mean as \(n \rightarrow \infty\) (this is in fact how the mean is originally defined!)
- Useful if you’re wanting to compare *totals* of a bunch of observations (just multiply the mean by the number of observations to get a sense of the total).
- Probably the most popular measure of central tendency.
- Note that the mean might not be a possible outcome!
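For instance, the mean length of stay in the ship example can be computed directly from the distribution and compared with the sample mean of simulated stays. The simulation (sample size, seed, and variable names) is an illustrative assumption, not data from the port.

```python
import random

# Length-of-stay distribution from the ship example (days -> probability).
stay_dist = {1: 0.25, 2: 0.50, 3: 0.15, 4: 0.10}

# Mean from the distribution: sum of x * P(X = x).
mean_stay = sum(x * p for x, p in stay_dist.items())

# Sample version: draw simulated stays and average them.
rng = random.Random(0)
sample = rng.choices(list(stay_dist), weights=list(stay_dist.values()), k=100_000)
sample_mean = sum(sample) / len(sample)

print(mean_stay, sample_mean)
print(mean_stay in stay_dist)  # the mean need not be a possible outcome
```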

The **variance** is defined as \[E[(X-E(X))^2],\] which works out to be equivalent to the (sometimes) more useful form \[E[X^2]-E[X]^2.\]

- A measure of uncertainty, denoted \(\text{Var}(X)\).
- Yes! This is an expectation – of the squared deviation from the mean.
- Its sample version is \(s^2 = \frac{1}{n-1} \sum_{i=1}^n (X_i - \bar{X})^2\), or sometimes \(s^2 = \frac{1}{n} \sum_{i=1}^n (X_i - \bar{X})^2\) – both get closer and closer to the true variance as \(n \rightarrow \infty\) (you’ll be able to compare the goodness of these at estimating the true variance in DSCI 552 next block).
- Like entropy, cannot be negative, and a zero variance means no randomness.

- Unlike entropy, depends on the actual values of the random variable.
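A short sketch confirming that the two forms of the variance agree for the ship example. The helper `expectation` is a name chosen for this illustration.

```python
# Length-of-stay distribution from the ship example (days -> probability).
stay_dist = {1: 0.25, 2: 0.50, 3: 0.15, 4: 0.10}


def expectation(dist, g=lambda x: x):
    """Return E[g(X)] for a discrete distribution given as {outcome: probability}."""
    return sum(g(x) * p for x, p in dist.items())


mean_stay = expectation(stay_dist)

# Definition: expected squared deviation from the mean.
var_definition = expectation(stay_dist, lambda x: (x - mean_stay) ** 2)

# Equivalent form: E[X^2] - E[X]^2.
var_shortcut = expectation(stay_dist, lambda x: x ** 2) - mean_stay ** 2

print(round(var_definition, 10), round(var_shortcut, 10))
```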

The **standard deviation** is the square root of the variance.

- Useful because it’s measured on the same scale as the outcome, as opposed to variance, which takes on squared outcome measurements.
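The sketch below computes the standard deviation of the ship distribution, then the two sample versions of the variance on a tiny made-up sample using Python's `statistics` module (`variance` divides by \(n-1\), `pvariance` divides by \(n\)). The sample values are hypothetical.

```python
import math
import statistics

# Length-of-stay distribution from the ship example (days -> probability).
stay_dist = {1: 0.25, 2: 0.50, 3: 0.15, 4: 0.10}
mean_stay = sum(x * p for x, p in stay_dist.items())
var_stay = sum((x - mean_stay) ** 2 * p for x, p in stay_dist.items())

# Standard deviation: back on the scale of the outcome (days, not days squared).
sd_stay = math.sqrt(var_stay)

# Hypothetical sample of four observed stays, for illustration only.
sample = [1, 2, 2, 3]
s2_n_minus_1 = statistics.variance(sample)  # divides by n - 1
s2_n = statistics.pvariance(sample)         # divides by n
s = statistics.stdev(sample)                # square root of the n - 1 version

print(sd_stay, s2_n_minus_1, s2_n, s)
```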

Note: you may have heard of the **median** – we’ll hold off on this until later.