Chapter 1 Mathematical Intuition

Markov Chains are cool! Hidden Markov Models are also cool, but require more preparation! In this section, we’ll go through conditional probabilities & set up the basis to study Hidden Markov Models by getting comfortable with chains first.



1.1 Conditional Probability & Bayes Rule

Basic Concepts

Probability, as a field, formalizes how we predict events with some equally beautiful & ugly notation, intuitive concepts, and complex mathematical principles. But the premise is simple: by ascribing a numeric value to the outcomes of an event, we can abstract the real world and study it with math.

The process of ascribing a numeric value to each outcome of an event is called mapping, & by mapping all of an event’s possible outcomes, we create a random variable.

NOTE: This can be confusing! The “random” part of the word doesn’t mean all outcomes have an equal chance of happening; really, it means that the event has multiple possible outcomes, each with its own probability.


Example: Weather & Temperature

Let’s say that the weather is the event, \(W\), whose only outcomes are sunny (\(w_s\)), rainy (\(w_r\)), or cloudy (\(w_c\)). By mapping each outcome to a probability, we can turn \(W\) into a random variable. Below, we list all the mappings in the probability mass function, \(p_{(W)}(w)\).

\[\begin{align*} p_{(W)}(w) &= \begin{cases} 0.5, & \text{for } w_s \\ 0.2, & \text{for } w_r \\ 0.3, & \text{for } w_c \\ 0.0, & \text{otherwise} \end{cases} \end{align*}\]


NOTE: The probabilities in a probability mass function must always sum to 1.
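
To make the mapping concrete, here’s a minimal Python sketch; the name weather_pmf & the outcome labels are mine, purely for illustration. It stores the pmf of \(W\) as a dictionary, checks that its probabilities sum to 1, & draws a sample from it.

```python
import random

# The pmf of W stored as a plain dictionary (labels are illustrative).
weather_pmf = {"sunny": 0.5, "rainy": 0.2, "cloudy": 0.3}

# Sanity check: the probabilities in a pmf must sum to 1.
assert abs(sum(weather_pmf.values()) - 1.0) < 1e-9

# Draw one realization of the random variable W.
outcome = random.choices(list(weather_pmf), weights=weather_pmf.values())[0]
print(outcome)  # e.g., "sunny"
```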


Now, let’s say that the daily temperature (F°) is the event \(T\). Normally, we could say that \(T\) follows a normal distribution with \(\mu = 75 \textbf{ F°}\) and \(\sigma^2 = 625\) (that is, \(\sigma = 25 \textbf{ F°}\)), since it’s a continuous variable. So, \(T \sim N(75, 625)\), whose density is a bell curve centered at \(75 \textbf{ F°}\).


Instead, let’s say that \(T\) is a discrete random variable whose only potential outcomes are cold (\(t_c\)), fresh (\(t_f\)), or hot (\(t_h\)). Then, \(T\) has the probability mass function

\[\begin{align*} p_{(T)}(t) &= \begin{cases} 0.155, & \text{for } t_c \\ 0.5, & \text{for } t_f \\ 0.345, & \text{for } t_h \\ 0.0, & \text{otherwise} \end{cases} \end{align*}\]


Below are two graphs summarizing what we know so far about \(W\) and \(T\).

But what happens if the weather depends on the temperature?



Conditional Probabilities

Let’s study two arbitrary events, \(A\) & \(B\), and learn some definitions about probability.

Conditional Probability: The conditional probabilities of \(A\) given \(B\) and of \(B\) given \(A\) are written below. \[P(A \mid B) \hspace{1 in}P(B \mid A)\]

Independence: We say that the two events, \(A\) & \(B\), are independent if the conditional probabilities provide us no new information about either event. So,

\[P(A \mid B) = P(A) \hspace{1 in}P(B \mid A) = P(B)\]

Joint Probability: The probability of two events, \(A\) & \(B\), happening at the same time is called a joint probability and is typically denoted by \(P(A\cap B)\). It is calculated below.

\[\begin{align*} P(A\cap B) &= P(A \mid B) \cdot P(B)\\ &= P(A) \cdot P(B) && \text{if independent} \end{align*}\]
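
As a quick sanity check of these definitions, here’s a toy Python sketch; every number in it is made up purely for illustration.

```python
import math

p_b = 0.4           # P(B), an arbitrary made-up value
p_a_given_b = 0.25  # P(A | B), also made up

# Joint probability: P(A and B) = P(A | B) * P(B).
p_a_and_b = p_a_given_b * p_b
print(p_a_and_b)  # 0.1

# If A & B are independent, then P(A | B) = P(A), so the joint
# probability factors as P(A) * P(B).
p_a = 0.25  # suppose P(A) happens to equal P(A | B)
assert math.isclose(p_a * p_b, p_a_and_b)
```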



It’s generally true that the weather on a particular day depends on the temperature. This implies that \(W\) and \(T\) are dependent events with conditional probabilities. So,

\[P(W\mid T) \neq P(W)\]

Bayes’ Rule & LOTP

Conceived by Reverend Thomas Bayes in the 18th century, posthumously published by his friend Richard Price, and then formalized into an equation by Pierre-Simon Laplace, Bayes’ Rule is a cornerstone equation in modern statistics & probability [1]. I’ve written Bayes’ Rule below [2]:

\[P(A \mid B) = \frac{P(A \cap B)}{P(B)}=\frac{P(A)\cdot P(B \mid A)}{P(B)}\]

NOTE: Above, you’ll notice that we’re looking at the probability of \(A\) & \(B\) together, divided by the probability of \(B\). We divide by \(P(B)\) to normalize & isolate the probability of \(A\), given that we observe \(B\).

Quick Check: If \(A\) & \(B\) are independent, how would you further simplify the numerator of Bayes’ Rule?

The denominator of Bayes’ Rule, \(P(B)\), is called the marginal probability of \(B\).

The marginal probability can either be

  • given, or
  • computed with the law of total probability (LOTP).

LOTP states that, for events \(A_1, \dots, A_n\) that partition the sample space,

\[\begin{align*} P(B) = \sum_{i=1}^n P(B \cap A_i) &= P(B \cap A_1) + P(B\cap A_2) + \dots + P(B\cap A_n) \\ &= P(B \mid A_1)P(A_1) + P(B\mid A_2)P(A_2) + \dots + P(B\mid A_n)P(A_n) \\ \end{align*}\]

Note: Looks scary! Really, it’s just calculating the probability of \(B\) within each subcategory of \(A\), multiplying by the probability of that subcategory, then adding it all together.
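
To tie Bayes’ Rule & LOTP together, here’s a hedged Python sketch using our weather & temperature example. Only the pmf of \(T\) comes from earlier; the conditional probabilities \(P(W = w_s \mid T = t)\) are assumptions I’ve made up for the sake of the sketch.

```python
# pmf of T from earlier in this section.
p_t = {"cold": 0.155, "fresh": 0.5, "hot": 0.345}

# Hypothetical conditionals P(W = sunny | T = t); these numbers are
# made-up assumptions, not taken from the text.
p_sunny_given_t = {"cold": 0.30, "fresh": 0.55, "hot": 0.70}

# LOTP: P(sunny) = sum over t of P(sunny | t) * P(t).
p_sunny = sum(p_sunny_given_t[t] * p_t[t] for t in p_t)

# Bayes' Rule: P(hot | sunny) = P(sunny | hot) * P(hot) / P(sunny).
p_hot_given_sunny = p_sunny_given_t["hot"] * p_t["hot"] / p_sunny

print(round(p_sunny, 3))            # 0.563
print(round(p_hot_given_sunny, 3))  # 0.429
```

Under these made-up conditionals, observing a sunny day raises our belief that it’s hot from \(0.345\) to roughly \(0.43\), which is exactly the kind of updating Bayes’ Rule is built for.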



1.2 Markov Chains

Stochastic processes are sequences of events that have some element of randomness in their outcomes. The amount of ‘randomness’ & the type of events can vary depending on the context. In turn, studying the properties of stochastic processes often requires many different techniques which go well beyond the boundaries of statistics. And with a litany of applications across so many domains of knowledge, the study of stochastic processes also borrows techniques from Physics, Linguistics, Sociology, Public Health, Geography & more.

Note: There is some nuance in the language we use to describe random, probabilistic, and stochastic phenomena, but the distinction is murky. So, for now, let’s stick with the above definition & enjoy some interchangeability between random, stochastic, and probabilistic.

So, this section will only be a tiny snippet of the wide range of topics covered in the study of stochastic processes.

Intuition

Markov chains are a subclass of stochastic processes that describe partially random events occurring in succession. We will formalize this definition soon, but for now, let’s talk about the weather.

Example: Weather

Let’s ignore the fact that temperature determines weather. Instead, let’s assume that we can compute the probability of tomorrow’s weather by only looking at today’s. This results in three important ideas:

  • General weather patterns before today are irrelevant when predicting tomorrow’s weather. This is an example of a Markov process, more specifically a Markov Chain.
  • The outcomes of the weather are rainy, sunny, or cloudy. These are examples of states.
  • The probability of tomorrow’s weather depends on whether it was rainy, sunny, or cloudy today. This means there are probabilities associated with tomorrow’s outcomes, strictly defined by today’s. These are examples of transition probabilities.

Let’s be rigorous now!

Mathematical Definitions

Let \(X_1, \dots, X_{n-1}, X_n\) be a collection of \(n\) successively indexed events.

The discrete \(X_i\) are a Markov Chain if

  • They exhibit the Markov Property.
    • That is, the probability of some new or predicted event \(X_{n+1}=x_{n+1}\) satisfies

\[P(X_{n+1} = x_{n+1} \mid X_1 = x_1, \dots, X_{n-1}= x_{n-1}, X_n= x_{n}) = P(X_{n+1} = x_{n+1} \mid X_n = x_n)\]

NOTE: This looks like the conditional probability of independent events! The difference: given the most recent event \(X_n\), the prior events \(X_1,\dots, X_{n-1}\) tell us nothing extra about the outcome of \(X_{n+1}\); only \(X_n\) matters.


  • There is some countable set of outcomes called the State Space, which contains every possible outcome, or State.
    • The outcomes \(x_1, \dots, x_{n-1}, x_{n}, x_{n+1}\) are all elements of the state space, \(\mathbb{S}\).

NOTE: This may seem complicated, but it just means we can pin down which outcomes are possible and which are impossible.


  • The probability of changing states is a Transition Probability, and the transition probabilities out of any given state sum to \(1\).
    • They sum to \(1\) because, for each state, they define the probability mass function of the transition from that state to the next one.
    • These probabilities can be placed in a Transition Matrix, where the columns indicate the next state & the rows indicate the current state (see the sketch below).

NOTE: The above definitions were cumulatively drawn from the following sources: [3], [4].
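
As a small sketch of the third condition, here’s a transition matrix in Python (using NumPy); the probabilities are made up for illustration, and each row is the pmf of the next state given the current one.

```python
import numpy as np

states = ["rainy", "sunny", "cloudy"]

# Rows = current state, columns = next state (made-up example numbers).
P = np.array([
    [0.20, 0.46, 0.34],   # from rainy
    [0.00, 0.80, 0.20],   # from sunny
    [0.15, 0.50, 0.35],   # from cloudy
])

# Each row must sum to 1, since each row is a pmf over the next state.
assert np.allclose(P.sum(axis=1), 1.0)
```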


Visualizations

Weather Graphs

Below, we’ve redefined our weather example as a Markov Chain, using our new definitions, and placed it into a Weighted Directed Graph. Isn’t she pretty! Note that the weights on the arrows correspond to the transition probabilities associated with each change of state.

Also, check out that Transition Matrix! It’s the first one you’ve looked at, but they can quickly get very complex.


Weather Animations

To simulate how a Markov Chain of weather behaves, let’s animate it!

In this visualization, there are 10 multicolored dots representing different arbitrary days, each colored by that particular day’s weather. To see how Markov Chains evolve over 15 days, we shuffle them 15 times in a row according to our transition matrix:

         Rainy   Sunny   Cloudy
Rainy     .20     .46     .34
Sunny     .00     .80     .20
Cloudy    .15     .50     .30

Text accompanying the animation describes the proportion of weather outcomes during the simulation; that is, how many rainy, sunny, or cloudy days there are after a given shuffle.

Look at that! The proportions converged back toward our initial probability distribution. That isn’t necessarily true of every simulation, but it’s true for this one: in general, a well-behaved (ergodic) Markov Chain settles into its stationary distribution no matter where it starts.

NOTE: This Markov Chain visualization was designed by Will Hipson, a graduate student in Psychology at Carleton University. You can find links for reproduction on his page [5] or check out my GitHub to see my edits.
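
If you’d like a text-only version of the simulation, here’s a minimal NumPy sketch of my own (it is not Will Hipson’s R code). One assumption to flag: the third row of the matrix above sums to 0.95, presumably a rounding artifact, so the sketch renormalizes each row before sampling.

```python
import numpy as np

rng = np.random.default_rng(seed=7)
states = ["rainy", "sunny", "cloudy"]

# Transition matrix from the text; rows renormalized defensively since
# the printed third row (.15, .50, .30) sums to 0.95.
P = np.array([
    [0.20, 0.46, 0.34],
    [0.00, 0.80, 0.20],
    [0.15, 0.50, 0.30],
])
P = P / P.sum(axis=1, keepdims=True)

# Start the 10 "days" from the initial distribution of W.
days = rng.choice(3, size=10, p=[0.2, 0.5, 0.3])

# Shuffle every day 15 times in a row according to the transition matrix.
for _ in range(15):
    days = np.array([rng.choice(3, p=P[d]) for d in days])

# Proportion of rainy / sunny / cloudy days after the final shuffle.
print({s: float(np.mean(days == i)) for i, s in enumerate(states)})
```

With only 10 dots, the proportions come in coarse steps of 0.1; bump the number of dots up to 1,000 or so and the shares settle near the chain’s stationary distribution.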



1.3 Conclusion

We’ve covered the necessary probability concepts. In the next section, we’ll find out how we can leverage Markov Chains to predict another set of variables, even when we can’t see the outcomes of our chain. Prepare for bivariate distributions, many subscripts, and a lot of summation notation!

If you have any lingering questions, I’ve linked some great YouTube videos that may be helpful below.

Video Resources

  • Conditional Probability
  • Bayes’ Rule
  • Markov Chains

References