Chapter 2 Frog Families & HMMs

Previously, we wanted to study the weather. Let’s up the stakes and now look at how the weather can determine whether or not a cute, infinitely reproducing frog family survives the month.

If you were a frog, you wouldn’t be able to check the weather before leaving your tree hide (you wouldn’t even be able to read!), but you do know that when a family member leaves for their daily hop, their chance of living depends on the weather. If the weather is

  • Sunny: they would have a 10% chance of living (and eating a nice bug) with a 90% chance of dying.
  • Cloudy: they would have a 75% chance of living (and eating a nice bug) with a 25% chance of dying.
  • Rainy: they would have a 98% chance of living (and eating a nice bug) with a 2% chance of dying.

Since we, as non-frog statisticians, know that the weather is a Markov Chain, we can say the survival of this frog family is actually a Hidden Markov Model. In the following sections, we will establish what that implies theoretically for us & our frog family.



2.1 Hidden Markov Models

Hidden Markov Models are another class of probabilistic models. They model Markov processes whose states cannot be directly observed, but whose dependent, observable events can be. Often, it’s these unobserved states we’re really interested in predicting, so we work backward from the observed events1.

Intuitively, it’s like using the shadow of an animal to guess what it is!

Fig 1. Frog Shadow Made by Author


Properties

Let

  • \(X_i = \{X_1=x_1, \dots, X_n =x_n\}\) be a collection of \(n\) successively indexed events.
  • \(Y_i = \{Y_1=y_1, \dots, Y_n =y_n\}\) be a collection of \(n\) successively indexed observations.
  • \(\lambda = (X,Y)\).

\(\lambda\) is a discrete Hidden Markov Model if it satisfies the following properties:

Discreteness:

Both \(X_i\) and \(Y_i\) are sequences of discrete random variables with the \(n\) realizations \(x_{1,\dots, n}\) and \(y_{1,\dots, n}\).

Note: Neither \(X_i\) nor \(Y_i\) needs to be discrete in general, but this project is explicitly dedicated to discrete Hidden Markov Models, so we will assume they are.

Markov Assumptions:
  • The \(X_i\) must be a Markov Chain.
    • So, \[P(X_{n} = x_{n} \mid X_1 = x_1, \dots, X_{n-1}= x_{n-1}) = P(X_{n} = x_{n} \mid X_{n-1} = x_{n-1})\]
  • The \(Y_i\) exhibit a Markov property where the \(k^{th}\) observation \(Y_k\) can be predicted solely from the \(k^{th}\) state of the \(X_i\), for some arbitrary \(k\).
    • So, the \(Y_i\) observations are solely dependent on the current state of the hidden \(X_i\) and nothing else, implying that
    \[P(Y_{n} = y_n \mid Y_1 = y_1, \dots, Y_{n-1} = y_{n-1}, \hspace{.1 in} X_1 = x_1, \dots, X_n= x_{n}) = P(Y_{n} = y_n \mid X_{n} = x_{n})\]
Hidden States:
  • Hidden States: The unobservable possible outcomes of \(X_i\). Denoted as \(x_1, \dots, x_{n-1}, x_{n}\).
  • Together, the outcomes form the Hidden State Space, which we will denote as \(\mathbb{S}\).
Transition, Emission, and Initial Probabilities:
  • Transition Probability: The probability of changing \(X_i\)-states. These are typically placed into a transition matrix.
    • We typically denote the entries of this transition matrix as \(\mathbb{a}_{ij}\), the probability of changing from state \(i\) to state \(j\).
  • Emission Probability: The probability of \(Y_i\), given the \(X_i\)’s state.
    • So, \(P(Y_{n}\mid X_{n} = x_{n})\) is the Emission Probability, which can be placed into the Emission Matrix.
      • We typically denote these emission probabilities as \(b_j\), where \(j\) is the state of the \(X_i\).
  • Initial Probability: The distribution of \(X_i\)’s state at the starting point. This is often denoted by \(\pi\) and is typically used as a starting guess.

NOTE: The above definitions were cumulatively drawn from the following sources 2 3 4
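To make the moving parts concrete, here is a minimal simulation sketch, assuming NumPy; `simulate_hmm` and all the variable names are ours, not a standard API. It samples a hidden Markov chain and its emissions using exactly the pieces defined above:

```python
import numpy as np

rng = np.random.default_rng(seed=42)

def simulate_hmm(pi, A, B, n):
    """Sample n (hidden state, observation) pairs from a discrete HMM.

    pi : (S,)   initial state distribution
    A  : (S, S) transition matrix; A[i, j] = P(move to j | currently in i)
    B  : (S, O) emission matrix;   B[j, o] = P(observe o | currently in j)
    """
    states, observations = [], []
    state = rng.choice(len(pi), p=pi)            # draw X_1 from pi
    for _ in range(n):
        states.append(state)
        observations.append(rng.choice(B.shape[1], p=B[state]))  # Y_k depends only on X_k
        state = rng.choice(len(pi), p=A[state])  # X_{k+1} depends only on X_k
    return states, observations

# A toy 2-state example, just to exercise the function:
pi = np.array([0.5, 0.5])
A = np.array([[0.9, 0.1], [0.2, 0.8]])
B = np.array([[0.7, 0.3], [0.1, 0.9]])
print(simulate_hmm(pi, A, B, 5))
```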

Example: Frogs

Let’s formalize the frog-family’s situation, \(\mathbb{F}\), with the above definitions!

Let \(W_i = W_1, \dots , W_n\) be the weather for \(n\) days. We know from previous work that the weather is a Markov Chain. So, for the frogs, the hidden states of \(W_i\) are sunny, rainy, and cloudy, while the hidden state space will be \(\mathbb{W}\).

From the previous sections, we also know that the transition matrix of \(W_i\) is

|            | Rainy | Sunny | Cloudy |
|------------|-------|-------|--------|
| **Rainy**  | .20   | .46   | .34    |
| **Sunny**  | .00   | .80   | .20    |
| **Cloudy** | .15   | .50   | .35    |

(Rows index the current day’s weather; columns index the next day’s, so each row sums to 1.)

And let our initial probabilities (\(\pi\)) of \(W\) be

| Rainy | Sunny | Cloudy |
|-------|-------|--------|
| .2    | .5    | .3     |


Finally, let \(D_i = D_1, \dots, D_n\) be whether or not we see a frog die after leaving the tree. We know that the emission probabilities of the \(D_i\)s are therefore

|          | \(\bf{P(D_i\mid \text{Rainy})}\) | \(\bf{P(D_i\mid \text{Sunny})}\) | \(\bf{P(D_i\mid \text{Cloudy})}\) |
|----------|------|------|------|
| Survives | .98  | .10  | .75  |
| Dead     | .02  | .90  | .25  |

NOTE: These are really three separate emission vectors, one per hidden state, squashed together for simplicity; each column holds the emission distribution for that state.
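For readers following along in code, here is one way to encode \(\mathbb{F}\); a sketch assuming NumPy, where the array names and the Rainy/Sunny/Cloudy ordering are our own choices. Note that the table above lists the hidden states as columns, while `B` below indexes states by row:

```python
import numpy as np

STATES = ["Rainy", "Sunny", "Cloudy"]   # hidden state space W
OBS = ["Survives", "Dead"]              # possible observations D

A = np.array([[0.20, 0.46, 0.34],       # transitions out of Rainy
              [0.00, 0.80, 0.20],       # transitions out of Sunny
              [0.15, 0.50, 0.35]])      # transitions out of Cloudy

B = np.array([[0.98, 0.02],             # emissions while Rainy
              [0.10, 0.90],             # emissions while Sunny
              [0.75, 0.25]])            # emissions while Cloudy

pi = np.array([0.20, 0.50, 0.30])       # initial probabilities

# Every row of A and B is a probability distribution.
assert np.allclose(A.sum(axis=1), 1.0) and np.allclose(B.sum(axis=1), 1.0)
```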

In the next section, we’ll show some visual examples of this frog family’s HMM, \(\mathbb{F}\).

Visualizations

Frog Graphs

Below, we’ve drawn the Hidden Markov Model, \(\mathbb{F}\) & placed it into a Weighted Directed Graph.

Fig 2. HMM Directed Graph. Made by Author


Frog Animations

We can reuse our previous animation to simulate state-space transitions and what that implies for our frog family’s chance of living.

Now, the dots change color with the true but unobserved weather of each day, and on-screen text describes,

  • The emission probability of dying on a given day.
  • The number of unobserved, but true, weather states.

What a nice way of looking at the relationship between emission probabilities and the hidden states! In this next section, we’ll see how we model the hidden states using observations.



2.2 Algorithms

Let’s say the frog family has been really lucky, and all the frogs who left the tree hide in the past week came back! One very curious frog named Betsy wonders what the probability of all her family surviving is, given the weather.

We can answer Betsy’s question using likelihood calculations or the Forward-Backward Algorithm.

Likelihood

The Likelihood is commonly defined as the probability of the data, given a true underlying parameter5.

Let \(W_i\) be the hidden states of the weather in the past week, and let \(D_i\) be our observations on whether or not the frogs came back alive. This implies

\[\begin{align*} W_i &= \{W_1= w_1, W_2=w_2, W_3=w_3,W_4= w_4, W_5=w_5, W_6=w_6, W_7=w_7\}\\ W_i &= \{w_1, w_2, w_3, w_4, w_5, w_6, w_7\} \\[15pt] \text{and}\\[15pt] D_i &= \{D_1= d_1, D_2=d_2, D_3=d_3,D_4= d_4, D_5=d_5, D_6=d_6, D_7=d_7\}\\ D_i &= \{d_1, d_2, d_3, d_4, d_5, d_6, d_7\} \end{align*}\]

In this case, the likelihood would be the probability of our observed \(D_i\), given some true underlying set of \(W_i\). Because each observation depends only on that day’s hidden state, this conditional probability factors into a product, computed below.

\[\begin{align*} P(D_i \mid W_i) &= P(d_1 \mid w_1) \times P(d_2 \mid w_2) \\ & \times P(d_3 \mid w_3) \times P(d_4 \mid w_4) \\ & \times P(d_5 \mid w_5) \times P(d_6 \mid w_6) \\ & \times P(d_7 \mid w_7) \\ P(D_i \mid W_i) &= \prod_{i=1}^7 P(d_i \mid w_i) \end{align*}\]

NOTE: The capital pi (\(\prod\)) means to multiply a series indexed by \(i\).


If we knew the states of the past week were Sunny, Cloudy, Rainy, Rainy, Rainy, Rainy, Rainy and everyone came back alive, we’d compute the likelihood as

\[\begin{align*} P(D_i \mid W_i) &= P(Lives \mid Sunny) \times P(Lives \mid Cloudy)\\ &\times P(Lives \mid Rainy)\times P(Lives \mid Rainy) \\ & \times P(Lives \mid Rainy) \times P(Lives \mid Rainy) \\ & \times P(Lives \mid Rainy) \\[10pt] P(D_i \mid W_i) &= .10 \times .75 \times .98 \times .98 \times .98 \times .98 \times .98 \\ P(D_i \mid W_i) &= 0.06779406 \\ P(D_i \mid W_i) &\approx 0.068 \\ \end{align*}\]
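As a quick sanity check, here’s the same arithmetic as a sketch in plain Python; `p_survive` and `week` are our own illustrative names:

```python
# Emission probabilities of surviving, keyed by weather (from the table above).
p_survive = {"Rainy": 0.98, "Sunny": 0.10, "Cloudy": 0.75}

week = ["Sunny", "Cloudy", "Rainy", "Rainy", "Rainy", "Rainy", "Rainy"]

likelihood = 1.0
for weather in week:
    likelihood *= p_survive[weather]   # P(Lives | weather), one factor per day

print(round(likelihood, 8))            # 0.06779406
```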


Bad News: We can’t truly know the states of the \(W_i\)! Bummer.

Because of this, we need to add together the likelihoods over all possible states for each event, weighted by their probability6. More succinctly: we want the likelihood of all observations, \(P(D_i)\), which we get by summing the joint probability \(P(D_i \cap W_i)\) over every possible weather sequence. We can do this with the rules of joint probability we covered in the previous section.

\[\begin{align*} P(D_i \cap W_i) &= P(D_i \mid W_i)\times P(W_i) \\ &= \prod_{i=1}^7 P(d_i \mid w_i) \times P(W_i) && \text{by above}\\ &= \prod_{i=1}^7 P(d_i \mid w_i) \times \pi_{w_1}\prod_{i=2}^7 P(w_i \mid w_{i-1}) && \text{by Markov} \end{align*}\]

Now, we sum over every possible weather sequence.

\[P(D_i) = \sum_{W_i \in \mathbb{W}^7} P(D_i \cap W_i) = \sum_{W_i \in \mathbb{W}^7} P(D_i \mid W_i)\times P(W_i)\]

where the sum runs over every possible weather sequence \(W_i\).

So, for our 7 consecutive surviving frogs, this would be (each term below should also carry the weight \(P(W_i)\) of its weather sequence; we omit those factors to keep the display readable)

\[\begin{align*} P(D_i) &=P(Lives \mid Sunny) \times P(Lives \mid Sunny)\times P(Lives \mid Sunny)\\ &\times P(Lives \mid Sunny)\times P(Lives \mid Sunny) \times P(Lives \mid Sunny) \times P(Lives \mid Sunny) \\[10pt] & + P(Lives \mid Cloudy) \times P(Lives \mid Sunny)\times P(Lives \mid Sunny)\\ &\times P(Lives \mid Sunny)\times P(Lives \mid Sunny) \times P(Lives \mid Sunny) \times P(Lives \mid Sunny) \\[10pt] & + P(Lives \mid Sunny) \times P(Lives \mid Cloudy)\times P(Lives \mid Sunny)\\ &\times P(Lives \mid Sunny)\times P(Lives \mid Sunny) \times P(Lives \mid Sunny) \times P(Lives \mid Sunny) \\[10pt] & + P(Lives \mid Sunny) \times P(Lives \mid Sunny)\times P(Lives \mid Cloudy)\\ &\times P(Lives \mid Sunny)\times P(Lives \mid Sunny) \times P(Lives \mid Sunny) \times P(Lives \mid Sunny) \\[10pt] &+ \dots \\[10pt] &+ P(Lives \mid Rainy) \times P(Lives \mid Rainy)\times P(Lives \mid Rainy)\\ &\times P(Lives \mid Rainy)\times P(Lives \mid Rainy) \times P(Lives \mid Rainy) \times P(Lives \mid Rainy) \end{align*}\]

That is extremely difficult to do when the number of states (\(N\)) is large or we have some arbitrarily large number of observations, \(T\), since we’d be analyzing \(N^T\) different possible sequences. However, we can predict the hidden states of HMMs like this one, using the Forward-Backward Algorithm.
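To see the blow-up concretely, here’s a brute-force sketch, assuming NumPy and our earlier encoding of \(\mathbb{F}\), that enumerates all \(3^7 = 2187\) weather sequences; the Forward Algorithm below will reproduce this number without the enumeration:

```python
import itertools
import numpy as np

# Frog model from earlier (states ordered Rainy, Sunny, Cloudy).
A = np.array([[0.20, 0.46, 0.34],
              [0.00, 0.80, 0.20],
              [0.15, 0.50, 0.35]])
B = np.array([[0.98, 0.02],
              [0.10, 0.90],
              [0.75, 0.25]])
pi = np.array([0.20, 0.50, 0.30])
SURVIVES = 0                         # column index of "Survives" in B

total = 0.0
for seq in itertools.product(range(3), repeat=7):   # all 3^7 = 2187 sequences
    p = pi[seq[0]] * B[seq[0], SURVIVES]            # first day
    for prev, cur in zip(seq, seq[1:]):             # remaining six days
        p *= A[prev, cur] * B[cur, SURVIVES]        # transition, then emission
    total += p

print(total)   # P(all seven frogs survive)
```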

The Forward-Backward Algorithm

The Forward-Backward Algorithm is a dynamic programming algorithm used to infer the probability of seeing observations in a Hidden Markov Model. It contains two sub-algorithms, the Forward Algorithm & the Backward Algorithm, which sweep through the observations from the beginning and from the end, respectively, computing the probability of the data until they collide7. The process may seem extra, but as you’ve seen in the last section’s toy computation, we have to be extra out of necessity.

NOTE: The following sections draw heavily from the following sources 8 9 10 11

Forward

In the first section of the Forward-Backward Algorithm, we compute the forward probabilities. Forward probabilities are, informally, the joint probability of all observations so far and of being in some given state.

Formally, a forward probability is defined as

\[\alpha_k(j) = P(d_1, \dots , d_k,w_k = j \mid \mathbb{F}) \]

> NOTE: That is, \(\alpha_k(j)\) is the probability of seeing the first \(k\) observations (\(d_1,\dots, d_k\)) and being in state \(j\) at step \(k\).

To compute the forward probabilities, we must

  • Sum over all previous forward probabilities \(\alpha_{k-1}(i)\) that could lead us to \(\alpha_k(j)\),
  • multiplied by the transition probabilities \(\mathbb{a}_{ij}\) describing the change from the previous state \(i\) to the current state \(j\),
  • then multiply the total by the emission probability of our current observation \(d_k\), given the state \(j\). That is \(b_j(d_{k})\).

Formally, this is written as

\[\alpha_{k}(j) = \left[\sum_{i=1}^\mathbb{W} \alpha_{k-1}(i) \mathbb{a}_{ij} \right]b_j(d_{k})\]

Note: Remember that \(\mathbb{W}\) is the state space!


In practice, we typically use recursion to compute the probability of our observations, like so:

1.) Base Case: We use the initial probability of our system’s states.

\[\begin{align*} \alpha_1(j) = \pi_jb_j(d_1) && \text{for } 1 \leq j \leq \mathbb{W} \end{align*}\]

2.) Recursion: We use the previous case & transition & emission probabilities to compute the probability of being in the newest state \(j\).

\[\begin{align*} \alpha_{k}(j) = \left[\sum_{i=1}^\mathbb{W} \alpha_{k-1}(i) \mathbb{a}_{ij} \right]b_j(d_{k}) && \text{for } 1 \leq j \leq \mathbb{W}, 2 \leq k \leq n \end{align*}\]

3.) End: We can then define the probability of the observations as the sum of the final forward probabilities.

\[P(D_{1:n} = d_{1:n} \mid \lambda) = \sum_{i=1}^\mathbb{W} \alpha_{n}(i)\]
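Here’s a minimal sketch of the forward pass, assuming NumPy and our earlier encoding; `forward` is our own name, not a library call. Run on seven straight survivals, it reproduces the brute-force number from before:

```python
import numpy as np

def forward(obs, pi, A, B):
    """Forward pass: alpha[k, j] = P(d_1, ..., d_{k+1}, state j at step k+1).

    obs is a list of observation indices; returns (alpha, P(observations)).
    """
    n, S = len(obs), len(pi)
    alpha = np.zeros((n, S))
    alpha[0] = pi * B[:, obs[0]]                      # base case: pi_j * b_j(d_1)
    for k in range(1, n):
        alpha[k] = (alpha[k - 1] @ A) * B[:, obs[k]]  # recursion
    return alpha, alpha[-1].sum()                     # end: sum the last row

# Seven straight survivals for the frog model defined earlier:
A = np.array([[0.20, 0.46, 0.34], [0.00, 0.80, 0.20], [0.15, 0.50, 0.35]])
B = np.array([[0.98, 0.02], [0.10, 0.90], [0.75, 0.25]])
pi = np.array([0.20, 0.50, 0.30])
alpha, p_obs = forward([0] * 7, pi, A, B)
print(p_obs)   # matches the brute-force enumeration above
```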

Backward

In the second section of the Forward-Backward Algorithm, we compute the backward probabilities. Backward probabilities are, informally, the probability of all the successive observations from \(k+1\) onward, given that we are in some arbitrary state \(v\) at step \(k\).

Formally, a backward probability is defined as \[\beta_k(v) = P(d_{k+1}, \dots , d_n \mid w_k = v , \mathbb{F}) \]

To compute the backward probabilities, we must

  • Sum over the successive values \(\beta_{k+1}(j)\), working backward from the end of the observations at \(d_n\),
  • multiplied by the transition probabilities \(\mathbb{a}_{vj}\) describing the change from the given current state \(v\) to the next state \(j\),
  • and multiplied again by the emission probabilities of the successive observation \(d_{k+1}\). That is \(b_j(d_{k+1})\).

Formally, this is written as

\[\beta_k(v) = \sum_{j=1}^\mathbb{W} \mathbb{a}_{vj}b_j(d_{k+1}) \beta_{k+1}(j) \]

Note: Remember that \(\mathbb{W}\) is the state space!


In practice, we typically use recursion to compute the probability of our observations, like so:

1.) Base Case: We set the final values to 1, since we’re working backward from the end!

\[\begin{align*} \beta_n(v) = 1 && \text{for } 1 \leq v \leq \mathbb{W} \end{align*}\]

2.) Recursion: We use the transition & emission probabilities of moving from the current state \(v\) to each next state \(j\) to weight the sum of all successive values.

\[\begin{align*} \beta_k(v) = \sum_{j=1}^\mathbb{W} \mathbb{a}_{vj}b_j(d_{k+1}) \beta_{k+1}(j) && \text{for } 1 \leq v \leq \mathbb{W}, 1 \leq k \leq n-1 \end{align*}\]

3.) End: We can then recover the probability of the full observation sequence from the front, by weighting \(\beta_1\) with the initial probabilities and the first emissions.

\[P(D_{1:n} =d_{1:n} \mid \lambda) = \sum^\mathbb{W}_{j=1}\pi_j b_j(d_1)\beta_1(j)\]

In the end, we’ve computed the same thing! But now, we can use the probabilities of our observations to study the distribution of the hidden states.
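And a matching sketch of the backward pass, under the same assumptions as the forward sketch:

```python
import numpy as np

def backward(obs, pi, A, B):
    """Backward pass: beta[k, v] = P(d_{k+2}, ..., d_n | state v at step k+1)."""
    n, S = len(obs), len(pi)
    beta = np.ones((n, S))                              # base case: beta_n = 1
    for k in range(n - 2, -1, -1):
        beta[k] = A @ (B[:, obs[k + 1]] * beta[k + 1])  # recursion
    return beta

A = np.array([[0.20, 0.46, 0.34], [0.00, 0.80, 0.20], [0.15, 0.50, 0.35]])
B = np.array([[0.98, 0.02], [0.10, 0.90], [0.75, 0.25]])
pi = np.array([0.20, 0.50, 0.30])
beta = backward([0] * 7, pi, A, B)

# Recover P(observations) from the front, as in the "End" step above:
print((pi * B[:, 0] * beta[0]).sum())   # same number the forward pass gives
```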

Collision

We can use Bayes’ Rule, joint probability, and our previous work to compute the probability of being in a state \(v\) at a given time, \(t\). This is commonly called smoothing. Like before, we will use the notation of our frog example to describe the states (\(W_i\)) and observations (\(D_i\)).

By Bayes’ Rule, the probability of being in state \(v\), given the observations, is

\[\gamma_t(v) = P(W_t = v \mid D_i) = \frac{P(W_t = v \cap D_i )}{P(D_i)} \]

However, we can make this simpler by acknowledging that \(P(W_t = v \mid D_i)\) is proportional (i.e., equal up to division by a constant) to the joint probability of \(W_t = v\) and \(D_i\). So,

\[\begin{align*} \gamma_t(v) &\propto P(W_t = v \cap D_i) \\ \end{align*}\]

We can simplify this even further by splitting our \(n\) observations at the \(t^{th}\) observation, like

  • \(D_{i} = \{d_1, \dots, d_n\}\)
  • \(D_{1:t} = \{d_1, \dots, d_t\}\)
  • \(D_{t+1:n} = \{d_{t+1}, \dots, d_{n}\}\)

Then,

\[\begin{align*} \gamma_t(v) &\propto P(W_t = v \cap D_i) \\ &\propto P(D_{t+1:n} \mid W_t = v \cap D_{1:t}) \times P(W_t = v \cap D_{1:t}) && \text{by the chain rule}\\ &\propto P(D_{t+1:n} \mid W_t = v) \times P(W_t = v \cap D_{1:t}) && \text{by Markov Property of } D_i\\ \end{align*}\]

We recognize these as the backward probability, \(\beta_t(v)\), times the forward probability, \(\alpha_t(v)\). To turn the proportionality back into an equality, we reintroduce the constant \(\frac{1}{ \sum^\mathbb{W}\beta_t(v) \times \alpha_t(v)}\).

\[\begin{align*} \gamma_t(v) = P(W_t = v \mid D_i) = \text{backward} \times \text{forward} \times \frac{1}{ \sum^\mathbb{W} \beta_t(v) \times \alpha_t(v)} \end{align*}\]

Note: We’ve multiplied by the constant value of \(\frac{1}{ \sum^\mathbb{W}\beta_t(v) \times \alpha_t(v)}\) because the marginal distribution of the \(D_i\) would be the sum of the probabilities of the observations for each possible state, by the Law of Total Probability.

Thus, the probability of being in a given state \(v\), at a particular time, \(t\), given the data is

\[\gamma_t(v) = P(W_t = v \mid D_i) = \frac{\beta_t(v) \times \alpha_t(v)}{\sum^\mathbb{W} \beta_t(v) \times \alpha_t(v)}\]

We did it! Intuitively, though, it might not be obvious why we can compute the probability of being in a state by multiplying the backward & forward probabilities, then dividing by the sum over all possible states.

The idea is that the two recursions “scan” through the observations from opposite ends, each accumulating the probability of what lies behind or ahead of it, until they collide and collapse onto a joint probability at time \(t\). They do this for each possible state. Then, by dividing by the total probability of the observations, we get the probability of the \(v\)-state.
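Putting the two passes together gives the smoothed posteriors \(\gamma_t(v)\); here’s a combined sketch under the same assumptions as the forward and backward sketches above:

```python
import numpy as np

def smoothed_posterior(obs, pi, A, B):
    """gamma[t, v] = P(state v at time t+1 | all observations), via forward-backward."""
    n, S = len(obs), len(pi)
    alpha, beta = np.zeros((n, S)), np.ones((n, S))
    alpha[0] = pi * B[:, obs[0]]                          # forward pass
    for k in range(1, n):
        alpha[k] = (alpha[k - 1] @ A) * B[:, obs[k]]
    for k in range(n - 2, -1, -1):                        # backward pass
        beta[k] = A @ (B[:, obs[k + 1]] * beta[k + 1])
    gamma = alpha * beta                                  # collide!
    return gamma / gamma.sum(axis=1, keepdims=True)       # divide by P(observations)

A = np.array([[0.20, 0.46, 0.34], [0.00, 0.80, 0.20], [0.15, 0.50, 0.35]])
B = np.array([[0.98, 0.02], [0.10, 0.90], [0.75, 0.25]])
pi = np.array([0.20, 0.50, 0.30])
gamma = smoothed_posterior([0] * 7, pi, A, B)
print(gamma[0])   # P(Rainy), P(Sunny), P(Cloudy) on day 1, given the whole week
```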

2.3 Conclusion

We did it! We’ve learned how to compute the probability of the observations & the probability of a particular state at a given time, using the Forward-Backward Algorithm. There are, of course, other problems that Hidden Markov Models can solve, like finding the most probable sequence of all states or maximizing the forward and backward functions themselves.

However, the Forward-Backward Algorithm has so many meaningful applications throughout the natural & social sciences alone. We’ll look at a couple in the next chapter.

If you have any lingering questions, I’ve linked some great YouTube videos that may be helpful below.

Video Resources

Hidden Markov Model Properties

Forward Backward & Likelihood


References