Hidden Markov Model
Emission probability
The emission probability is what gives rise to the observed state. In other words, given the hidden state, it is the probability of emitting a particular observation.
Still using the example from before, the difference now is that you don't know whether the student is sleeping, watching Netflix, or doing homework. Instead, the only information you have is whether or not the student is stressed (the observed state). Depending on the actual hidden state (sleep, homework, Netflix), the probability of being stressed changes.
| | Relaxed | Stressed |
|---|---|---|
| Sleep | 0.9 | 0.1 |
| Homework | 0.1 | 0.9 |
| Netflix | 0.7 | 0.3 |
For reference, here are the initial probabilities from before:
| Outcome | Likelihood |
|---|---|
| Sleep | 0.6 |
| Homework | 0.3 |
| Netflix | 0.1 |
As well as the transition probabilities. Keep in mind that each row is the current state and each column is the next state, so every row sums to 1:
| | Sleep | Homework | Netflix |
|---|---|---|---|
| Sleep | 0.5 | 0.4 | 0.1 |
| Homework | 0.4 | 0.2 | 0.4 |
| Netflix | 0.3 | 0.1 | 0.6 |
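Before moving on, here is a minimal sketch (my own, not from the original post) that writes these three tables down as plain Python dictionaries; the check at the end confirms each row of probabilities sums to 1:

```python
# The three HMM parameter sets from the tables above, as nested dicts.
# The state and observation names are illustrative choices.
initial = {"Sleep": 0.6, "Homework": 0.3, "Netflix": 0.1}

transition = {
    "Sleep":    {"Sleep": 0.5, "Homework": 0.4, "Netflix": 0.1},
    "Homework": {"Sleep": 0.4, "Homework": 0.2, "Netflix": 0.4},
    "Netflix":  {"Sleep": 0.3, "Homework": 0.1, "Netflix": 0.6},
}

emission = {
    "Sleep":    {"Relaxed": 0.9, "Stressed": 0.1},
    "Homework": {"Relaxed": 0.1, "Stressed": 0.9},
    "Netflix":  {"Relaxed": 0.7, "Stressed": 0.3},
}

# sanity check: every row of probabilities should sum to 1
for row in [initial, *transition.values(), *emission.values()]:
    assert abs(sum(row.values()) - 1.0) < 1e-9
```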
Example
| Observations | 1 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 0 |
|---|---|---|---|---|---|---|---|---|---|
| State path | H | H | N | N | N | H | S | S | S |
| Emission probability | 0.9 | 0.9 | 0.7 | 0.7 | 0.3 | 0.9 | 0.1 | 0.9 | 0.9 |
| Transition probability | 0.3 | 0.2 | 0.4 | 0.6 | 0.6 | 0.1 | 0.4 | 0.5 | 0.5 |

Here 1 = stressed and 0 = relaxed, and the first "transition" entry, 0.3, is really the initial probability of starting in H.
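As a quick check of the example above (a snippet of my own, not from the post), multiplying every entry in the emission and transition rows together gives the joint probability of this particular state path and observation sequence:

```python
# Emission and transition rows copied from the example table; the first
# "transition" entry, 0.3, is the initial probability of starting in H.
emissions   = [0.9, 0.9, 0.7, 0.7, 0.3, 0.9, 0.1, 0.9, 0.9]
transitions = [0.3, 0.2, 0.4, 0.6, 0.6, 0.1, 0.4, 0.5, 0.5]

joint = 1.0
for e, a in zip(emissions, transitions):
    joint *= e * a
print(joint)  # roughly 7.5e-07
```

A tiny number, because any one specific nine-step path is unlikely; the next section asks about the likelihood of the observation sequence on its own.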
What is the likelihood of a specific observation sequence?
$$P(O) = \sum_{X} P(O, X)$$

What the above equation describes is that to determine the likelihood of an observation sequence $O$, you must add up the joint probabilities of the observed sequence and every potential (hidden) state path $X$. So how do we do this?
Based on conditional probability:

$$P(O, X) = P(O \mid X)\,P(X)$$
Multiply the emission probabilities together: given a state sequence, the probability of emitting the observed sequence is

$$P(O \mid X) = \prod_{i=1}^{T} P(o_i \mid x_i)$$
Putting it all together: the probability of an observed sequence and a corresponding hidden state sequence is equal to the product of the emission probabilities multiplied by the product of the transition probabilities (with the initial probability playing the role of the first "transition", as in the example table above):

$$P(O, X) = \pi_{x_1}\, P(o_1 \mid x_1) \prod_{i=2}^{T} P(x_i \mid x_{i-1})\, P(o_i \mid x_i)$$
We could, in principle, list all possible state paths for a particular observation sequence. If we observed {1, 0, 0}, the possible state paths include SSS, SHH, SHN, SNN, SNH, HHH, HNS, HSH, HSS, etc. We could calculate the probability of each. For example, take SNS: the initial probability of sleep is 0.6, and the probability of being stressed given that the student is sleeping is 0.1. Sleep transitioning to Netflix is 0.1, and the probability of being relaxed while watching Netflix is 0.7. Netflix transitioning back to sleep is 0.3, and being relaxed while sleeping is 0.9, so P = 0.6 × 0.1 × 0.1 × 0.7 × 0.3 × 0.9 ≈ 0.0011.
To find the total probability of an observed sequence (e.g. {1, 0, 0}), we add up the probabilities of all the possible state paths (SNS, SSS, HSN, etc.).
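Here is a brute-force version of that sum as a sketch (my own code, not the post's): it enumerates all 3³ = 27 possible state paths for {1, 0, 0} and adds up their joint probabilities:

```python
from itertools import product

# States indexed S=0, H=1, N=2; observations encoded Relaxed=0, Stressed=1
initial    = [0.6, 0.3, 0.1]
emission   = [[0.9, 0.1], [0.1, 0.9], [0.7, 0.3]]
transition = [[0.5, 0.4, 0.1], [0.4, 0.2, 0.4], [0.3, 0.1, 0.6]]
obs = [1, 0, 0]

total = 0.0
for path in product(range(3), repeat=len(obs)):
    # joint probability of this particular state path and the observations
    p = initial[path[0]] * emission[path[0]][obs[0]]
    for t in range(1, len(obs)):
        p *= transition[path[t - 1]][path[t]] * emission[path[t]][obs[t]]
    total += p

print(round(total, 4))  # 0.1441
```

The number of paths grows exponentially with the length of the sequence, so this only works for toy examples, which is exactly the problem the forward algorithm solves.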
Forward algorithm
The forward algorithm uses dynamic programming: the problem is broken down into sub-problems whose solutions are combined to solve the larger problem. The solutions to the sub-problems are stored (memoization) so that you don't waste time computing them over and over again.
To implement this, you need to create an empty matrix to store intermediate values. The number of rows equals the number of states (in this case, 3), and the number of columns equals the length of the observation sequence.
First column: the first column holds the probability of each state being the initial state and emitting the first observation. All future calculations build on these values.
Subsequent columns: each value in the matrix represents the likelihood of being in a certain state $k$ having observed the first $i$ values of the observation sequence. We will represent this by:

$$\alpha_k(i) = P(o_1, \ldots, o_i,\; x_i = k) = P(o_i \mid x_i = k) \sum_{j} \alpha_j(i-1)\, P(x_i = k \mid x_{i-1} = j)$$
```python
def prettyPrint(grid):
    # helper (reconstructed here; the original post does not show its code):
    # prints the DP matrix with column indices on top and one row per state
    print("  " + " ".join(format(t, ">9") for t in range(len(grid[0]))))
    for label, row in zip(["S", "H", "N"], grid):
        print(label + " " + " ".join(format(v, "9.7f") for v in row))

def forward():
    # observation encoding: Relaxed = 0, Stressed = 1
    observations = [1, 0, 0, 1, 1, 0, 1, 1]

    # build empty dynamic programming matrix
    num_states = 3
    obs_seq_num = len(observations)
    grid = []
    for j in range(num_states):
        grid.append(obs_seq_num * [0])

    # initial probability distribution for S, H, N
    initial = [0.6, 0.3, 0.1]
    # emission probabilities (rows: S, H, N; columns: relaxed, stressed)
    emission = [[0.9, 0.1], [0.1, 0.9], [0.7, 0.3]]
    # transition probabilities (rows: from-state; columns: to-state)
    transition = [[0.5, 0.4, 0.1], [0.4, 0.2, 0.4], [0.3, 0.1, 0.6]]

    # start filling out the grid
    for t in range(obs_seq_num):
        for i in range(num_states):
            if t == 0:
                # first column: use the initial probability distribution
                grid[i][t] = initial[i] * emission[i][observations[t]]
            else:
                # emission probability times the weighted sum over every
                # state we could have come from
                grid[i][t] = emission[i][observations[t]] * (
                    grid[0][t - 1] * transition[0][i]
                    + grid[1][t - 1] * transition[1][i]
                    + grid[2][t - 1] * transition[2][i]
                )

    prettyPrint(grid)
    print("Probability of observing: ")
    for k in observations:
        print(k, end=" ")
    print()
    # each column sum is the probability of the observed prefix so far;
    # the last value is the probability of the full sequence
    for s in range(obs_seq_num):
        total = 0
        for q in range(num_states):
            total += grid[q][s]
        print(format(total, "5.4f"), end=" ")

forward()
```
```
##           0         1         2         3         4         5         6         7
## S 0.0600000 0.1323000 0.0873990 0.0061352 0.0021659 0.0064380 0.0005115 0.0001805
## H 0.2700000 0.0081000 0.0063780 0.0371420 0.0100145 0.0003557 0.0029078 0.0008235
## N 0.0300000 0.0924000 0.0503370 0.0124480 0.0068817 0.0058460 0.0012881 0.0005961
##
## Probability of observing:
## 1 0 0 1 1 0 1 1
## 0.3600 0.2328 0.1441 0.0557 0.0191 0.0126 0.0047 0.0016
```
For example, how did we calculate the probability of being at state S at time 2, with the observations so far being 1 0? Using the recurrence: $\alpha_S(2) = P(\text{relaxed} \mid S) \times [\alpha_S(1) \cdot 0.5 + \alpha_H(1) \cdot 0.4 + \alpha_N(1) \cdot 0.3] = 0.9 \times (0.06 \times 0.5 + 0.27 \times 0.4 + 0.03 \times 0.3) = 0.9 \times 0.147 = 0.1323$, the entry in row S of the second column above.
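The same arithmetic as a one-line check:

```python
# alpha_S at the second column, built from the first column's values
alpha_S = 0.9 * (0.06 * 0.5 + 0.27 * 0.4 + 0.03 * 0.3)
print(round(alpha_S, 4))  # 0.1323
```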
Decoding: Viterbi Algorithm
The Viterbi algorithm attempts to answer: given a sequence of observations, what is the most likely corresponding state path? It also uses dynamic programming, moving from the first observation to the current observation and keeping, at each step, the most likely path so far (i.e. the max). To recover the path itself, we must also use backtracking.
Again, similar to the forward algorithm, you will need a matrix to store intermediate maximum values. Each value represents the probability of the most likely path that ends in state $k$ after the first $i$ observations:

$$v_k(i) = P(o_i \mid x_i = k)\, \max_{j} \left[ v_j(i-1)\, P(x_i = k \mid x_{i-1} = j) \right]$$

The only change from the forward algorithm is that the sum over previous states becomes a max.
But what about the actual path? This is where we use backtracking and argmax instead of max. You need a separate matrix to keep track of these values. If you picture a downward-opening parabola, the y-value at the highest point is the max, while the x-value you plugged into the function to reach that point is the argmax. For example, if the max from before came from hidden state "S", "S" is what gets recorded in the backtracking matrix.
```python
def viterbi():
    # observation encoding: Relaxed = 0, Stressed = 1
    observations = [1, 0, 0, 1, 1, 0, 1, 1]

    # build empty dynamic programming and backtracking matrices
    num_states = 3
    obs_seq_num = len(observations)
    grid = []
    backtracking = []
    for j in range(num_states):
        grid.append(obs_seq_num * [0])
        backtracking.append(obs_seq_num * [0])

    # initial probability distribution for S, H, N
    initial = [0.6, 0.3, 0.1]
    # emission probabilities (rows: S, H, N; columns: relaxed, stressed)
    emission = [[0.9, 0.1], [0.1, 0.9], [0.7, 0.3]]
    # transition probabilities (rows: from-state; columns: to-state)
    transition = [[0.5, 0.4, 0.1], [0.4, 0.2, 0.4], [0.3, 0.1, 0.6]]

    for t in range(obs_seq_num):
        for i in range(num_states):
            if t == 0:
                # first column: use the initial probability distribution
                grid[i][t] = initial[i] * emission[i][observations[t]]
            else:
                option_S = grid[0][t - 1] * transition[0][i]
                option_H = grid[1][t - 1] * transition[1][i]
                option_N = grid[2][t - 1] * transition[2][i]
                # max goes in the grid, argmax in the backtracking matrix
                grid[i][t] = emission[i][observations[t]] * max(option_S, option_H, option_N)
                if option_S > option_H and option_S > option_N:
                    backtracking[i][t] = "S"
                elif option_H > option_N and option_H > option_S:
                    backtracking[i][t] = "H"
                else:
                    backtracking[i][t] = "N"

    prettyPrint(grid)  # defined above, with the forward algorithm
    print("Backtracking matrix")
    for w in backtracking:
        print(w)

    # termination: the most likely final state is the max of the last column
    if grid[0][-1] > grid[1][-1] and grid[0][-1] > grid[2][-1]:
        path = "S"
        pointer = 0
    elif grid[1][-1] > grid[0][-1] and grid[1][-1] > grid[2][-1]:
        path = "H"
        pointer = 1
    else:
        path = "N"
        pointer = 2

    # follow the backtracking pointers from the last column to the first
    for s in range((obs_seq_num - 1), 0, -1):
        hidden_state = backtracking[pointer][s]
        path = hidden_state + " " + path
        if hidden_state == "S":
            pointer = 0
        elif hidden_state == "H":
            pointer = 1
        else:
            pointer = 2

    print("Most likely state path: " + path)

viterbi()
```
```
##           0         1         2         3         4         5         6         7
## S 0.0600000 0.0972000 0.0437400 0.0021870 0.0006299 0.0010204 0.0000510 0.0000147
## H 0.2700000 0.0054000 0.0038880 0.0157464 0.0028344 0.0000567 0.0003673 0.0000661
## N 0.0300000 0.0756000 0.0317520 0.0057154 0.0018896 0.0007936 0.0001429 0.0000441
##
## Backtracking matrix
## [0, 'H', 'S', 'S', 'H', 'H', 'S', 'H']
## [0, 'H', 'S', 'S', 'H', 'H', 'S', 'H']
## [0, 'H', 'N', 'N', 'H', 'H', 'N', 'H']
## Most likely state path: H S S H H S H H
```
For example, how did they calculate the probability of the most probable path at time 2 if the last state is S? Take the maximum over the previous column: $v_S(2) = P(\text{relaxed} \mid S) \times \max(0.06 \times 0.5,\; 0.27 \times 0.4,\; 0.03 \times 0.3) = 0.9 \times 0.108 = 0.0972$. The maximum came from H, which is why "H" appears in row S of the second column of the backtracking matrix.
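And the same check in code:

```python
# v_S at the second column: max over the first column's weighted values
v_S = 0.9 * max(0.06 * 0.5, 0.27 * 0.4, 0.03 * 0.3)
print(round(v_S, 4))  # 0.0972 -- the winning term came from H
```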
To find the path, go to the last column of the probability matrix and find the state with the highest probability; that is the last hidden state: ? ? ? ? ? ? ? H. Then, using the backtracking matrix constructed from the argmax values, work backwards. In this case, since the last hidden state was H, go to the last column of the backtracking matrix and the row for H (the middle row), which also holds an H: ? ? ? ? ? ? H H. Therefore, go to the second-to-last column, again in the row for H; this holds the value S: ? ? ? ? ? S H H. Then go to the third-to-last column, row for S, and so on. Finally: H S S H H S H H.
Packages
This is too much math/I don't care. Fair enough: there are plenty of packages available on the internet. For example, these methods are implemented in the R package HMM, which I think is super helpful. Just understanding what the different terms mean (transition matrix, initial probability distribution, emission probability, etc.) should be enough to use these tools.
Application
This is a good exercise to test your understanding. It looks at predicting the secondary structure of proteins from the primary sequence. Another common application in bioinformatics is detecting CpG islands.