Hidden Markov Model
Emission probability
The emission probability is what gives rise to the observed state. In other words, given the hidden state, it is the probability of emitting a particular observation.
Still using the example from before, the difference now is that you don't know whether the student is sleeping, watching Netflix, or doing homework. Instead, the only information you have is whether or not the student is stressed (the observed state). Depending on the actual hidden state (sleep, homework, Netflix), the probability of being stressed changes.
| | Relaxed | Stressed |
|---|---|---|
| Sleep | 0.9 | 0.1 |
| Homework | 0.1 | 0.9 |
| Netflix | 0.7 | 0.3 |
For reference, here are the initial probabilities from before:
| Outcome | Likelihood |
|---|---|
| Sleep | 0.6 |
| Homework | 0.3 |
| Netflix | 0.1 |
As well as the transition probabilities. Keep in mind that each row is the current state and each column is the next state, so every row sums to 1:
| | Sleep | Homework | Netflix |
|---|---|---|---|
| Sleep | 0.5 | 0.4 | 0.1 |
| Homework | 0.4 | 0.2 | 0.4 |
| Netflix | 0.3 | 0.1 | 0.6 |
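Before moving on, here is a minimal sketch (my own, not from the original post) that writes these three tables down as plain Python dictionaries; the check at the end confirms each row of probabilities sums to 1:

```python
# The three HMM parameter sets from the tables above, as nested dicts.
# The state and observation names are illustrative choices.
initial = {"Sleep": 0.6, "Homework": 0.3, "Netflix": 0.1}

transition = {
    "Sleep":    {"Sleep": 0.5, "Homework": 0.4, "Netflix": 0.1},
    "Homework": {"Sleep": 0.4, "Homework": 0.2, "Netflix": 0.4},
    "Netflix":  {"Sleep": 0.3, "Homework": 0.1, "Netflix": 0.6},
}

emission = {
    "Sleep":    {"Relaxed": 0.9, "Stressed": 0.1},
    "Homework": {"Relaxed": 0.1, "Stressed": 0.9},
    "Netflix":  {"Relaxed": 0.7, "Stressed": 0.3},
}

# sanity check: every row of probabilities should sum to 1
for row in [initial, *transition.values(), *emission.values()]:
    assert abs(sum(row.values()) - 1.0) < 1e-9
```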
Example
| Observations | 1 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 0 |
|---|---|---|---|---|---|---|---|---|---|
| State path | H | H | N | N | N | H | S | S | S |
| Emission probability | 0.9 | 0.9 | 0.7 | 0.7 | 0.3 | 0.9 | 0.1 | 0.9 | 0.9 |
| Transition probability | 0.3 | 0.2 | 0.4 | 0.6 | 0.6 | 0.1 | 0.4 | 0.5 | 0.5 |

Here 1 = stressed and 0 = relaxed, and the first "transition" entry, 0.3, is really the initial probability of starting in H.
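As a quick check of the example above (a snippet of my own, not from the post), multiplying every entry in the emission and transition rows together gives the joint probability of this particular state path and observation sequence:

```python
# Emission and transition rows copied from the example table; the first
# "transition" entry, 0.3, is the initial probability of starting in H.
emissions   = [0.9, 0.9, 0.7, 0.7, 0.3, 0.9, 0.1, 0.9, 0.9]
transitions = [0.3, 0.2, 0.4, 0.6, 0.6, 0.1, 0.4, 0.5, 0.5]

joint = 1.0
for e, a in zip(emissions, transitions):
    joint *= e * a
print(joint)  # roughly 7.5e-07
```

A tiny number, because any one specific nine-step path is unlikely; the next section asks about the likelihood of the observation sequence on its own.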
What is the likelihood of a specific observation sequence?
$$P(O) = \sum_{X} P(O, X)$$

What the above equation describes is that to determine the likelihood of an observation sequence $O$, you must add up the joint probabilities of the observed sequence and every potential (hidden) state path $X$. So how do we do this?
Based on conditional probability:

$$P(O, X) = P(O \mid X)\,P(X)$$
Multiply the emission probabilities together: given a state sequence, the probability of emitting the observed sequence is

$$P(O \mid X) = \prod_{i=1}^{T} P(o_i \mid x_i)$$
Putting it all together: the probability of an observed sequence and a corresponding hidden state sequence is equal to the product of the emission probabilities multiplied by the product of the transition probabilities (with the initial probability playing the role of the first "transition", as in the example table above):

$$P(O, X) = \pi_{x_1}\, P(o_1 \mid x_1) \prod_{i=2}^{T} P(x_i \mid x_{i-1})\, P(o_i \mid x_i)$$
We could, in principle, list all possible state paths for a particular observation sequence. If we observed {1, 0, 0}, the possible state paths include SSS, SHH, SHN, SNN, SNH, HHH, HNS, HSH, HSS, etc. We could calculate the probability of each. For example, take SNS: the initial probability of sleep is 0.6, and the probability of being stressed given that the student is sleeping is 0.1. Sleep transitioning to Netflix is 0.1, and the probability of being relaxed while watching Netflix is 0.7. Netflix transitioning back to sleep is 0.3, and being relaxed while sleeping is 0.9, so P = 0.6 × 0.1 × 0.1 × 0.7 × 0.3 × 0.9 ≈ 0.0011.
To find the total probability of an observed sequence (e.g. {1, 0, 0}), we add up the probabilities of all the possible state paths (SNS, SSS, HSN, etc.).
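Here is a brute-force version of that sum as a sketch (my own code, not the post's): it enumerates all 3³ = 27 possible state paths for {1, 0, 0} and adds up their joint probabilities:

```python
from itertools import product

# States indexed S=0, H=1, N=2; observations encoded Relaxed=0, Stressed=1
initial    = [0.6, 0.3, 0.1]
emission   = [[0.9, 0.1], [0.1, 0.9], [0.7, 0.3]]
transition = [[0.5, 0.4, 0.1], [0.4, 0.2, 0.4], [0.3, 0.1, 0.6]]
obs = [1, 0, 0]

total = 0.0
for path in product(range(3), repeat=len(obs)):
    # joint probability of this particular state path and the observations
    p = initial[path[0]] * emission[path[0]][obs[0]]
    for t in range(1, len(obs)):
        p *= transition[path[t - 1]][path[t]] * emission[path[t]][obs[t]]
    total += p

print(round(total, 4))  # 0.1441
```

The number of paths grows exponentially with the length of the sequence, so this only works for toy examples, which is exactly the problem the forward algorithm solves.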
Forward algorithm
The forward algorithm uses dynamic programming: the problem is broken down into sub-problems whose solutions are combined to solve the larger problem. The solutions to the sub-problems are stored (memoization) so that you don't waste time computing them over and over again.
To implement this, you need to create an empty matrix to store intermediate values. The number of rows equals the number of states (in this case, 3), and the number of columns equals the length of the observation sequence.
First column: the first column holds the probability of each state being the initial state and emitting the first observation. All future calculations build on these values.
Subsequent columns: each value in the matrix represents the likelihood of being in a certain state $k$ having observed the first $i$ values of the observation sequence. We will represent this by:

$$\alpha_k(i) = P(o_1, \ldots, o_i,\; x_i = k) = P(o_i \mid x_i = k) \sum_{j} \alpha_j(i-1)\, P(x_i = k \mid x_{i-1} = j)$$
```python
def prettyPrint(grid):
    # helper (reconstructed here; the original post does not show its code):
    # prints the DP matrix with column indices on top and one row per state
    print("  " + " ".join(format(t, ">9") for t in range(len(grid[0]))))
    for label, row in zip(["S", "H", "N"], grid):
        print(label + " " + " ".join(format(v, "9.7f") for v in row))

def forward():
    # observation encoding: Relaxed = 0, Stressed = 1
    observations = [1, 0, 0, 1, 1, 0, 1, 1]

    # build empty dynamic programming matrix
    num_states = 3
    obs_seq_num = len(observations)
    grid = []
    for j in range(num_states):
        grid.append(obs_seq_num * [0])

    # initial probability distribution for S, H, N
    initial = [0.6, 0.3, 0.1]
    # emission probabilities (rows: S, H, N; columns: relaxed, stressed)
    emission = [[0.9, 0.1], [0.1, 0.9], [0.7, 0.3]]
    # transition probabilities (rows: from-state; columns: to-state)
    transition = [[0.5, 0.4, 0.1], [0.4, 0.2, 0.4], [0.3, 0.1, 0.6]]

    # start filling out the grid
    for t in range(obs_seq_num):
        for i in range(num_states):
            if t == 0:
                # first column: use the initial probability distribution
                grid[i][t] = initial[i] * emission[i][observations[t]]
            else:
                # emission probability times the weighted sum over every
                # state we could have come from
                grid[i][t] = emission[i][observations[t]] * (
                    grid[0][t - 1] * transition[0][i]
                    + grid[1][t - 1] * transition[1][i]
                    + grid[2][t - 1] * transition[2][i]
                )

    prettyPrint(grid)
    print("Probability of observing: ")
    for k in observations:
        print(k, end=" ")
    print()
    # each column sum is the probability of the observed prefix so far;
    # the last value is the probability of the full sequence
    for s in range(obs_seq_num):
        total = 0
        for q in range(num_states):
            total += grid[q][s]
        print(format(total, "5.4f"), end=" ")

forward()
```
```
##           0         1         2         3         4         5         6         7
## S 0.0600000 0.1323000 0.0873990 0.0061352 0.0021659 0.0064380 0.0005115 0.0001805
## H 0.2700000 0.0081000 0.0063780 0.0371420 0.0100145 0.0003557 0.0029078 0.0008235
## N 0.0300000 0.0924000 0.0503370 0.0124480 0.0068817 0.0058460 0.0012881 0.0005961
##
## Probability of observing:
## 1 0 0 1 1 0 1 1
## 0.3600 0.2328 0.1441 0.0557 0.0191 0.0126 0.0047 0.0016
```
For example, how did we calculate the probability of being at state S at time 2, with the observations so far being 1 0? Using the recurrence: $\alpha_S(2) = P(\text{relaxed} \mid S) \times [\alpha_S(1) \cdot 0.5 + \alpha_H(1) \cdot 0.4 + \alpha_N(1) \cdot 0.3] = 0.9 \times (0.06 \times 0.5 + 0.27 \times 0.4 + 0.03 \times 0.3) = 0.9 \times 0.147 = 0.1323$, the entry in row S of the second column above.
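The same arithmetic as a one-line check:

```python
# alpha_S at the second column, built from the first column's values
alpha_S = 0.9 * (0.06 * 0.5 + 0.27 * 0.4 + 0.03 * 0.3)
print(round(alpha_S, 4))  # 0.1323
```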
Decoding: Viterbi Algorithm
The Viterbi algorithm attempts to answer: given a sequence of observations, what is the most likely corresponding state path? It also uses dynamic programming, moving from the first observation to the current observation and keeping, at each step, the most likely path so far (i.e. the max). To recover the path itself, we must also use backtracking.
Again, similar to the forward algorithm, you will need a matrix to store intermediate maximum values. Each value represents the probability of the most likely path that ends in state $k$ after the first $i$ observations:

$$v_k(i) = P(o_i \mid x_i = k)\, \max_{j} \left[ v_j(i-1)\, P(x_i = k \mid x_{i-1} = j) \right]$$

The only change from the forward algorithm is that the sum over previous states becomes a max.
But what about the actual path? This is where we use backtracking and argmax instead of max. You need a separate matrix to keep track of these values. If you picture a downward-opening parabola, the y-value at the highest point is the max, while the x-value you plugged into the function to reach that point is the argmax. For example, if the max from before came from hidden state "S", "S" is what gets recorded in the backtracking matrix.
```python
def viterbi():
    # observation encoding: Relaxed = 0, Stressed = 1
    observations = [1, 0, 0, 1, 1, 0, 1, 1]

    # build empty dynamic programming and backtracking matrices
    num_states = 3
    obs_seq_num = len(observations)
    grid = []
    backtracking = []
    for j in range(num_states):
        grid.append(obs_seq_num * [0])
        backtracking.append(obs_seq_num * [0])

    # initial probability distribution for S, H, N
    initial = [0.6, 0.3, 0.1]
    # emission probabilities (rows: S, H, N; columns: relaxed, stressed)
    emission = [[0.9, 0.1], [0.1, 0.9], [0.7, 0.3]]
    # transition probabilities (rows: from-state; columns: to-state)
    transition = [[0.5, 0.4, 0.1], [0.4, 0.2, 0.4], [0.3, 0.1, 0.6]]

    for t in range(obs_seq_num):
        for i in range(num_states):
            if t == 0:
                # first column: use the initial probability distribution
                grid[i][t] = initial[i] * emission[i][observations[t]]
            else:
                option_S = grid[0][t - 1] * transition[0][i]
                option_H = grid[1][t - 1] * transition[1][i]
                option_N = grid[2][t - 1] * transition[2][i]
                # max goes in the grid, argmax in the backtracking matrix
                grid[i][t] = emission[i][observations[t]] * max(option_S, option_H, option_N)
                if option_S > option_H and option_S > option_N:
                    backtracking[i][t] = "S"
                elif option_H > option_N and option_H > option_S:
                    backtracking[i][t] = "H"
                else:
                    backtracking[i][t] = "N"

    prettyPrint(grid)  # defined above, with the forward algorithm
    print("Backtracking matrix")
    for w in backtracking:
        print(w)

    # termination: the most likely final state is the max of the last column
    if grid[0][-1] > grid[1][-1] and grid[0][-1] > grid[2][-1]:
        path = "S"
        pointer = 0
    elif grid[1][-1] > grid[0][-1] and grid[1][-1] > grid[2][-1]:
        path = "H"
        pointer = 1
    else:
        path = "N"
        pointer = 2

    # follow the backtracking pointers from the last column to the first
    for s in range((obs_seq_num - 1), 0, -1):
        hidden_state = backtracking[pointer][s]
        path = hidden_state + " " + path
        if hidden_state == "S":
            pointer = 0
        elif hidden_state == "H":
            pointer = 1
        else:
            pointer = 2

    print("Most likely state path: " + path)

viterbi()
```
```
##           0         1         2         3         4         5         6         7
## S 0.0600000 0.0972000 0.0437400 0.0021870 0.0006299 0.0010204 0.0000510 0.0000147
## H 0.2700000 0.0054000 0.0038880 0.0157464 0.0028344 0.0000567 0.0003673 0.0000661
## N 0.0300000 0.0756000 0.0317520 0.0057154 0.0018896 0.0007936 0.0001429 0.0000441
##
## Backtracking matrix
## [0, 'H', 'S', 'S', 'H', 'H', 'S', 'H']
## [0, 'H', 'S', 'S', 'H', 'H', 'S', 'H']
## [0, 'H', 'N', 'N', 'H', 'H', 'N', 'H']
## Most likely state path: H S S H H S H H
```
For example, how did they calculate the probability of the most probable path at time 2 if the last state is S? Take the maximum over the previous column: $v_S(2) = P(\text{relaxed} \mid S) \times \max(0.06 \times 0.5,\; 0.27 \times 0.4,\; 0.03 \times 0.3) = 0.9 \times 0.108 = 0.0972$. The maximum came from H, which is why "H" appears in row S of the second column of the backtracking matrix.
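And the same check in code:

```python
# v_S at the second column: max over the first column's weighted values
v_S = 0.9 * max(0.06 * 0.5, 0.27 * 0.4, 0.03 * 0.3)
print(round(v_S, 4))  # 0.0972 -- the winning term came from H
```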
To find the path, go to the last column of the probability matrix and find the state with the highest probability; that is the last hidden state: ? ? ? ? ? ? ? H. Then, using the backtracking matrix constructed from the argmax values, work backwards. In this case, since the last hidden state was H, go to the last column of the backtracking matrix and the row for H (the middle row), which also holds an H: ? ? ? ? ? ? H H. Therefore, go to the second-to-last column, again in the row for H; this holds the value S: ? ? ? ? ? S H H. Then go to the third-to-last column, row for S, and so on. Finally: H S S H H S H H.
Packages
This is too much math/I don't care. Fair enough: there are plenty of packages available on the internet. For example, these methods are implemented in the R package HMM, which I think is super helpful. Just understanding what the different terms mean (transition matrix, initial probability distribution, emission probability, etc.) should be enough to use these tools.
Application
This is a good exercise to test your understanding. It looks at predicting the secondary structure of proteins from the primary sequence. Another common application in bioinformatics is detecting CpG islands.