Chapter Objectives
By the end of this chapter, the reader should expect to accomplish the following:
Formulate hidden Markov models (HMMs) for probabilistic modeling over hidden states;
Gain familiarity with the Baum-Welch algorithm for fitting HMMs to time series data;
Use the Viterbi algorithm to find the most likely path;
Gain familiarity with state-space models and the application of Kalman filters to fit them; and
Apply particle filters to financial time series.
Hidden Markov Modeling
[Figure: The Bayesian network representing the conditional dependence relations between the observed and the hidden variables in the HMM. The conditional dependence relationships define the edges of a graph from parent nodes, S_t, to child nodes, Y_t.]
Example 7.1 Bull or Bear Market?
Suppose that the market is either in a Bear or Bull market regime, represented by s=0 or s=1, respectively. Such states or regimes are assumed to be hidden. Over each period, the market is observed to go up or down, represented by y=-1 or y=1. Assume that the emission probability matrix (the conditional dependency matrix between observed and hidden variables) is independent of time and given by

  P(y_t = y | s_t = s):

    y \ s  |  s = 0  |  s = 1
    y = -1 |   0.8   |   0.2
    y = 1  |   0.2   |   0.8
and the transition probability matrix for the Markov process {S_t} is given by

  A = [0.9  0.1]
      [0.1  0.9],    [A]_{ij} := P(S_t = s_i | S_{t-1} = s_j).
Given the observed sequence {-1, 1, 1} (i.e., T = 3), we can compute the probability of a realization of the hidden state sequence {1, 0, 0} using Eq. 7.1. Assuming that P(s_1 = 0) = P(s_1 = 1) = 1/2, the computation is

  P(s, y) = P(s_1 = 1) P(y_1 = -1 | s_1 = 1) P(s_2 = 0 | s_1 = 1) P(y_2 = 1 | s_2 = 0) P(s_3 = 0 | s_2 = 0) P(y_3 = 1 | s_3 = 0)
          = 0.5 · 0.2 · 0.1 · 0.2 · 0.9 · 0.2 = 0.00036.
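This computation can be sketched in a few lines of Python; the dictionary-based parameterization below is purely illustrative:

```python
# Joint probability of a hidden path and observations for the Bull/Bear
# example. Entries follow the emission table and transition matrix above;
# initial states are equally likely.

emission = {(-1, 0): 0.8, (-1, 1): 0.2, (1, 0): 0.2, (1, 1): 0.8}   # P(y | s)
transition = {(0, 0): 0.9, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.9}   # P(s' | s), keyed (s, s')
initial = {0: 0.5, 1: 0.5}

def joint_probability(states, obs):
    """P(s, y) = P(s1) P(y1|s1) * prod_t P(st|s_{t-1}) P(yt|st)."""
    p = initial[states[0]] * emission[(obs[0], states[0])]
    for t in range(1, len(states)):
        p *= transition[(states[t - 1], states[t])] * emission[(obs[t], states[t])]
    return p

p = joint_probability([1, 0, 0], [-1, 1, 1])
print(p)  # approximately 0.00036
```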
Define the forward and backward quantities

  F_t(s) := P(s_t = s, y_{1:t})   and   B_t(s) := p(y_{t+1:T} | s_t = s),

with the convention that B_T(s) = 1. For all t ∈ {1, …, T} and for all r, s ∈ {1, …, K} we have

  P(s_t = s, y) = F_t(s) B_t(s),

and combining the forward and backward quantities gives

  P(s_{t-1} = r, s_t = s, y) = F_{t-1}(r) P(s_t = s | s_{t-1} = r) p(y_t | s_t = s) B_t(s).
The Baum-Welch algorithm is an unsupervised learning algorithm for fitting HMMs which belongs to the class of EM algorithms; its E-step computes the forward and backward quantities above, which is why it is also known as the forward-backward algorithm.
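A minimal sketch of the forward-backward recursions in pure Python, using the Bull/Bear example's parameters (this is only the smoothing step; the full Baum-Welch EM loop would then re-estimate the parameters from these posteriors):

```python
# Forward-backward recursions for a K-state HMM with discrete emissions.
# F[t][s] = P(s_t = s, y_{1:t});  B[t][s] = p(y_{t+1:T} | s_t = s).
# Parameters are those of the Bull/Bear example in the text.

A = [[0.9, 0.1], [0.1, 0.9]]             # A[r][s] = P(s_t = s | s_{t-1} = r)
emis = {-1: [0.8, 0.2], 1: [0.2, 0.8]}   # emis[y][s] = P(y_t = y | s_t = s)
pi = [0.5, 0.5]
y = [-1, 1, 1]
T, K = len(y), 2

# Forward pass.
F = [[pi[s] * emis[y[0]][s] for s in range(K)]]
for t in range(1, T):
    F.append([emis[y[t]][s] * sum(F[t - 1][r] * A[r][s] for r in range(K))
              for s in range(K)])

# Backward pass (B_T(s) = 1 by convention).
B = [[1.0] * K for _ in range(T)]
for t in range(T - 2, -1, -1):
    B[t] = [sum(A[s][r] * emis[y[t + 1]][r] * B[t + 1][r] for r in range(K))
            for s in range(K)]

# Smoothed posteriors: P(s_t = s | y) = F_t(s) B_t(s) / P(y).
p_y = sum(F[T - 1])
post = [[F[t][s] * B[t][s] / p_y for s in range(K)] for t in range(T)]
```

Note that the identity P(s_t = s, y) = F_t(s) B_t(s) implies that summing F_t(s) B_t(s) over s gives P(y) for every t, which is a useful sanity check on an implementation.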
The Viterbi Algorithm
We can estimate the most likely hidden state sequence using the Viterbi algorithm.
Suppose once again that we observe a sequence of T observations, y={y1,…,yT}
However, for each 1 ≤ t ≤ T, y_t ∈ O, where O = {o_1, o_2, …, o_N}, N ∈ N, is now some observation space.
We suppose that, for each 1 ≤ t ≤ T, the observation y_t is driven by a (hidden) state s_t ∈ S, where S := {s_1, …, s_K}, K ∈ N, is some state space.
For example, yt might be the credit rating of a corporate bond and st might indicate some latent variable such as the overall health of the relevant industry sector.
Given y, what is the most likely sequence of hidden states, s = {s_1, s_2, …, s_T}?
A few more constructs are needed to answer this question:
the set of initial probabilities, π = {π_1, …, π_K}, so that π_i is the probability that s_1 = s_i, 1 ≤ i ≤ K;
the transition matrix A ∈ R^{K×K}, such that the element A_{ij}, 1 ≤ i, j ≤ K, is the probability of transitioning from state s_i to state s_j; and
the emission matrix B ∈ R^{K×N}, such that the element B_{ij}, 1 ≤ i ≤ K, 1 ≤ j ≤ N, is the probability of observing o_j from state s_i.
Example 7.2 The Crooked Dealer
A dealer has two coins: a fair coin, with P(Heads) = 1/2, and a loaded coin, with P(Heads) = 4/5. The dealer starts with the fair coin with probability 3/5. The dealer then tosses the coin several times. After each toss, there is a 2/5 probability of a switch to the other coin. The observed sequence is Heads, Tails, Heads, Tails, Heads, Heads, Heads, Tails, Heads.
In this case, the state space and observation space are, respectively,

  S = {s_1 = Fair, s_2 = Loaded},   O = {o_1 = Heads, o_2 = Tails}
with initial probabilities π = {π_1 = 0.6, π_2 = 0.4},
transition probabilities

  A = [0.6  0.4]
      [0.4  0.6],
and the emission matrix is

  B = [0.5  0.5]
      [0.8  0.2].
Given the sequence of observations y=(Heads, Tails, Heads, Tails, Heads, Heads, Heads, Tails, Heads) ,
we would like to find the most likely sequence of hidden states, s={s1,…,sT}, i.e., determine which of the two coins generated which of the coin tosses.
According to the Viterbi algorithm, the most likely state sequence, which produces the observation sequence y = {y_1, …, y_T}, satisfies the recurrence relations

  V_{1,k} = P(y_1 | s_1 = s_k) · π_k,
  V_{t,k} = max_{1 ≤ i ≤ K} ( P(y_t | s_t = s_k) · A_{ik} · V_{t-1,i} ),   t = 2, …, T,
where V_{t,k} is the probability of the most probable state sequence {s_1, …, s_t} that ends in s_t = s_k and explains the first t observations:

  V_{t,k} = max_{s_1, …, s_{t-1}} P(s_1, …, s_{t-1}, s_t = s_k, y_1, …, y_t).
The actual Viterbi path can be obtained by keeping track, at each step, of which state index i achieved the maximum in the recurrence. Let ξ(k, t) be the function that returns the value of i used to compute V_{t,k} if t > 1, or k if t = 1. The path is then recovered by backtracking: let k_T = argmax_{1 ≤ k ≤ K} V_{T,k} and, for t = T, …, 2, set k_{t-1} = ξ(k_t, t); the most likely sequence is s_t = s_{k_t}.
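The recurrence and backtracking steps can be sketched for the crooked dealer example as follows (the variable names are illustrative):

```python
# Viterbi algorithm for the crooked dealer example.
# States: 0 = Fair, 1 = Loaded; observations: "H" = Heads, "T" = Tails.

pi = [0.6, 0.4]                       # initial probabilities
A = [[0.6, 0.4], [0.4, 0.6]]          # A[i][j] = P(next = j | current = i)
B = [{"H": 0.5, "T": 0.5},            # emission probabilities, fair coin
     {"H": 0.8, "T": 0.2}]            # emission probabilities, loaded coin
obs = list("HTHTHHHTH")

K, T = 2, len(obs)
V = [[pi[k] * B[k][obs[0]] for k in range(K)]]   # V[0][k] = V_{1,k}
back = [[0] * K]                                 # back[t][k] records xi(k, t+1)

for t in range(1, T):
    row, ptr = [], []
    for k in range(K):
        best_i = max(range(K), key=lambda i: V[t - 1][i] * A[i][k])
        row.append(B[k][obs[t]] * V[t - 1][best_i] * A[best_i][k])
        ptr.append(best_i)
    V.append(row)
    back.append(ptr)

# Backtrack from the most likely final state.
state = max(range(K), key=lambda k: V[T - 1][k])
path = [state]
for t in range(T - 1, 0, -1):
    state = back[t][state]
    path.append(state)
path.reverse()

names = ["Fair", "Loaded"]
print([names[s] for s in path])
```

In practice the products V_{t,k} underflow for long sequences, so implementations usually carry log-probabilities and replace the products with sums.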
State-Space Models
The state transition probability p(s_t | s_{t-1}) can be decomposed into a deterministic component and noise:

  s_t = F_t(s_{t-1}) + ε_t,

where ε_t is zero-mean i.i.d. noise.
Similarly, the emission probability p(y_t | s_t) can be decomposed as

  y_t = G_t(s_t) + ξ_t,

where ξ_t is zero-mean i.i.d. observation noise.
If the functions F_t and G_t are linear and time-independent, then we have

  s_t = A s_{t-1} + ε_t,   y_t = C s_t + ξ_t,
where A is the state transition matrix and C is the observation matrix.
Particle Filtering
A Kalman filter maintains its state as moments of the multivariate Gaussian distribution, N(m,P).
This approach is appropriate when the state is Gaussian, or when the true distribution can be closely approximated by the Gaussian.
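As an illustrative sketch (not the book's implementation), here is a scalar Kalman filter for the linear model s_t = a·s_{t-1} + ε_t, y_t = c·s_t + ξ_t, carrying the Gaussian moments (m, P) through predict and update steps. All parameter values are assumptions chosen for the example:

```python
import random

# Scalar Kalman filter sketch for s_t = a*s_{t-1} + eps_t, y_t = c*s_t + xi_t.
# The filter maintains the Gaussian posterior N(m, P). Parameter values below
# are illustrative assumptions, not taken from the text.

a, c = 0.95, 1.0          # state transition and observation coefficients
q, r = 0.1, 0.5           # variances of eps_t and xi_t

def kalman_step(m, P, y):
    # Predict: propagate the moments through the linear dynamics.
    m_pred = a * m
    P_pred = a * P * a + q
    # Update: correct with the new observation via the Kalman gain.
    S = c * P_pred * c + r          # innovation variance
    K = P_pred * c / S              # Kalman gain
    m_new = m_pred + K * (y - c * m_pred)
    P_new = (1 - K * c) * P_pred
    return m_new, P_new

random.seed(0)
s, m, P = 0.0, 0.0, 1.0
estimates = []
for _ in range(50):
    s = a * s + random.gauss(0, q ** 0.5)    # simulate the latent state
    y = c * s + random.gauss(0, r ** 0.5)    # simulate a noisy observation
    m, P = kalman_step(m, P, y)
    estimates.append(m)
```

Note that the variance recursion for P does not depend on the data, so P converges to a steady-state value determined by (a, c, q, r) alone.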
What if the distribution of states is, for example, bimodal?
Perhaps the simplest way to approximate an arbitrary distribution is by data points sampled from that distribution. We refer to those data points as "particles."
The more particles we have, the more closely we can approximate the target distribution.
The empirical distribution is then given by the histogram.
Note that the particles need not be univariate, as in our example. They may be multivariate if we are approximating a multivariate distribution.
This setup gives rise to the family of algorithms known as particle filtering algorithms (Gordon et al. 1993; Kitagawa 1993).
One of the most common of them is the Sequential Importance Resampling (SIR) algorithm:
Sequential Importance Resampling (SIR)
a. Initialization step: At time t = 0, draw M i.i.d. samples from the initial distribution τ_0. Also, initialize M normalized (to 1) weights to the identical value 1/M. For i = 1, …, M, the samples will be denoted x̂_{0|0}^{(i)} and the normalized weights λ_0^{(i)}.
b. Recursive step: At time t = 1, …, T, let (x̂_{t-1|t-1}^{(i)})_{i=1,…,M} be the particles generated at time t-1.
Importance sampling: For i = 1, …, M, sample x̂_{t|t-1}^{(i)} from the Markov transition kernel τ_t(· | x̂_{t-1|t-1}^{(i)}). For i = 1, …, M, use the observation density to compute the non-normalized weights

  ω_t^{(i)} := λ_{t-1}^{(i)} · p(y_t | x̂_{t|t-1}^{(i)})

and the values of the normalized weights before resampling ("br")

  brλ_t^{(i)} := ω_t^{(i)} / Σ_{k=1}^{M} ω_t^{(k)}.
Resampling (or selection): For i = 1, …, M, use an appropriate resampling algorithm (such as multinomial resampling, see below) to sample x̂_{t|t}^{(i)} from the mixture

  Σ_{k=1}^{M} brλ_t^{(k)} δ(x_t - x̂_{t|t-1}^{(k)}),
where δ(·) denotes the Dirac delta generalized function, and set the normalized weights after resampling, λ_t^{(i)}, appropriately (for most common resampling algorithms this means λ_t^{(i)} := 1/M).
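The SIR recursion can be sketched in Python for a simple scalar model. The random-walk transition kernel, the Gaussian observation density, and the data below are illustrative assumptions, not taken from the text:

```python
import math
import random

# Sequential Importance Resampling (SIR) sketch for a scalar state-space model.
# Assumed model (illustrative): x_t = x_{t-1} + eps_t, y_t = x_t + xi_t,
# with both noises standard normal.

random.seed(1)
M = 500                                    # number of particles

def obs_density(y, x):
    """Gaussian observation density p(y | x) with unit variance."""
    return math.exp(-0.5 * (y - x) ** 2) / math.sqrt(2 * math.pi)

# Initialization: i.i.d. draws from tau_0 = N(0, 1), weights 1/M.
particles = [random.gauss(0, 1) for _ in range(M)]
weights = [1.0 / M] * M

observations = [0.5, 1.0, 1.4, 0.9]        # illustrative data
for y in observations:
    # Importance sampling: propagate through the transition kernel.
    proposed = [x + random.gauss(0, 1) for x in particles]
    # Non-normalized weights, then normalized weights before resampling.
    omega = [w * obs_density(y, x) for w, x in zip(weights, proposed)]
    total = sum(omega)
    br_lambda = [w / total for w in omega]
    # Multinomial resampling from the weighted mixture; weights reset to 1/M.
    particles = random.choices(proposed, weights=br_lambda, k=M)
    weights = [1.0 / M] * M

estimate = sum(particles) / M              # posterior mean estimate of x_T
```

Here `random.choices` performs the multinomial resampling step directly; the next subsection spells out what that step does with the cumulative weights.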
Multinomial Resampling
Notice from above that we are working with the normalized weights computed before resampling, brλ_t^{(1)}, …, brλ_t^{(M)}:
a. For i = 1, …, M, compute the cumulative sums

  brΛ_t^{(i)} = Σ_{k=1}^{i} brλ_t^{(k)},

so that, by construction, brΛ_t^{(M)} = 1.
b. Generate M random samples u_1, u_2, …, u_M from U(0, 1).
c. For each i = 1, …, M, choose the particle x̂_{t|t}^{(i)} = x̂_{t|t-1}^{(j)} with j ∈ {1, 2, …, M} such that u_i ∈ (brΛ_t^{(j-1)}, brΛ_t^{(j)}], where we set brΛ_t^{(0)} := 0.
Thus we end up with M new particles (children), x̂_{t|t}^{(1)}, …, x̂_{t|t}^{(M)}, sampled from the existing set x̂_{t|t-1}^{(1)}, …, x̂_{t|t-1}^{(M)}, so that some of the existing particles may disappear, while others may appear multiple times. For each i = 1, …, M, the number of times x̂_{t|t-1}^{(i)} appears in the resampled set of particles is known as the particle's replication factor, N_t^{(i)}.
We set the normalized weights after resampling: λ_t^{(i)} := 1/M. We can view this algorithm as sampling the replication factors N_t^{(1)}, …, N_t^{(M)} from the multinomial distribution with probabilities brλ_t^{(1)}, …, brλ_t^{(M)}, respectively. Hence the name of the method.
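Steps a-c translate directly into Python; `bisect` performs the interval lookup on the cumulative sums. The particle values and weights below are illustrative:

```python
import bisect
import random
from collections import Counter

# Multinomial resampling via cumulative sums and uniform draws, following
# steps a-c above. Particle values and weights are illustrative.

random.seed(42)
particles = [-1.2, -0.3, 0.1, 0.8, 2.0]        # x_{t|t-1}^{(1..M)}
br_lambda = [0.05, 0.10, 0.40, 0.30, 0.15]     # weights before resampling
M = len(particles)

# a. Cumulative sums brLambda^{(i)}; by construction the last is 1.
cumulative = []
running = 0.0
for w in br_lambda:
    running += w
    cumulative.append(running)

# b. M uniform draws; c. pick j with u_i in (brLambda^{(j-1)}, brLambda^{(j)}].
u = [random.random() for _ in range(M)]
resampled = [particles[bisect.bisect_left(cumulative, ui)] for ui in u]

# Replication factors N_t^{(i)}: how often each parent particle survives.
replication = Counter(resampled)
new_weights = [1.0 / M] * M                    # lambda_t^{(i)} := 1/M
```

Using `bisect_left` on the cumulative sums implements the half-open interval membership test in O(log M) per draw.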
Summary
Formulate hidden Markov models (HMMs) for probabilistic modeling over hidden states;
Gain familiarity with the Baum–Welch algorithm for fitting HMMs to time series data;
Use the Viterbi algorithm to find the most likely path;
Gain familiarity with state-space models and the application of Kalman filters to fit them; and
Apply particle filters to financial time series.