Chapter Objectives
By the end of this chapter, the reader should expect to accomplish the following:
Formulate hidden Markov models (HMMs) for probabilistic modeling over hidden states;
Gain familiarity with the Baum-Welch algorithm for fitting HMMs to time series data;
Use the Viterbi algorithm to find the most likely path;
Gain familiarity with state-space models and the application of Kalman filters to fit them; and
Apply particle filters to financial time series.
Hidden Markov Modeling
[Figure: The Bayesian network representing the conditional dependence relations between the observed and the hidden variables in the HMM. The conditional dependence relationships define the edges of a graph from parent nodes, S_t, to child nodes, Y_t.]
Example 7.1 Bull or Bear Market?
Suppose that the market is either in a Bear or Bull market regime, represented by s=0 or s=1, respectively. Such states or regimes are assumed to be hidden. Over each period, the market is observed to go up or down, represented by y=-1 or y=1. Assume that the emission probability matrix (the conditional dependency matrix between observed and hidden variables) is independent of time and given by

  P(y_t = y | s_t = s):

    y \ s  |  s = 0  |  s = 1
    y = -1 |   0.8   |   0.2
    y = 1  |   0.2   |   0.8
and the transition probability matrix for the Markov process {S_t} is given by

  A = [0.9  0.1]
      [0.1  0.9],    [A]_{ij} := P(S_t = s_i | S_{t-1} = s_j).
Given the observed sequence {-1, 1, 1} (i.e., T = 3), we can compute the probability of a realization of the hidden state sequence {1, 0, 0} using Eq. 7.1. Assuming that P(s_1 = 0) = P(s_1 = 1) = 1/2, the computation is

  P(s, y) = P(s_1 = 1) P(y_1 = -1 | s_1 = 1) P(s_2 = 0 | s_1 = 1) P(y_2 = 1 | s_2 = 0) P(s_3 = 0 | s_2 = 0) P(y_3 = 1 | s_3 = 0)
          = 0.5 · 0.2 · 0.1 · 0.2 · 0.9 · 0.2 = 0.00036.
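This computation can be sketched in a few lines of Python; the dictionary-based parameterization below is purely illustrative:

```python
# Joint probability of a hidden path and observations for the Bull/Bear
# example. Entries follow the emission table and transition matrix above;
# initial states are equally likely.

emission = {(-1, 0): 0.8, (-1, 1): 0.2, (1, 0): 0.2, (1, 1): 0.8}   # P(y | s)
transition = {(0, 0): 0.9, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.9}   # P(s' | s), keyed (s, s')
initial = {0: 0.5, 1: 0.5}

def joint_probability(states, obs):
    """P(s, y) = P(s1) P(y1|s1) * prod_t P(st|s_{t-1}) P(yt|st)."""
    p = initial[states[0]] * emission[(obs[0], states[0])]
    for t in range(1, len(states)):
        p *= transition[(states[t - 1], states[t])] * emission[(obs[t], states[t])]
    return p

p = joint_probability([1, 0, 0], [-1, 1, 1])
print(p)  # approximately 0.00036
```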
Define the forward and backward quantities

  F_t(s) := P(s_t = s, y_{1:t})   and   B_t(s) := p(y_{t+1:T} | s_t = s),

with the convention that B_T(s) = 1. For all t ∈ {1, …, T} and for all r, s ∈ {1, …, K} we have

  P(s_t = s, y) = F_t(s) B_t(s),

and combining the forward and backward quantities gives

  P(s_{t-1} = r, s_t = s, y) = F_{t-1}(r) P(s_t = s | s_{t-1} = r) p(y_t | s_t = s) B_t(s).
The Baum-Welch algorithm is an unsupervised learning algorithm for fitting HMMs which belongs to the class of EM algorithms; its E-step computes the forward and backward quantities above, which is why it is also known as the forward-backward algorithm.
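A minimal sketch of the forward-backward recursions in pure Python, using the Bull/Bear example's parameters (this is only the smoothing step; the full Baum-Welch EM loop would then re-estimate the parameters from these posteriors):

```python
# Forward-backward recursions for a K-state HMM with discrete emissions.
# F[t][s] = P(s_t = s, y_{1:t});  B[t][s] = p(y_{t+1:T} | s_t = s).
# Parameters are those of the Bull/Bear example in the text.

A = [[0.9, 0.1], [0.1, 0.9]]             # A[r][s] = P(s_t = s | s_{t-1} = r)
emis = {-1: [0.8, 0.2], 1: [0.2, 0.8]}   # emis[y][s] = P(y_t = y | s_t = s)
pi = [0.5, 0.5]
y = [-1, 1, 1]
T, K = len(y), 2

# Forward pass.
F = [[pi[s] * emis[y[0]][s] for s in range(K)]]
for t in range(1, T):
    F.append([emis[y[t]][s] * sum(F[t - 1][r] * A[r][s] for r in range(K))
              for s in range(K)])

# Backward pass (B_T(s) = 1 by convention).
B = [[1.0] * K for _ in range(T)]
for t in range(T - 2, -1, -1):
    B[t] = [sum(A[s][r] * emis[y[t + 1]][r] * B[t + 1][r] for r in range(K))
            for s in range(K)]

# Smoothed posteriors: P(s_t = s | y) = F_t(s) B_t(s) / P(y).
p_y = sum(F[T - 1])
post = [[F[t][s] * B[t][s] / p_y for s in range(K)] for t in range(T)]
```

Note that the identity P(s_t = s, y) = F_t(s) B_t(s) implies that summing F_t(s) B_t(s) over s gives P(y) for every t, which is a useful sanity check on an implementation.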
The Viterbi Algorithm
We can estimate the most likely hidden state sequence using the Viterbi algorithm.
Suppose once again that we observe a sequence of T observations, y={y1,…,yT}
However, for each 1 ≤ t ≤ T, y_t ∈ O, where O = {o_1, o_2, …, o_N}, N ∈ N, is now some observation space.
We suppose that, for each 1 ≤ t ≤ T, the observation y_t is driven by a (hidden) state s_t ∈ S, where S := {s_1, …, s_K}, K ∈ N, is some state space.
For example, yt might be the credit rating of a corporate bond and st might indicate some latent variable such as the overall health of the relevant industry sector.
Given y, what is the most likely sequence of hidden states, s = {s_1, s_2, …, s_T}?
A few more constructs are needed to answer this question:
the set of initial probabilities, π = {π_1, …, π_K}, so that π_i is the probability that s_1 = s_i, 1 ≤ i ≤ K;
the transition matrix A ∈ R^{K×K}, such that the element A_{ij}, 1 ≤ i, j ≤ K, is the probability of transitioning from state s_i to state s_j; and
the emission matrix B ∈ R^{K×N}, such that the element B_{ij}, 1 ≤ i ≤ K, 1 ≤ j ≤ N, is the probability of observing o_j from state s_i.
Example 7.2 The Crooked Dealer
A dealer has two coins: a fair coin, with P(Heads) = 1/2, and a loaded coin, with P(Heads) = 4/5. The dealer starts with the fair coin with probability 3/5. The dealer then tosses the coin several times. After each toss, there is a 2/5 probability of a switch to the other coin. The observed sequence is Heads, Tails, Heads, Tails, Heads, Heads, Heads, Tails, Heads.
In this case, the state space and observation space are, respectively,

  S = {s_1 = Fair, s_2 = Loaded},   O = {o_1 = Heads, o_2 = Tails}
with initial probabilities π = {π_1 = 0.6, π_2 = 0.4},
transition probabilities

  A = [0.6  0.4]
      [0.4  0.6],
and the emission matrix is

  B = [0.5  0.5]
      [0.8  0.2].
Given the sequence of observations y=(Heads, Tails, Heads, Tails, Heads, Heads, Heads, Tails, Heads) ,
we would like to find the most likely sequence of hidden states, s={s1,…,sT}, i.e., determine which of the two coins generated which of the coin tosses.
According to the Viterbi algorithm, the most likely state sequence, which produces the observation sequence y = {y_1, …, y_T}, satisfies the recurrence relations

  V_{1,k} = P(y_1 | s_1 = s_k) · π_k,
  V_{t,k} = max_{1 ≤ i ≤ K} ( P(y_t | s_t = s_k) · A_{ik} · V_{t-1,i} ),   t = 2, …, T,
where V_{t,k} is the probability of the most probable state sequence {s_1, …, s_t} that ends in s_t = s_k and explains the first t observations:

  V_{t,k} = max_{s_1, …, s_{t-1}} P(s_1, …, s_{t-1}, s_t = s_k, y_1, …, y_t).
The actual Viterbi path can be obtained by keeping track, at each step, of which state index i achieved the maximum in the recurrence. Let ξ(k, t) be the function that returns the value of i used to compute V_{t,k} if t > 1, or k if t = 1. The path is then recovered by backtracking: let k_T = argmax_{1 ≤ k ≤ K} V_{T,k} and, for t = T, …, 2, set k_{t-1} = ξ(k_t, t); the most likely sequence is s_t = s_{k_t}.
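The recurrence and backtracking steps can be sketched for the crooked dealer example as follows (the variable names are illustrative):

```python
# Viterbi algorithm for the crooked dealer example.
# States: 0 = Fair, 1 = Loaded; observations: "H" = Heads, "T" = Tails.

pi = [0.6, 0.4]                       # initial probabilities
A = [[0.6, 0.4], [0.4, 0.6]]          # A[i][j] = P(next = j | current = i)
B = [{"H": 0.5, "T": 0.5},            # emission probabilities, fair coin
     {"H": 0.8, "T": 0.2}]            # emission probabilities, loaded coin
obs = list("HTHTHHHTH")

K, T = 2, len(obs)
V = [[pi[k] * B[k][obs[0]] for k in range(K)]]   # V[0][k] = V_{1,k}
back = [[0] * K]                                 # back[t][k] records xi(k, t+1)

for t in range(1, T):
    row, ptr = [], []
    for k in range(K):
        best_i = max(range(K), key=lambda i: V[t - 1][i] * A[i][k])
        row.append(B[k][obs[t]] * V[t - 1][best_i] * A[best_i][k])
        ptr.append(best_i)
    V.append(row)
    back.append(ptr)

# Backtrack from the most likely final state.
state = max(range(K), key=lambda k: V[T - 1][k])
path = [state]
for t in range(T - 1, 0, -1):
    state = back[t][state]
    path.append(state)
path.reverse()

names = ["Fair", "Loaded"]
print([names[s] for s in path])
```

In practice the products V_{t,k} underflow for long sequences, so implementations usually carry log-probabilities and replace the products with sums.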
State-Space Models
The state transition probability p(s_t | s_{t-1}) can be decomposed into a deterministic component and noise:

  s_t = F_t(s_{t-1}) + ε_t,

where ε_t is zero-mean i.i.d. noise.
Similarly, the emission probability p(y_t | s_t) can be decomposed as

  y_t = G_t(s_t) + ξ_t,

where ξ_t is zero-mean i.i.d. observation noise.
If the functions F_t and G_t are linear and time-independent, then we have

  s_t = A s_{t-1} + ε_t,   y_t = C s_t + ξ_t,
where A is the state transition matrix and C is the observation matrix.
Particle Filtering
A Kalman filter maintains its state as moments of the multivariate Gaussian distribution, N(m,P).
This approach is appropriate when the state is Gaussian, or when the true distribution can be closely approximated by the Gaussian.
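As an illustrative sketch (not the book's implementation), here is a scalar Kalman filter for the linear model s_t = a·s_{t-1} + ε_t, y_t = c·s_t + ξ_t, carrying the Gaussian moments (m, P) through predict and update steps. All parameter values are assumptions chosen for the example:

```python
import random

# Scalar Kalman filter sketch for s_t = a*s_{t-1} + eps_t, y_t = c*s_t + xi_t.
# The filter maintains the Gaussian posterior N(m, P). Parameter values below
# are illustrative assumptions, not taken from the text.

a, c = 0.95, 1.0          # state transition and observation coefficients
q, r = 0.1, 0.5           # variances of eps_t and xi_t

def kalman_step(m, P, y):
    # Predict: propagate the moments through the linear dynamics.
    m_pred = a * m
    P_pred = a * P * a + q
    # Update: correct with the new observation via the Kalman gain.
    S = c * P_pred * c + r          # innovation variance
    K = P_pred * c / S              # Kalman gain
    m_new = m_pred + K * (y - c * m_pred)
    P_new = (1 - K * c) * P_pred
    return m_new, P_new

random.seed(0)
s, m, P = 0.0, 0.0, 1.0
estimates = []
for _ in range(50):
    s = a * s + random.gauss(0, q ** 0.5)    # simulate the latent state
    y = c * s + random.gauss(0, r ** 0.5)    # simulate a noisy observation
    m, P = kalman_step(m, P, y)
    estimates.append(m)
```

Note that the variance recursion for P does not depend on the data, so P converges to a steady-state value determined by (a, c, q, r) alone.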
What if the distribution of states is, for example, bimodal?
Perhaps the simplest way to approximate an arbitrary distribution is by data points sampled from that distribution. We refer to those data points as "particles."
The more particles we have, the more closely we can approximate the target distribution.
The empirical distribution is then given by the histogram.
Note that the particles need not be univariate, as in our example. They may be multivariate if we are approximating a multivariate distribution.
This setup gives rise to the family of algorithms known as particle filtering algorithms (Gordon et al. 1993; Kitagawa 1993).
One of the most common of them is the Sequential Importance Resampling (SIR) algorithm:
Sequential Importance Resampling (SIR)
a. Initialization step: At time t = 0, draw M i.i.d. samples from the initial distribution τ_0. Also, initialize M normalized (to 1) weights to the identical value 1/M. For i = 1, …, M, the samples will be denoted x̂_{0|0}^{(i)} and the normalized weights λ_0^{(i)}.
b. Recursive step: At time t = 1, …, T, let (x̂_{t-1|t-1}^{(i)})_{i=1,…,M} be the particles generated at time t-1.
Importance sampling: For i = 1, …, M, sample x̂_{t|t-1}^{(i)} from the Markov transition kernel τ_t(· | x̂_{t-1|t-1}^{(i)}). For i = 1, …, M, use the observation density to compute the non-normalized weights

  ω_t^{(i)} := λ_{t-1}^{(i)} · p(y_t | x̂_{t|t-1}^{(i)})

and the values of the normalized weights before resampling ("br")

  brλ_t^{(i)} := ω_t^{(i)} / Σ_{k=1}^{M} ω_t^{(k)}.
Resampling (or selection): For i = 1, …, M, use an appropriate resampling algorithm (such as multinomial resampling, see below) to sample x̂_{t|t}^{(i)} from the mixture

  Σ_{k=1}^{M} brλ_t^{(k)} δ(x_t - x̂_{t|t-1}^{(k)}),
where δ(·) denotes the Dirac delta generalized function, and set the normalized weights after resampling, λ_t^{(i)}, appropriately (for most common resampling algorithms this means λ_t^{(i)} := 1/M).
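The SIR recursion can be sketched in Python for a simple scalar model. The random-walk transition kernel, the Gaussian observation density, and the data below are illustrative assumptions, not taken from the text:

```python
import math
import random

# Sequential Importance Resampling (SIR) sketch for a scalar state-space model.
# Assumed model (illustrative): x_t = x_{t-1} + eps_t, y_t = x_t + xi_t,
# with both noises standard normal.

random.seed(1)
M = 500                                    # number of particles

def obs_density(y, x):
    """Gaussian observation density p(y | x) with unit variance."""
    return math.exp(-0.5 * (y - x) ** 2) / math.sqrt(2 * math.pi)

# Initialization: i.i.d. draws from tau_0 = N(0, 1), weights 1/M.
particles = [random.gauss(0, 1) for _ in range(M)]
weights = [1.0 / M] * M

observations = [0.5, 1.0, 1.4, 0.9]        # illustrative data
for y in observations:
    # Importance sampling: propagate through the transition kernel.
    proposed = [x + random.gauss(0, 1) for x in particles]
    # Non-normalized weights, then normalized weights before resampling.
    omega = [w * obs_density(y, x) for w, x in zip(weights, proposed)]
    total = sum(omega)
    br_lambda = [w / total for w in omega]
    # Multinomial resampling from the weighted mixture; weights reset to 1/M.
    particles = random.choices(proposed, weights=br_lambda, k=M)
    weights = [1.0 / M] * M

estimate = sum(particles) / M              # posterior mean estimate of x_T
```

Here `random.choices` performs the multinomial resampling step directly; the next subsection spells out what that step does with the cumulative weights.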
Multinomial Resampling
Notice from above that we are working with the normalized weights computed before resampling, brλ_t^{(1)}, …, brλ_t^{(M)}:
a. For i = 1, …, M, compute the cumulative sums

  brΛ_t^{(i)} = Σ_{k=1}^{i} brλ_t^{(k)},

so that, by construction, brΛ_t^{(M)} = 1.
b. Generate M random samples u_1, u_2, …, u_M from U(0, 1).
c. For each i = 1, …, M, choose the particle x̂_{t|t}^{(i)} = x̂_{t|t-1}^{(j)} with j ∈ {1, 2, …, M} such that u_i ∈ (brΛ_t^{(j-1)}, brΛ_t^{(j)}], where we set brΛ_t^{(0)} := 0.
Thus we end up with M new particles (children), x̂_{t|t}^{(1)}, …, x̂_{t|t}^{(M)}, sampled from the existing set x̂_{t|t-1}^{(1)}, …, x̂_{t|t-1}^{(M)}, so that some of the existing particles may disappear, while others may appear multiple times. For each i = 1, …, M, the number of times x̂_{t|t-1}^{(i)} appears in the resampled set of particles is known as the particle's replication factor, N_t^{(i)}.
We set the normalized weights after resampling: λ_t^{(i)} := 1/M. We can view this algorithm as sampling the replication factors N_t^{(1)}, …, N_t^{(M)} from the multinomial distribution with probabilities brλ_t^{(1)}, …, brλ_t^{(M)}, respectively. Hence the name of the method.
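Steps a-c translate directly into Python; `bisect` performs the interval lookup on the cumulative sums. The particle values and weights below are illustrative:

```python
import bisect
import random
from collections import Counter

# Multinomial resampling via cumulative sums and uniform draws, following
# steps a-c above. Particle values and weights are illustrative.

random.seed(42)
particles = [-1.2, -0.3, 0.1, 0.8, 2.0]        # x_{t|t-1}^{(1..M)}
br_lambda = [0.05, 0.10, 0.40, 0.30, 0.15]     # weights before resampling
M = len(particles)

# a. Cumulative sums brLambda^{(i)}; by construction the last is 1.
cumulative = []
running = 0.0
for w in br_lambda:
    running += w
    cumulative.append(running)

# b. M uniform draws; c. pick j with u_i in (brLambda^{(j-1)}, brLambda^{(j)}].
u = [random.random() for _ in range(M)]
resampled = [particles[bisect.bisect_left(cumulative, ui)] for ui in u]

# Replication factors N_t^{(i)}: how often each parent particle survives.
replication = Counter(resampled)
new_weights = [1.0 / M] * M                    # lambda_t^{(i)} := 1/M
```

Using `bisect_left` on the cumulative sums implements the half-open interval membership test in O(log M) per draw.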
Summary
Formulate hidden Markov models (HMMs) for probabilistic modeling over hidden states;
Gain familiarity with the Baum–Welch algorithm for fitting HMMs to time series data;
Use the Viterbi algorithm to find the most likely path;
Gain familiarity with state-space models and the application of Kalman filters to fit them; and
Apply particle filters to financial time series.