Chapter Objectives
By the end of this chapter, the reader should expect to accomplish the following:
Explain and analyze linear autoregressive models;
Understand the classical approaches to identifying, fitting, and diagnosing autoregressive models;
Apply simple heteroscedastic regression techniques to time series data;
Understand how exponential smoothing can be used to predict and filter time series; and
Project multivariate time series data onto lower dimensional spaces with principal component analysis.
Autoregressive Modeling
Preliminaries
Before we can build a model to predict $Y_t$, we recall some basic definitions and terminology, starting in a continuous-time setting and thereafter continuing solely in discrete time.
Stochastic Process
A stochastic process is a sequence of random variables, indexed by continuous time: $\{Y_t\}_{t=-\infty}^{\infty}$.
Time Series
A time series is a sequence of observations of a stochastic process at discrete times over a specific interval: $\{y_t\}_{t=1}^{n}$.
Autocovariance
The $j$th autocovariance of a time series is $\gamma_{jt} := \mathbb{E}[(y_t - \mu_t)(y_{t-j} - \mu_{t-j})]$, where $\mu_t := \mathbb{E}[y_t]$.
Covariance (Weak) Stationarity
A time series is weakly (or wide-sense) covariance stationary if it has a time-constant mean and autocovariances of all orders: $\mu_t = \mu, \ \forall t$ and $\gamma_{jt} = \gamma_j, \ \forall t$.
Autocorrelation
The $j$th autocorrelation, $\tau_j$, is just the $j$th autocovariance divided by the variance: $\tau_j = \gamma_j / \gamma_0$.
White Noise
White noise, $\epsilon_t$, is i.i.d. error which satisfies all three conditions:
a. $\mathbb{E}[\epsilon_t] = 0, \ \forall t$;
b. $\mathbb{V}[\epsilon_t] = \sigma^2, \ \forall t$; and
c. $\epsilon_t$ and $\epsilon_s$ are independent for $t \neq s$.
Gaussian white noise just adds a normality assumption to the error.
White noise error is often referred to as a "disturbance," "shock," or "innovation" in the financial econometrics literature.
Autoregressive Processes
An autoregressive process describes $y_t$ as a linear combination of $p$ past observations plus white noise.
AR(p) Process
The $p$th order autoregressive process of a variable $Y_t$ depends only on the previous values of the variable plus a white noise disturbance term: $y_t = \mu + \sum_{i=1}^{p} \phi_i y_{t-i} + \epsilon_t$,
where $\epsilon_t$ is independent of $\{y_{t-i}\}_{i=1}^{p}$. We refer to $\mu$ as the drift term and to $p$ as the order of the model.
Define the lag polynomial $\phi(L) := (1 - \phi_1 L - \phi_2 L^2 - \dots - \phi_p L^p)$, where $y_{t-j}$ is the $j$th lagged observation of $y_t$ given by the lag operator (or backshift operator), $y_{t-j} = L^j[y_t]$.
The AR(p) process can then be expressed in the more compact form $\phi(L)[y_t] = \mu + \epsilon_t$.
Stability
Stability concerns whether past disturbances exhibit a growing or decaying impact on the current value of $y$ as the lag increases.
To see this, consider the AR(1) process and write $y_t$ in terms of the inverse of $\phi(L)$: $y_t = \phi^{-1}(L)[\mu + \epsilon_t]$.
For an AR(1) process, $y_t = \frac{1}{1-\phi L}[\mu + \epsilon_t] = \sum_{j=0}^{\infty} \phi^j L^j[\mu + \epsilon_t]$.
The infinite sum will be stable, i.e. the $\phi^j$ terms do not grow with $j$, provided that $|\phi| < 1$.
unstable AR(p) processes exhibit the counter-intuitive behavior that the error disturbance terms become increasingly influential as the lag increases.
We can calculate the Impulse Response Function (IRF), $\frac{\partial y_t}{\partial \epsilon_{t-j}}, \ \forall j$, to characterize the influence of past disturbances.
For the AR(1) model, the IRF is given by $\phi^j$ and hence decays geometrically when the model is stable.
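The geometric decay of the IRF can be checked numerically. The sketch below (names are illustrative) propagates a single unit shock through the AR(1) recursion and compares the result with the analytic IRF $\phi^j$:

```python
import numpy as np

def irf_by_simulation(phi: float, horizon: int) -> np.ndarray:
    """Response of y at lag j to a unit shock in y_t = phi*y_{t-1} + eps_t."""
    y = np.zeros(horizon)
    y[0] = 1.0  # the unit impulse enters at j = 0
    for j in range(1, horizon):
        y[j] = phi * y[j - 1]  # no further shocks, only propagation
    return y

phi = 0.8
analytic = phi ** np.arange(10)        # IRF of a stable AR(1): phi**j
simulated = irf_by_simulation(phi, 10)
```

Since $|\phi| < 1$, the simulated response shrinks toward zero, matching $\phi^j$ term by term.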
Stationarity
Stationarity is a sufficient condition for the autocorrelation function of an AR(p) model to converge to zero as the lag increases.
Factor the characteristic polynomial as $\Phi(z) = (1-\lambda_1 z)(1-\lambda_2 z)\cdots(1-\lambda_p z) = 0$.
An AR(p) model is strictly stationary and ergodic if all the roots $z_i = 1/\lambda_i$ lie outside the unit circle in the complex plane $\mathbb{C}$.
That is, $|\lambda_i| < 1$ for $i \in \{1,\dots,p\}$, where $|\cdot|$ is the modulus of a complex number.
If the characteristic equation has at least one unit root, with all other roots lying outside the unit circle, then the process is a special case of non-stationarity and, in particular, is not strictly stationary.
Stationarity of Random Walk
We can show that the following random walk (a zero-mean AR(1) process) is not strictly stationary: $y_t = y_{t-1} + \epsilon_t$.
Written in compact form this gives $\Phi(L)[y_t] = \epsilon_t$, with $\Phi(L) = 1 - L$,
and the characteristic polynomial, $\Phi(z) = 1 - z = 0$, implies the real root $z = 1$. Hence the root lies on the unit circle and the model is a special case of non-stationarity.
Finding the roots of a polynomial is equivalent to finding eigenvalues.
The Cayley–Hamilton theorem implies that the roots of any monic polynomial can be found by turning it into a matrix and finding the eigenvalues.
Given the degree-$p$ monic polynomial $q(z) = c_0 + c_1 z + \dots + c_{p-1}z^{p-1} + z^p$,
we define the $p \times p$ companion matrix
$$C := \begin{pmatrix} 0 & 1 & 0 & \cdots & 0 \\ 0 & 0 & 1 & \cdots & 0 \\ \vdots & & \ddots & \ddots & \vdots \\ 0 & 0 & \cdots & 0 & 1 \\ -c_0 & -c_1 & \cdots & -c_{p-2} & -c_{p-1} \end{pmatrix}.$$
The characteristic polynomial satisfies $\det(\lambda I - C) = q(\lambda)$, and so the eigenvalues of $C$ are the roots of $q$.
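A minimal sketch of this idea, with `companion_matrix` as an illustrative helper: build the companion matrix of a monic polynomial and recover its roots as eigenvalues.

```python
import numpy as np

def companion_matrix(c):
    """C for q(z) = c[0] + c[1] z + ... + c[p-1] z^(p-1) + z^p:
    ones on the superdiagonal, -c along the last row."""
    p = len(c)
    C = np.zeros((p, p))
    C[:-1, 1:] = np.eye(p - 1)
    C[-1, :] = -np.asarray(c, dtype=float)
    return C

# q(z) = z^2 - 1 = (z - 1)(z + 1), so c = [-1, 0] and the roots are +/- 1.
C = companion_matrix([-1.0, 0.0])
roots = np.sort(np.linalg.eigvals(C).real)
```

The same idea underlies `numpy.roots`, which also forms a companion matrix internally.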
Partial Autocorrelations
The order, $p$, of an AR(p) model can be determined from time series data, provided the data is stationary.
This signature encodes the memory in the model and is given by the "partial autocorrelations":
each partial autocorrelation measures the correlation of a random variable, $y_t$, with its lag, $y_{t-h}$, while controlling for the intermediate lags.
Partial Autocorrelation
A partial autocorrelation at lag $h \geq 2$ is a conditional autocorrelation between a variable, $y_t$, and its $h$th lag, $y_{t-h}$, under the assumption that the values of the intermediate lags, $y_{t-1},\dots,y_{t-h+1}$, are controlled: $\tilde{\tau}_h := \tilde{\tau}_{t,t-h} := \frac{\tilde{\gamma}_h}{\sqrt{\tilde{\gamma}_{t,h}\,\tilde{\gamma}_{t-h,h}}}$,
where $\tilde{\gamma}_h := \tilde{\gamma}_{t,t-h} := \mathbb{E}\big[\big(y_t - P(y_t \mid y_{t-1},\dots,y_{t-h+1})\big)\big(y_{t-h} - P(y_{t-h} \mid y_{t-1},\dots,y_{t-h+1})\big)\big]$
is the lag-$h$ partial autocovariance, $P(W \mid Z)$ is an orthogonal projection of $W$ onto the set $Z$, and $\tilde{\gamma}_{t,h} := \mathbb{E}\big[\big(y_t - P(y_t \mid y_{t-1},\dots,y_{t-h+1})\big)^2\big]$.
The partial autocorrelation function $\tilde{\tau} : \mathbb{N} \to [-1,1]$ is the map $h \mapsto \tilde{\tau}_h$. The plot of $\tilde{\tau}_h$ against $h$ is referred to as the partial correlogram.
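In practice the sample PACF can be estimated via regression: the lag-$h$ partial autocorrelation equals the coefficient on the deepest lag when $y_t$ is regressed on its first $h$ lags. A sketch under that assumption, with `pacf` as an illustrative helper (not a library function):

```python
import numpy as np

def pacf(y, max_lag):
    """Sample PACF: for each h, regress y_t on (y_{t-1}, ..., y_{t-h});
    the coefficient on the deepest lag is the lag-h partial autocorrelation."""
    y = np.asarray(y, dtype=float)
    y = y - y.mean()
    T = len(y)
    out = np.empty(max_lag)
    for h in range(1, max_lag + 1):
        # regressor columns are lags 1..h of the target y[h:]
        X = np.column_stack([y[h - i: T - i] for i in range(1, h + 1)])
        beta, *_ = np.linalg.lstsq(X, y[h:], rcond=None)
        out[h - 1] = beta[-1]
    return out

# For an AR(1) with phi = 0.7, the PACF should be ~0.7 at lag 1 and near zero after.
rng = np.random.default_rng(0)
T = 2000
y = np.zeros(T)
eps = rng.standard_normal(T)
for t in range(1, T):
    y[t] = 0.7 * y[t - 1] + eps[t]
p = pacf(y, 3)
```

The sharp cutoff after lag $p$ is exactly the signature used to identify the order of an AR(p) model.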
Maximum Likelihood Estimation
The exact likelihood, when the density of the data is independent of $(\phi, \sigma_n^2)$, is $L(y, x; \phi, \sigma_n^2) = \prod_{t=1}^{T} f_{Y_t \mid X_t}(y_t \mid x_t; \phi, \sigma_n^2)\, f_{X_t}(x_t)$.
In this case the exact likelihood is proportional to the conditional likelihood function: $L(y, x; \phi, \sigma_n^2) \propto L(y \mid x; \phi, \sigma_n^2) = \prod_{t=1}^{T} f_{Y_t \mid X_t}(y_t \mid x_t; \phi, \sigma_n^2) = (2\pi\sigma_n^2)^{-T/2}\exp\left\{-\frac{1}{2\sigma_n^2}\sum_{t=1}^{T}(y_t - \phi^T x_t)^2\right\}$.
In many cases such an assumption about the independence of the density of the data and the parameters is not warranted.
Consider the zero-mean AR(1) with unknown noise variance: $y_t = \phi y_{t-1} + \epsilon_t$, $\epsilon_t \sim N(0, \sigma_n^2)$, so that $Y_t \mid Y_{t-1} \sim N(\phi y_{t-1}, \sigma_n^2)$ and $Y_1 \sim N\left(0, \frac{\sigma_n^2}{1-\phi^2}\right)$.
The exact likelihood is $L(y; \phi, \sigma_n^2) = f_{Y_1}(y_1; \phi, \sigma_n^2)\prod_{t=2}^{T} f_{Y_t \mid Y_{t-1}}(y_t \mid y_{t-1}; \phi, \sigma_n^2) = \left(\frac{2\pi\sigma_n^2}{1-\phi^2}\right)^{-1/2}\exp\left\{-\frac{1-\phi^2}{2\sigma_n^2}y_1^2\right\}\,(2\pi\sigma_n^2)^{-\frac{T-1}{2}}\exp\left\{-\frac{1}{2\sigma_n^2}\sum_{t=2}^{T}(y_t - \phi y_{t-1})^2\right\}$.
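The exact likelihood above can be coded directly. The sketch below (all names illustrative) evaluates the exact log-likelihood of simulated AR(1) data and maximizes it over a grid of $\phi$ values, holding $\sigma_n^2$ at its true value for simplicity:

```python
import numpy as np

def ar1_exact_loglik(y, phi, sigma2):
    """Exact log-likelihood of a zero-mean AR(1), assuming |phi| < 1."""
    T = len(y)
    # stationary start: Y_1 ~ N(0, sigma2 / (1 - phi^2))
    ll = -0.5 * np.log(2 * np.pi * sigma2 / (1 - phi**2))
    ll -= (1 - phi**2) * y[0]**2 / (2 * sigma2)
    # conditional terms: Y_t | Y_{t-1} ~ N(phi * y_{t-1}, sigma2)
    resid = y[1:] - phi * y[:-1]
    ll -= 0.5 * (T - 1) * np.log(2 * np.pi * sigma2)
    ll -= np.sum(resid**2) / (2 * sigma2)
    return ll

rng = np.random.default_rng(3)
T, phi_true, sigma2 = 1000, 0.6, 1.0
y = np.zeros(T)
eps = rng.standard_normal(T)
for t in range(1, T):
    y[t] = phi_true * y[t - 1] + eps[t]

phis = np.linspace(-0.95, 0.95, 381)
phi_hat = phis[np.argmax([ar1_exact_loglik(y, p, sigma2) for p in phis])]
```

In practice one would maximize jointly over $(\phi, \sigma_n^2)$ with a numerical optimizer; the grid keeps the sketch self-contained.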
Heteroscedasticity
Consider the heteroscedastic AR(p) model $y_t = \mu + \sum_{i=1}^{p}\phi_i y_{t-i} + \epsilon_t$, $\epsilon_t \sim N(0, \sigma_{n,t}^2)$.
The ARCH test: Engle's ARCH test is constructed from the property that if the residuals are heteroscedastic, the squared residuals are autocorrelated. The Ljung–Box test is then applied to the squared residuals.
The estimation procedure for heteroscedastic models consists of two steps:
1. estimation of the errors by maximum likelihood, treating the errors as independent; and
2. estimation of the model parameters under a more general maximum likelihood estimation which treats the errors as time-dependent.
The conditional likelihood is $L(y \mid X; \phi, \sigma_n^2) = \prod_{t=1}^{T} f_{Y_t \mid X_t}(y_t \mid x_t; \phi, \sigma_n^2) = (2\pi)^{-T/2}\det(D)^{-1/2}\exp\left\{-\frac{1}{2}(y - X\phi)^T D^{-1}(y - X\phi)\right\}$,
where $D_{tt} = \sigma_{n,t}^2$ is the diagonal covariance matrix and $X \in \mathbb{R}^{T \times p}$ is the data matrix with rows $[X]_t = x_t^T$.
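With diagonal $D$, maximizing this conditional likelihood over $\phi$ reduces to weighted least squares, minimizing $\sum_t (y_t - \phi^T x_t)^2/\sigma_{n,t}^2$. A sketch, with `wls` as an illustrative helper and a noiseless check so recovery is exact:

```python
import numpy as np

def wls(X, y, sigma2_t):
    """Solve the weighted normal equations (X^T D^{-1} X) phi = X^T D^{-1} y."""
    w = 1.0 / np.asarray(sigma2_t)
    Xw = X * w[:, None]  # each row of X scaled by its weight 1/sigma_t^2
    return np.linalg.solve(X.T @ Xw, Xw.T @ y)

rng = np.random.default_rng(5)
X = rng.standard_normal((200, 2))
phi_true = np.array([1.0, -0.5])
sigma2_t = np.linspace(0.5, 2.0, 200)  # time-varying error variances
y = X @ phi_true                       # no noise, so wls recovers phi exactly
phi_hat = wls(X, y, sigma2_t)
```

Observations with larger $\sigma_{n,t}^2$ receive smaller weight, which is exactly the effect of $D^{-1}$ in the exponent.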
Moving Average Processes
The Wold representation theorem (a.k.a. the Wold decomposition) states that every covariance stationary time series can be written as the sum of two time series, one deterministic and one stochastic.
The deterministic component is modeled as an AR(p) process; the stochastic component is a "moving average process," or MA(q) process.
MA(q) Process
The $q$th order moving average process is a linear combination of the white noise terms $\{\epsilon_{t-i}\}_{i=0}^{q}$, $\forall t$: $y_t = \mu + \sum_{i=1}^{q}\theta_i \epsilon_{t-i} + \epsilon_t$.
An AR(1) process can be rewritten as an MA($\infty$) process.
Suppose that the AR(1) process has drift $\mu$ and noise variance $\sigma_n^2$; then by a binomial expansion of the operator $(1-\phi L)^{-1}$ we have $y_t = \frac{\mu}{1-\phi} + \sum_{j=0}^{\infty}\phi^j \epsilon_{t-j}$,
where the moments can be easily found: $\mathbb{E}[y_t] = \frac{\mu}{1-\phi}$ and $\mathbb{V}[y_t] = \sum_{j=0}^{\infty}\phi^{2j}\mathbb{E}[\epsilon_{t-j}^2] = \sigma_n^2\sum_{j=0}^{\infty}\phi^{2j} = \frac{\sigma_n^2}{1-\phi^2}$.
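These stationary moments can be verified by simulation. A sketch, assuming $\mu = 0.5$, $\phi = 0.8$, $\sigma_n = 1$ (illustrative choices), that simulates a long AR(1) path and compares sample moments with $\mu/(1-\phi)$ and $\sigma_n^2/(1-\phi^2)$:

```python
import numpy as np

rng = np.random.default_rng(42)
mu, phi, sigma = 0.5, 0.8, 1.0
T = 200_000
eps = sigma * rng.standard_normal(T)
y = np.zeros(T)
for t in range(1, T):
    y[t] = mu + phi * y[t - 1] + eps[t]
y = y[1000:]  # drop burn-in so the path is near its stationary distribution

mean_theory = mu / (1 - phi)           # = 2.5
var_theory = sigma**2 / (1 - phi**2)   # = 2.777...
```

The sample mean and variance of the simulated path should sit close to the theoretical values, with the discrepancy shrinking as $T$ grows.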
GARCH
The Generalized Autoregressive Conditional Heteroscedastic (GARCH) model is a parametric, linear, heteroscedastic model.
The GARCH(p,q) model specifies that the conditional variance (i.e., volatility) is given by an ARMA(p,q)-type recursion, with $p$ lagged conditional variances and $q$ lagged squared noise terms: $\sigma_t^2 = \alpha_0 + \sum_{i=1}^{q}\alpha_i \epsilon_{t-i}^2 + \sum_{i=1}^{p}\beta_i \sigma_{t-i}^2$.
A necessary condition for model stationarity is the following constraint: $\left(\sum_{i=1}^{q}\alpha_i + \sum_{i=1}^{p}\beta_i\right) < 1$.
When the model is stationary, the long-run volatility converges to the unconditional variance of $\epsilon_t$: $\sigma^2 := \mathrm{var}(\epsilon_t) = \frac{\alpha_0}{1 - \left(\sum_{i=1}^{q}\alpha_i + \sum_{i=1}^{p}\beta_i\right)}$.
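A minimal simulation of a GARCH(1,1), with illustrative parameter values chosen so that $\alpha_1 + \beta_1 < 1$, lets us check that the sample variance of $\epsilon_t$ matches the unconditional variance $\alpha_0/(1 - \alpha_1 - \beta_1)$:

```python
import numpy as np

# GARCH(1,1): sigma2_t = a0 + a1*eps_{t-1}^2 + b1*sigma2_{t-1},
# with eps_t = sigma_t * z_t and z_t ~ N(0, 1).
rng = np.random.default_rng(7)
a0, a1, b1 = 0.1, 0.1, 0.8   # a1 + b1 = 0.9 < 1, so the model is stationary
T = 100_000
eps = np.zeros(T)
sigma2 = np.zeros(T)
sigma2[0] = a0 / (1 - a1 - b1)  # start at the unconditional variance
for t in range(1, T):
    sigma2[t] = a0 + a1 * eps[t - 1]**2 + b1 * sigma2[t - 1]
    eps[t] = np.sqrt(sigma2[t]) * rng.standard_normal()

long_run = a0 / (1 - a1 - b1)  # unconditional variance, here 1.0
```

Even though the conditional variance clusters (high-volatility periods follow high-volatility periods), the unconditional variance of the simulated shocks converges to the long-run level.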
Exponential Smoothing
In this setting, let $\alpha \in (0,1)$ denote the smoothing factor (or smoothing coefficient), $\tilde{y}_{t+1}$ the smoothed prediction, and $y_t - \tilde{y}_t$ the forecast error.
The model is $\tilde{y}_{t+1} = \tilde{y}_t + \alpha(y_t - \tilde{y}_t)$,
or equivalently $\tilde{y}_{t+1} = \alpha y_t + (1-\alpha)\tilde{y}_t$.
Fitting Time Series Models: The Box-Jenkins Approach
The three basic steps of the Box-Jenkins modeling approach:
(I)dentification: determining the order of the model (a.k.a. model selection);
(E)stimation: estimation of model parameters;
(D)iagnostic checking: evaluating the fit of the model.
Stationarity
Before the order of the model can be determined, the time series must be tested for stationarity.
Augmented Dickey–Fuller (ADF) test: a standard statistical test for covariance stationarity, which accounts for a (c)onstant drift and a (t)ime trend.
Attempting to fit a time series model to non-stationary data will result in dubious interpretations of the estimated partial autocorrelation function and poor predictions, and should therefore be avoided.
Transformation to Ensure Stationarity
Any trending time series process is non-stationary.
Detrending methods
differencing
Kalman filters
Markov-switching models
advanced neural networks
Identification
partial correlogram
Information Criterion
We use the Akaike Information Criterion (AIC) to measure the quality of fit: $\mathrm{AIC} = \ln(\hat{\sigma}^2) + \frac{2k}{T}$,
where $\hat{\sigma}^2$ is the residual variance
and $k = p + q + 1$ is the total number of parameters estimated.
The goal is to select the model which minimizes the AIC by first using maximum likelihood estimation and then adding the penalty term.
the AIC favors the best fit with the fewest number of parameters.
It is similar to regularization in machine learning, where the loss function is penalized by a LASSO penalty ($L_1$ norm of the parameters) or a ridge penalty ($L_2$ norm of the parameters).
The AIC is evaluated post hoc, once the maximum likelihood function has been evaluated, whereas in machine learning models the penalized loss function is directly minimized.
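AIC-based order selection can be sketched with OLS fits of candidate AR(p) models; `ar_aic` is an illustrative helper using the formula $\ln(\hat\sigma^2) + 2k/T$ with $k = p + 1$ (AR coefficients plus drift), since no MA terms are fitted here:

```python
import numpy as np

def ar_aic(y, p):
    """Fit an AR(p) with drift by OLS and return AIC = ln(sigma_hat^2) + 2k/T."""
    T = len(y) - p
    X = np.column_stack([np.ones(T)] + [y[p - i: len(y) - i] for i in range(1, p + 1)])
    target = y[p:]
    beta, *_ = np.linalg.lstsq(X, target, rcond=None)
    sigma2 = np.mean((target - X @ beta) ** 2)  # residual variance
    k = p + 1
    return np.log(sigma2) + 2 * k / T

# Simulate an AR(2) and pick the order minimizing AIC among candidates 1..5.
rng = np.random.default_rng(1)
T = 3000
y = np.zeros(T)
eps = rng.standard_normal(T)
for t in range(2, T):
    y[t] = 0.5 * y[t - 1] - 0.3 * y[t - 2] + eps[t]
best_p = min(range(1, 6), key=lambda p: ar_aic(y, p))
```

An underfit AR(1) leaves the lag-2 structure in the residuals and is clearly penalized; the tiny $2k/T$ term then discourages needlessly large orders.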
Model Diagnostics
Once the model is fitted we must assess whether the residuals exhibit autocorrelation, suggesting the model is underfitting.
The residuals of a well-fitted time series model should be white noise.
A short summary of some of the most useful diagnostic tests for time series modeling in finance:

Chi-squared test: Used to determine whether the confusion matrix of a classifier is statistically significant, or merely white noise.

t-test: Used to determine whether the outputs of two separate regression models are statistically different on i.i.d. data.

Diebold–Mariano test: Used to determine whether the outputs of two separate time series models are statistically different.

ARCH test: Engle's ARCH test is constructed from the property that if the residuals are heteroscedastic, the squared residuals are autocorrelated. The Ljung–Box test is then applied to the squared residuals.

Portmanteau test: A general test for whether the error in a time series model is autocorrelated. Example tests include the Ljung–Box and the Box–Pierce tests.
Time Series Cross-Validation
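Because observations are ordered in time, standard k-fold cross-validation leaks future information into the training set. A common remedy is walk-forward (expanding-window) validation, where the training window always precedes the test window. A sketch, with `walk_forward_splits` as an illustrative helper:

```python
def walk_forward_splits(n, initial, horizon=1):
    """Yield (train_indices, test_indices) pairs with an expanding training
    window: train on [0, t), test on [t, t + horizon), then advance t."""
    t = initial
    while t + horizon <= n:
        yield list(range(t)), list(range(t, t + horizon))
        t += horizon

splits = list(walk_forward_splits(6, initial=4, horizon=1))
# -> [([0, 1, 2, 3], [4]), ([0, 1, 2, 3, 4], [5])]
```

Each candidate model is refitted on every training window and scored on the subsequent test window, so the evaluation mimics genuine out-of-sample forecasting.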
Summary
We have covered the following objectives:
Explain and analyze linear autoregressive models;
Understand the classical approaches to identifying, fitting, and diagnosing autoregressive models;
Apply simple heteroscedastic regression techniques to time series data;
Understand how exponential smoothing can be used to predict and filter time series.