L01 Introduction to Machine Learning

Materials are adapted from "Murphy, Kevin P. Probabilistic Machine Learning: An Introduction. MIT Press, 2022." This handout is only for teaching. DO NOT DISTRIBUTE.

What is ML

Supervised Learning

Classification

classification problem

Example: classifying Iris flowers

Iris Flowers Classification

Image Classification

Exploratory data analysis

Learning a classifier

Empirical risk minimization

Uncertainty

[We must avoid] false confidence bred from an ignorance of the probabilistic nature of the world, from a desire to see black and white where we should rightly see gray.

--- Immanuel Kant, as paraphrased by Maria Konnikova

Maximum likelihood estimation

Regression

regression problem

(simple) Linear regression

Polynomial regression

Deep neural networks

Overfitting and generalization

Evaluating ML Algorithm Performance: Errors & Overfitting

learning curve

fitting curve

Evaluating ML Algorithm Performance: Preventing Overfitting in Supervised ML

No free lunch theorem

All models are wrong, but some models are useful.

--- George Box

Unsupervised learning

When we’re learning to see, nobody’s telling us what the right answers are — we just look. Every so often, your mother says “that’s a dog”, but that’s very little information. You’d be lucky if you got a few bits of information — even one bit per second — that way. The brain’s visual system has $10^{14}$ neural connections. And you only live for $10^9$ seconds. So it’s no use learning one bit per second. You need more like $10^5$ bits per second. And there’s only one place you can get that much information: from the input itself.

--- Geoffrey Hinton, 1996

Clustering

Discovering latent “factors of variation”

Reinforcement learning

Markov Decision Process

The MDP is the sequence of random variables $(X_n)$ which describes the stochastic evolution of the system states. Of course the distribution of $(X_n)$ depends on the chosen actions.

A control $\pi$ is a sequence of decision rules $(f_n)$ with $f_n:E\rightarrow A$, where $f_n(x)\in D_n(x)$ determines for each possible state $x\in E$ the next action $f_n(x)$ at time $n$. Such a sequence $\pi=(f_n)$ is called a policy or strategy. Formally, the Markov Decision Problem is given by

V_0(x)=\sup_{\pi}\mathbb{E}_x^{\pi}\left[\sum_{k=0}^{N-1}r_k\left(X_k,f_k(X_k)\right)+g_N(X_N)\right],\quad x\in E.
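
To make the backward-induction solution of this finite-horizon problem concrete, here is a minimal Python sketch on a toy state and action space; the reward $r_n$, terminal value $g_N$, and transition kernel below are made-up placeholders, not part of the handout.

```python
# Illustrative sketch (not from the handout): backward induction for the
# finite-horizon problem V_0 above, on a toy finite state/action space.
import numpy as np

N = 5                       # horizon
states = [0, 1, 2]          # E
actions = [0, 1]            # A (assume D_n(x) = A for all x here)

def reward(n, x, a):        # stand-in for r_n(x, a)
    return float(x == a)

def terminal(x):            # stand-in for g_N(x)
    return 0.0

def transition(x, a):       # stand-in for the transition kernel P(. | x, a)
    p = np.ones(len(states)) / len(states)
    p[x] += a               # bias toward staying put, purely illustrative
    return p / p.sum()

V = np.array([terminal(x) for x in states])     # V_N = g_N
policy = []
for n in reversed(range(N)):
    Q = np.array([[reward(n, x, a) + transition(x, a) @ V for a in actions]
                  for x in states])
    policy.insert(0, Q.argmax(axis=1))          # f_n(x) = argmax_a Q_n(x, a)
    V = Q.max(axis=1)                           # V_n(x) = max_a Q_n(x, a)

print("V_0 =", V, "first decision rule f_0 =", policy[0])
```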

Applications of MDP: Consumption Problem

Suppose there is an investor with given initial capital. At the beginning of each of $N$ periods she can decide how much of the capital she consumes and how much she invests into a risky asset. Both the amount she consumes and the terminal wealth are evaluated by a utility function $U$. The remaining capital is invested into a risky asset, where we assume that the investor is small and thus not able to influence the asset price, and that the asset is liquid. How should she consume and invest in order to maximize the sum of her expected discounted utility?

Applications of MDP: Cash Balance or Inventory Problem

Imagine a company which tries to find the optimal level of cash over a finite number of $N$ periods. We assume that there is a random stochastic change in the cash reserve each period (due to withdrawals or earnings). Since the firm does not earn interest on the cash position, there are holding costs for the cash reserve if it is positive, but also interest costs in case it is negative. The cash reserve can be increased or decreased by the management at each decision epoch, which implies transfer costs. What is the optimal cash balance policy?

Applications of MDP: Mean-Variance Problem

Consider a small investor who acts on a given financial market. Her aim is to choose, among all portfolios which yield at least a certain expected return (benchmark) after $N$ periods, the one with the smallest portfolio variance. What is the optimal investment strategy?

Applications of MDP: Dividend Problem in Risk Theory

Imagine we consider the risk reserve of an insurance company which earns premia on the one hand but has to pay out possible claims on the other hand. At the beginning of each period the insurer can decide whether to pay a dividend. A dividend can only be paid when the risk reserve at that time point is positive. Once the risk reserve becomes negative, we say that the company is ruined and has to stop its business. Which dividend pay-out policy maximizes the expected discounted dividends until ruin?

Applications of MDP: Bandit Problem

Suppose we have two slot machines with unknown success probabilities $\theta_1$ and $\theta_2$. At each stage we have to choose one of the arms. We receive one Euro if the arm wins; otherwise there is no cash flow. How should we play in order to maximize our expected total reward over $N$ trials?

Applications of MDP: Pricing of American Options

In order to find the fair price of an American option and its optimal exercise time, one has to solve an optimal stopping problem. In contrast to a European option the buyer of an American option can choose to exercise any time up to and including the expiration time. Such an optimal stopping problem can be solved in the framework of Markov Decision Processes.

Discussion

Statistical inference vs. Supervised machine learning

Property | Statistical inference | Supervised machine learning
Goal | Causal models with explanatory power | Prediction performance, often with limited explanatory power
Data | The data is generated by a model | The data-generation process is unknown
Framework | Probabilistic | Algorithmic and probabilistic
Expressibility | Typically linear | Non-linear
Model selection | Based on information criteria | Numerical optimization
Scalability | Limited to lower-dimensional data | Scales to high-dimensional input data
Robustness | Prone to over-fitting | Designed for out-of-sample performance
Diagnostics | Extensive | Limited

Financial Econometrics and Machine Learning

ML Algorithm Types

Selecting ML Algorithms

Useful Python Libraries

Math Libraries

Statistical Libraries

ML and Deep Learning

Decision Theory

Bayesian decision theory

Basics

The decision maker, or agent, has a set of possible actions, $\mathcal{A}$, to choose from. Each of these actions has costs and benefits, which will depend on the underlying state of nature $H\in\mathcal{H}$. We can encode this information into a loss function $\ell(h, a)$, which specifies the loss we incur if we take action $a\in\mathcal{A}$ when the state of nature is $h\in\mathcal{H}$.

R(a|\mathbf{x})=\mathbb{E}_{p(h|\mathbf{x})}[\ell(h,a)]=\sum_{h\in\mathcal{H}}\ell(h,a)p(h|\mathbf{x})

\pi^*(\mathbf{x})=\argmin_{a\in\mathcal{A}}\mathbb{E}_{p(h|\mathbf{x})}[\ell(h,a)]

\pi^*(\mathbf{x})=\argmax_{a\in\mathcal{A}}\mathbb{E}_{p(h|\mathbf{x})}[U(h,a)]
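
As a minimal numerical illustration of these formulas, the sketch below assumes a made-up posterior $p(h|\mathbf{x})$ and loss matrix $\ell(h,a)$ and picks the action with the smallest posterior expected loss.

```python
# Minimal sketch: choosing the action that minimizes posterior expected loss
# R(a|x) = sum_h l(h, a) p(h|x). The loss matrix and posterior are made up.
import numpy as np

loss = np.array([[0.0, 10.0],    # l(h=0, a=0), l(h=0, a=1)
                 [1.0,  0.0]])   # l(h=1, a=0), l(h=1, a=1)
post = np.array([0.7, 0.3])      # p(h|x), assumed given

risk = post @ loss               # R(a|x) for each action a
a_star = risk.argmin()           # pi*(x) = argmin_a R(a|x)
print("posterior expected losses:", risk, "optimal action:", a_star)
```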

Classification problems

We use Bayesian decision theory to decide the optimal class label to predict given an observed input $\mathbf{x}\in\mathcal{X}$.

Zero-one loss

Suppose the states of nature correspond to class labels, so $\mathcal{H} = \mathcal{Y} = \{1, \dots, C\}$. Furthermore, suppose the actions also correspond to class labels, so $\mathcal{A} = \mathcal{Y}$. In this setting, a very commonly used loss function is the zero-one loss $\ell_{01}(y^*,\hat{y})$:

$\ell_{01}$ | $\hat{y}=0$ | $\hat{y}=1$
$y^*=0$ | 0 | 1
$y^*=1$ | 1 | 0

R(\hat{y}|\mathbf{x})=p(\hat{y}\neq y^*|\mathbf{x})=1-p(\hat{y}= y^*|\mathbf{x})

\pi(\mathbf{x})=\argmax_{y\in\mathcal{Y}}p(y|\mathbf{x})

It corresponds to the mode of the posterior distribution, also known as the maximum a posteriori (MAP) estimate.

ROC curves

Class confusion matrices

For any fixed threshold $\tau$, we consider the following decision rule:

\hat{y}_\tau(\mathbf{x})=\mathbb{I}(p(y=1|\mathbf{x})\geq1-\tau)

FP_\tau=\sum_{n=1}^N\mathbb{I}(\hat{y}_\tau(\mathbf{x}_n)=1,y_n=0)

TPR_\tau=p(\hat{y}=1|y=1,\tau)=\frac{TP_\tau}{TP_\tau+FN_\tau}

FPR_\tau=p(\hat{y}=1|y=0,\tau)=\frac{FP_\tau}{FP_\tau+TN_\tau}
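
The following sketch shows how these quantities can be computed by sweeping the threshold $\tau$; the labels and predicted probabilities are synthetic, purely for illustration.

```python
# Hedged sketch: sweeping tau in the decision rule above and computing
# (FPR_tau, TPR_tau), i.e. points on the ROC curve, on synthetic data.
import numpy as np

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=1000)                  # true labels y_n
scores = rng.normal(loc=y.astype(float), scale=1.0)
p1 = 1.0 / (1.0 + np.exp(-scores))                 # made-up p(y=1 | x_n)

for tau in [0.1, 0.3, 0.5, 0.7, 0.9]:
    yhat = (p1 >= 1 - tau).astype(int)             # \hat{y}_tau(x_n)
    TP = np.sum((yhat == 1) & (y == 1))
    FP = np.sum((yhat == 1) & (y == 0))
    FN = np.sum((yhat == 0) & (y == 1))
    TN = np.sum((yhat == 0) & (y == 0))
    print(f"tau={tau:.1f}  TPR={TP/(TP+FN):.2f}  FPR={FP/(FP+TN):.2f}")
```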

Summarizing ROC curves as a scalar

Precision-recall curves

Computing precision and recall

\mathcal{P}(\tau)=p(y=1|\hat{y}=1,\tau)=\frac{TP_\tau}{TP_\tau+FP_\tau}

\mathcal{R}(\tau)=p(\hat{y}=1|y=1,\tau)=\frac{TP_\tau}{TP_\tau+FN_\tau}

\mathcal{P}(\tau)=\frac{\sum_n y_n\hat{y}_n}{\sum_n\hat{y}_n}
\mathcal{R}(\tau)=\frac{\sum_n y_n\hat{y}_n}{\sum_n y_n}

Summarizing PR curves as a scalar
F-scores

\frac{1}{F_\beta}=\frac{1}{1+\beta^2}\frac{1}{\mathcal{P}}+\frac{\beta^2}{1+\beta^2}\frac{1}{\mathcal{R}}

F_\beta=(1+\beta^2)\frac{\mathcal{P}\cdot\mathcal{R}}{\beta^2\mathcal{P}+\mathcal{R}}=\frac{(1+\beta^2)TP}{(1+\beta^2)TP+\beta^2 FN+FP}
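
A small sketch of these formulas, computing precision, recall, and $F_\beta$ directly from confusion counts (the counts are arbitrary example numbers):

```python
# Minimal sketch: precision, recall and F_beta from the confusion counts,
# matching the formulas above.
def f_beta(TP, FP, FN, beta=1.0):
    precision = TP / (TP + FP)
    recall = TP / (TP + FN)
    b2 = beta ** 2
    return precision, recall, (1 + b2) * TP / ((1 + b2) * TP + b2 * FN + FP)

print(f_beta(TP=80, FP=20, FN=40))          # F1 score
print(f_beta(TP=80, FP=20, FN=40, beta=2))  # F2 weights recall more heavily
```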

Regression problems

We assume the set of actions and states are both equal to the real line, $\mathcal{A} = \mathcal{H} = \mathbb{R}$.

L2 loss

\ell_2(h,a)=(h-a)^2

R(a|\mathbf{x})=\mathbb{E}[(h-a)^2|\mathbf{x}]=\mathbb{E}[h^2|\mathbf{x}]-2a\mathbb{E}[h|\mathbf{x}]+a^2

\frac{\partial}{\partial a}R(a|\mathbf{x})=-2\mathbb{E}[h|\mathbf{x}]+2a=0\Rightarrow \pi(\mathbf{x})=\mathbb{E}[h|\mathbf{x}]=\int h\,p(h|\mathbf{x})dh

L1 loss

\ell_1(h,a)=|h-a|

\Pr(h<a^*|\mathbf{x})=\Pr(h\geq a^*|\mathbf{x})=0.5

Huber loss

Let $r=h-a$,

\ell_\delta(h,a)=\left\{\begin{array}{ll}r^2/2&\text{ if }|r|\leq\delta\\\delta|r|-\delta^2/2&\text{ if }|r|>\delta\end{array}\right.
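
A minimal sketch of the Huber loss as defined above; it is quadratic for small residuals and linear for large ones, which makes it less sensitive to outliers than the L2 loss.

```python
# Minimal sketch of the Huber loss l_delta(h, a) defined above.
import numpy as np

def huber(h, a, delta=1.0):
    r = h - a
    return np.where(np.abs(r) <= delta,
                    0.5 * r ** 2,                     # quadratic near zero
                    delta * np.abs(r) - 0.5 * delta ** 2)  # linear in the tails

print(huber(np.array([0.1, 0.5, 3.0]), 0.0))   # small residuals vs an outlier
```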

Probabilistic prediction problems

We assume the true “state of nature” is a distribution, $h = p(Y|x)$, the action is another distribution, $a = q(Y|x)$, and we want to pick $q$ to minimize $\mathbb{E}[\ell(p, q)]$ for a given $x$.

KL, cross-entropy and log-loss

A common loss function for comparing two distributions is the Kullback-Leibler divergence, or KL divergence, which is defined as follows:

D_{\mathbb{KL}}(p\|q)=\sum_{y\in\mathcal{Y}}p(y)\log \frac{p(y)}{q(y)}

q^*(Y|x)=\argmin_q\mathbb{H}(p(Y|x),q(Y|x))
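
A minimal sketch computing the KL divergence and cross entropy between two small discrete distributions, following the definitions above (the distributions are made up):

```python
# Minimal sketch: KL divergence and cross entropy for discrete distributions.
import numpy as np

def kl(p, q):
    return np.sum(p * np.log(p / q))        # D_KL(p || q), assumes p, q > 0

def cross_entropy(p, q):
    return -np.sum(p * np.log(q))           # H(p, q)

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])
# KL(p||q) = H(p, q) - H(p); the cross entropy is minimized when q = p.
print(kl(p, q), cross_entropy(p, q), cross_entropy(p, p))
```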

Proper scoring rules

l(p,p)\leq l(p,q)\text{, with equality iff }p=q

l(p,q)=\frac{1}{C}\sum_{c=1}^C(q(y=c|x)-p(y=c|x))^2

Choosing the "right" model

Bayesian hypothesis testing

B_{1,0}=\frac{p(\mathcal{D}|M_1)}{p(\mathcal{D}|M_0)}

Bayes factor $BF(1,0)$ | Interpretation
$BF<1/100$ | Decisive evidence for $M_0$
$BF<1/10$ | Strong evidence for $M_0$
$1/10<BF<1/3$ | Moderate evidence for $M_0$
$1/3<BF<1$ | Weak evidence for $M_0$
$1<BF<3$ | Weak evidence for $M_1$
$3<BF<10$ | Moderate evidence for $M_1$
$BF>10$ | Strong evidence for $M_1$
$BF>100$ | Decisive evidence for $M_1$

Example: Testing if a coin is fair
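
Here is a hedged sketch of this example under standard assumptions: $M_0$ says the coin is fair ($\theta=1/2$), while $M_1$ puts a uniform Beta(1,1) prior on $\theta$, so the marginal likelihood under $M_1$ is the Beta-Bernoulli expression $N_1!\,N_0!/(N+1)!$. The data (9 heads in 10 tosses) are made up.

```python
# Hedged sketch of the coin example: Bayes factor comparing M0 (fair coin,
# theta = 1/2) against M1 (theta ~ Uniform[0,1]).
from math import lgamma, exp, log

def log_marglik_fair(N1, N0):
    return (N1 + N0) * log(0.5)                      # p(D|M0) = (1/2)^N

def log_marglik_uniform(N1, N0):
    # log B(N1+1, N0+1) = lgamma(N1+1) + lgamma(N0+1) - lgamma(N1+N0+2)
    return lgamma(N1 + 1) + lgamma(N0 + 1) - lgamma(N1 + N0 + 2)

N1, N0 = 9, 1   # 9 heads out of 10 tosses (made-up data)
B10 = exp(log_marglik_uniform(N1, N0) - log_marglik_fair(N1, N0))
print("Bayes factor B_{1,0} =", round(B10, 2))       # ~9.3: moderate evidence for M1
```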

Bayesian model selection

\hat{m}=\argmax_{m\in\mathcal{M}}p(m|\mathcal{D})

p(m|\mathcal{D})=\frac{p(\mathcal{D}|m)p(m)}{\sum_{m\in\mathcal{M}}p(\mathcal{D}|m)p(m)}

p(m)=\frac{1}{|\mathcal{M}|}

p(\mathcal{D}|m)=\int p(\mathcal{D}|\theta,m)p(\theta|m)d\theta

Example: polynomial regression

Occam's razor

Connection between cross validation and marginal likelihood

The marginal likelihood is closely related to the leave-one-out cross-validation (LOO-CV) estimate.

p(\mathcal{D}|m)=\prod_{n=1}^Np(y_n|y_{1:n-1},x_{1:N},m)=\prod_{n=1}^Np(y_n|x_n,\mathcal{D}_{1:n-1},m)

where

p(y|x,\mathcal{D}_{1:n-1},m)=\int p(y|x,\theta)p(\theta|\mathcal{D}_{1:n-1},m)d\theta

Suppose we use a plugin approximation to the above distribution to get

p(y|x,\mathcal{D}_{1:n-1},m)\approx\int p(y|x,\theta)\delta(\theta-\hat{\theta}_m(\mathcal{D}_{1:n-1}))d\theta=p(y|x,\hat{\theta}_m(\mathcal{D}_{1:n-1}))

Then we get

\log p(\mathcal{D}|m)\approx\sum_{n=1}^N\log p(y_n|x_n,\hat{\theta}_m(\mathcal{D}_{1:n-1}))

Information criteria

The Bayesian information criterion (BIC)

\log p(\mathcal{D}|m)\approx\log p(\mathcal{D}|\hat{\theta}_{map})+\log p(\hat{\theta}_{map})-\frac{1}{2}\log|\mathbf{H}|

where $\mathbf{H}$ is the Hessian of the negative log joint $-\log p(\mathcal{D},\theta)$ evaluated at the MAP estimate $\hat{\theta}_{map}$.

\log p(\mathcal{D}|m)\approx\log p(\mathcal{D}|\hat{\theta}_{map})-\frac{1}{2}\log|\mathbf{H}|

J_{BIC}(m)=\log p(\mathcal{D}|\hat{\theta}_{map})-\frac{D_m}{2}\log N

\mathcal{L}_{BIC}(m)=-2\log p(\mathcal{D}|\hat{\theta}_{map})+D_m\log N
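
A minimal sketch of the BIC formula applied to polynomial regression fit by maximum likelihood on synthetic data; the Gaussian log-likelihood and the parameter count $D_m$ (weights plus noise variance) are conventional choices, not taken from the handout.

```python
# Minimal sketch: BIC for polynomial regression fit by maximum likelihood,
# using L_BIC(m) = -2 log p(D|theta_hat) + D_m log N, on synthetic data.
import numpy as np

rng = np.random.default_rng(0)
N = 100
x = rng.uniform(-1, 1, N)
y = 1.0 + 2.0 * x + 0.1 * rng.normal(size=N)        # true model is linear

def bic(degree):
    X = np.vander(x, degree + 1)
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ w
    sigma2 = resid @ resid / N                       # MLE of the noise variance
    loglik = -0.5 * N * (np.log(2 * np.pi * sigma2) + 1)
    D_m = degree + 2                                 # weights + noise variance
    return -2 * loglik + D_m * np.log(N)

print({d: round(bic(d), 1) for d in [1, 2, 3, 5]})   # the linear fit should score lowest
```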

Akaike information criterion

\mathcal{L}_{AIC}(m)=-2\log p(\mathcal{D}|\hat{\theta},m)+2D

Minimum description length (MDL)

\mathcal{L}_{MDL}(m)=-\log p(\mathcal{D}|\hat{\theta},m)+C(m)

Frequentist decision theory

Computing the risk of an estimator

We define the frequentist risk of an estimator $\pi$ given an unknown state of nature $\theta$ to be the expected loss when applying that estimator to data $\mathbf{x}$ sampled from the likelihood function $p(\mathbf{x}|\theta)$:

R(\theta,\pi)=\mathbb{E}_{p(\mathbf{x}|\theta)}[l(\theta,\pi(\mathbf{x}))]

Example: estimate a Gaussian mean

\pi_\kappa(\mathcal{D})=\frac{N}{N+\kappa}\bar{x}+\frac{\kappa}{N+\kappa}\theta_0=w\bar{x}+(1-w)\theta_0
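
The frequentist risk of this shrinkage estimator can be checked by simulation; the sketch below uses made-up values of $N$, $\kappa$, and $\theta_0$ and estimates the mean squared error at a few true values of $\theta$.

```python
# Hedged sketch: Monte Carlo estimate of the frequentist risk (MSE) of the
# shrinkage estimator pi_kappa above for a Gaussian mean.
import numpy as np

rng = np.random.default_rng(0)
N, kappa, theta0 = 5, 3.0, 0.0
w = N / (N + kappa)

for theta in [0.0, 1.0, 2.0]:
    x = rng.normal(theta, 1.0, size=(20000, N))      # many repeated datasets
    est = w * x.mean(axis=1) + (1 - w) * theta0      # pi_kappa(D) on each dataset
    print(f"theta={theta}: estimated risk = {np.mean((est - theta) ** 2):.3f}")
```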

Bayes risk
Maximum risk

Empirical risk minimization

Empirical risk

R(f,p^*)=R(f)=\mathbb{E}_{p^*(x)p^*(y|x)}[l(y,f(x))]

Approximation error vs estimation error
Regularized risk

Structural risk

Cross-validation

Information Theory

Entropy

The entropy of a probability distribution can be interpreted as a measure of uncertainty, or lack of predictability, associated with a random variable drawn from a given distribution. We can also use entropy to define the information content of a data source.

Entropy of $p$ | Hard/easy to predict $X_n$ | Information content of $\mathcal{D}$
high | hard | high
low | easy | low

Entropy for discrete random variables

The entropy of a discrete random variable $X$ with distribution $p$ over $K$ states is defined by

\mathbb{H}(X)=-\sum^K_{k=1}p(X=k)\log_2p(X=k)=-\mathbb{E}_X[\log p(X)]

Cross entropy

The cross entropy between distributions $p$ and $q$ is defined by

\mathbb{H}(p,q)=-\sum^K_{k=1}p_k\log q_k

The cross entropy is the expected number of bits needed to compress some data samples drawn from distribution $p$ using a code based on distribution $q$. This can be minimized by setting $q = p$, in which case the expected number of bits of the optimal code is $\mathbb{H}(p, p) = \mathbb{H}(p)$; this is known as Shannon’s source coding theorem.

Joint entropy

The joint entropy of two random variables $X$ and $Y$ is defined as

\mathbb{H}(X,Y)=-\sum_{x,y}p(x,y)\log_2 p(x,y)

For example, consider choosing an integer from 1 to 8, $n\in\{1, \dots, 8\}$. Let $X(n) = 1$ if $n$ is even, and $Y(n) = 1$ if $n$ is prime:

$n$ | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8
$X$ | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1
$Y$ | 0 | 1 | 1 | 0 | 1 | 0 | 1 | 0

The joint distribution is

$p(X,Y)$ | $Y=0$ | $Y=1$
$X=0$ | 1/8 | 3/8
$X=1$ | 3/8 | 1/8

so the joint entropy is given by

\mathbb{H}(X,Y)=-\left[\frac{1}{8}\log_2\frac{1}{8}+\frac{3}{8}\log_2\frac{3}{8}+\frac{3}{8}\log_2\frac{3}{8}+\frac{1}{8}\log_2\frac{1}{8}\right]=1.81\text{ bits}

Consider the marginal entropies:

\mathbb{H}(X)=\mathbb{H}(Y)=-\left[\frac{1}{2}\log_2\frac{1}{2}+\frac{1}{2}\log_2\frac{1}{2}\right]=1

We observe that

1.81=\mathbb{H}(X,Y)<\mathbb{H}(X)+\mathbb{H}(Y)=2

In general, the above inequality is always valid. If $X$ and $Y$ are independent, then $\mathbb{H}(X, Y) = \mathbb{H}(X) + \mathbb{H}(Y)$, so the bound is tight. This makes intuitive sense: when the parts are correlated in some way, it reduces the “degrees of freedom” of the system, and hence reduces the overall entropy.

Another observation is that

\mathbb{H}(X,Y)\geq\max\left\{\mathbb{H}(X),\mathbb{H}(Y)\right\}

This says that combining variables together does not make the entropy go down: you cannot reduce uncertainty merely by adding more unknowns to the problem; you need to observe some data.
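
The worked example above can be checked numerically; the short sketch below recomputes $\mathbb{H}(X,Y)$, $\mathbb{H}(X)$, and $\mathbb{H}(Y)$ from the joint table.

```python
# Minimal check of the worked example: joint and marginal entropies (in bits)
# for X = "n is even", Y = "n is prime", n uniform on {1,...,8}.
import numpy as np

joint = np.array([[1/8, 3/8],    # p(X=0, Y=0), p(X=0, Y=1)
                  [3/8, 1/8]])   # p(X=1, Y=0), p(X=1, Y=1)

def H(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# ~1.81 bits for H(X,Y), 1 bit each for H(X) and H(Y)
print(H(joint.ravel()), H(joint.sum(axis=1)), H(joint.sum(axis=0)))
```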

Conditional entropy

The conditional entropy of YY given XX is the uncertainty we have in YY after seeing XX, averaged over possible values for XX:

\mathbb{H}(Y|X)=\mathbb{E}_{p(X)}\left[\mathbb{H}\left(p\left(Y|X\right)\right)\right]=\mathbb{H}(X,Y)-\mathbb{H}(X)

It is straightforward to verify this identity from the definitions of joint and marginal entropy.

Perplexity

The perplexity of a discrete probability distribution $p$ is defined as

\text{perplexity}(p)=2^{\mathbb{H}(p)}

It is often interpreted as a measure of predictability. Suppose we have an empirical distribution based on data $\mathcal{D}$:

p_{\mathcal{D}}(x|\mathcal{D})=\frac{1}{N}\sum_{n=1}^N\delta_{x_n}(x)

We can measure how well $p$ predicts $\mathcal{D}$ by computing

\text{perplexity}(p_{\mathcal{D}},p)=2^{\mathbb{H}(p_{\mathcal{D}},p)}
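
A minimal sketch of this computation: the empirical distribution of some made-up symbols, the perplexity of the empirical distribution itself, and the perplexity of a uniform model.

```python
# Minimal sketch: perplexity of a model p on an empirical distribution p_D,
# computed as 2^{H(p_D, p)}. The data and candidate models are made up.
import numpy as np

data = np.array([0, 0, 1, 2, 0, 1, 0, 2])          # observed symbols
K = 3
p_D = np.bincount(data, minlength=K) / len(data)   # empirical distribution

def perplexity(p_D, p):
    cross_ent = -np.sum(p_D * np.log2(p))          # H(p_D, p)
    return 2 ** cross_ent

print(perplexity(p_D, p_D))                        # best case: 2^{H(p_D)}
print(perplexity(p_D, np.ones(K) / K))             # uniform model gives K = 3
```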

Differential entropy for continuous random variables *

If $X$ is a continuous random variable with pdf $p(x)$, we define the differential entropy as

h(X)=-\int_{\mathcal{X}}p(x)\log p(x)\,dx

Differential entropy can be negative since pdf’s can be bigger than 1.

Example: Entropy of a Gaussian

The entropy of a d-dimensional Gaussian is

h(\mathcal{N}(\mathbf{\mu},\mathbf{\Sigma}))=\frac{1}{2}\ln |2\pi e\mathbf{\Sigma}|=\frac{1}{2}\ln[(2\pi e)^d|\mathbf{\Sigma}|]=\frac{d}{2}+\frac{d}{2}\ln(2\pi)+\frac{1}{2}\ln|\mathbf{\Sigma}|

In the 1d case, this becomes

h(\mathcal{N}(\mu,\sigma^2))=\frac{1}{2}\ln[2\pi e\sigma^2]

Linear Algebra

Matrix calculus

Derivatives

f'(x)=\lim_{h\rightarrow0}\frac{f(x+h)-f(x)}{h}

Gradients

Directional derivative

The directional derivative measures how much the function $f:\mathbb{R}^n\rightarrow\mathbb{R}$ changes along a direction $\mathbf{v}$ in space. It is defined as follows:

\mathcal{D}_{\mathbf{v}}f(\mathbf{x})=\lim_{h\rightarrow0}\frac{f(\mathbf{x}+h\mathbf{v})-f(\mathbf{x})}{h}

Note that the directional derivative along $\mathbf{v}$ is the scalar product of the gradient $\nabla f(\mathbf{x})$ and the vector $\mathbf{v}$:

\mathcal{D}_{\mathbf{v}}f(\mathbf{x})=\nabla f(\mathbf{x})\cdot\mathbf{v}

Total derivative

Suppose that some of the arguments to the function depend on each other. Concretely, suppose the function has the form $f(t, x(t), y(t))$. We define the total derivative of $f$ with respect to $t$ as follows:

\dfrac{df}{dt}=\dfrac{\partial f}{\partial t}+\dfrac{\partial f}{\partial x}\dfrac{dx}{dt}+\dfrac{\partial f}{\partial y}\dfrac{dy}{dt}

If we multiply both sides by the differential dtdt, we get the total differential

df=\dfrac{\partial f}{\partial t}dt+\dfrac{\partial f}{\partial x}dx+\dfrac{\partial f}{\partial y}dy

This measures how much $f$ changes when we change $t$, both via the direct effect of $t$ on $f$, and indirectly, via the effects of $t$ on $x$ and $y$.

Jacobian

Consider a function that maps a vector to another vector, $\mathbf{f}:\mathbb{R}^n\rightarrow\mathbb{R}^m$. The Jacobian matrix of this function is an $m \times n$ matrix of partial derivatives:

\mathbf{J_f}(\mathbf{x})=\dfrac{\partial \mathbf{f}}{\partial \mathbf{x}^T}=\left(\begin{array}{ccc}\frac{\partial f_1}{\partial x_1} & \cdots & \frac{\partial f_1}{\partial x_n}\\\vdots&\ddots&\vdots\\\frac{\partial f_m}{\partial x_1} & \cdots &\frac{\partial f_m}{\partial x_n}\end{array}\right)=\left(\begin{array}{c}\nabla f_1(\mathbf{x})^T\\\vdots\\\nabla f_m(\mathbf{x})^T\end{array}\right)

Multiplying Jacobians and vectors

The Jacobian vector product or JVP is defined to be the operation that corresponds to right-multiplying the Jacobian matrix $\mathbf{J}\in\mathbb{R}^{m\times n}$ by a vector $\mathbf{v}\in\mathbb{R}^{n}$:

\mathbf{J_f}(\mathbf{x})\mathbf{v}=\left(\begin{array}{c}\nabla f_1(\mathbf{x})^T\\\vdots\\\nabla f_m(\mathbf{x})^T\end{array}\right)\mathbf{v}=\left(\begin{array}{c}\nabla f_1(\mathbf{x})^T\mathbf{v}\\\vdots\\\nabla f_m(\mathbf{x})^T\mathbf{v}\end{array}\right)

So we can approximate this numerically using just two calls to $\mathbf{f}$.
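
One way to see this is a finite-difference approximation; the sketch below uses a central difference (two evaluations of $\mathbf{f}$) on a toy function, which is an assumption about the numerical scheme rather than something specified in the handout.

```python
# Hedged sketch: approximating the JVP J_f(x) v with two calls to f via a
# central finite difference, for a toy vector-valued f.
import numpy as np

def f(x):                                   # example f: R^3 -> R^2
    return np.array([x[0] * x[1], np.sin(x[2])])

def jvp(f, x, v, eps=1e-6):
    return (f(x + eps * v) - f(x - eps * v)) / (2 * eps)

x = np.array([1.0, 2.0, 0.5])
v = np.array([1.0, 0.0, 1.0])
print(jvp(f, x, v))     # ~ [x1*v0 + x0*v1, cos(x2)*v2] = [2.0, cos(0.5)]
```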

The vector Jacobian product or VJP is defined to be the operation that corresponds to left-multiplying the Jacobian matrix $\mathbf{J}\in\mathbb{R}^{m\times n}$ by a vector $\mathbf{u}\in\mathbb{R}^{m}$:

\mathbf{u}^T\mathbf{J_f}(\mathbf{x})=\mathbf{u}^T\left(\frac{\partial \mathbf{f}}{\partial x_1},\cdots,\frac{\partial \mathbf{f}}{\partial x_n}\right)=\left(\mathbf{u}\cdot\frac{\partial \mathbf{f}}{\partial x_1},\cdots,\mathbf{u}\cdot\frac{\partial \mathbf{f}}{\partial x_n}\right)

Jacobian of a composition

Let $h(\mathbf{x}) = g(f(\mathbf{x}))$. By the chain rule of calculus, we have

\mathbf{J_h}(\mathbf{x})=\mathbf{J_g}(f(\mathbf{x}))\mathbf{J_f}(\mathbf{x})

Hessian

For a function $f:\mathbb{R}^n\rightarrow\mathbb{R}$ that is twice differentiable, we define the Hessian matrix as the (symmetric) $n \times n$ matrix of second partial derivatives:

\mathbf{H_f}(\mathbf{x})=\dfrac{\partial^2 f}{\partial \mathbf{x}^2}=\nabla^2f=\left(\begin{array}{ccc}\frac{\partial^2 f}{\partial x_1^2} & \cdots & \frac{\partial^2 f}{\partial x_1\partial x_n}\\\vdots&\ddots&\vdots\\\frac{\partial^2 f}{\partial x_n\partial x_1} & \cdots &\frac{\partial^2 f}{\partial x_n^2}\end{array}\right)

The Hessian is the Jacobian of the gradient.

Gradients of commonly used functions