A computer program is said to learn from experience E with respect to some class of tasks T, and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.
--Tom Mitchell
The probabilistic approach
treat all unknown quantities as random variables
it is the optimal approach to decision making under uncertainty
Almost all of machine learning can be viewed in probabilistic terms, making probabilistic thinking fundamental. It is, of course, not the only view. But it is through this view that we can connect what we do in machine learning to every other computational science, whether that be in stochastic optimisation, control theory, operations research, econometrics, information theory, statistical physics or bio-statistics. For this reason alone, mastery of probabilistic thinking is essential.
---Shakir Mohamed, DeepMind
Machine Learning vs. Statistical Approaches
Statistical approaches rely on foundational assumptions and explicit models of structure, such as observed samples that are assumed to be drawn from a specified underlying probability distribution.
Machine learning seeks to extract knowledge from large amounts of data with no such restrictions -- “find the pattern, apply the pattern.”
Supervised Learning vs. Unsupervised Learning
Supervised learning involves ML algorithms that infer patterns between a set of inputs (the X’s) and the desired output (Y) with a labeled data set.
Unsupervised learning is machine learning that does not make use of labeled data. In unsupervised learning, inputs (X’s) are used for analysis without any target (Y) being supplied. The algorithm seeks to discover structure within the data themselves. Two important types of problems in unsupervised learning are dimension reduction and clustering.
Deep Learning and Reinforcement Learning
In deep learning, sophisticated algorithms address highly complex tasks, such as image classification, face recognition, speech recognition, and natural language processing.
In reinforcement learning, a computer learns from interacting with itself (or data generated by the same algorithm).
Neural networks (NNs, also called artificial neural networks, or ANNs) include highly flexible ML algorithms that have been successfully applied to a variety of tasks characterized by non-linearities and interactions among features.
Besides being commonly used for classification and regression, neural networks are also the foundation for deep learning and reinforcement learning, which can be either supervised or unsupervised.
Supervised Learning
Definition
The task T is to learn a mapping f from inputs x∈X to outputs y∈Y
The inputs x are also called the features, covariates, or predictors
The outputs y are also called the label, target, or response
The experience E is the training set $\mathcal{D} = \{(x_n, y_n)\}_{n=1}^{N}$
Classification
classification problem
the output space is a set of C unordered and mutually exclusive labels known as classes, Y={1,2,...,C}.
The problem is also called pattern recognition.
binary classification: just two classes, often denoted by
y∈{0,1}
y∈{−1,+1}
Example: classifying Iris flowers
Iris Flowers Classification
Image Classification
Exploratory data analysis
exploratory data analysis: see if there are any obvious patterns
tabular data with a small number of features: pair plot
higher-dimensional data: dimension reduction first and then to visualize the data in 2d or 3d
Learning a classifier
decision rule via a 1-dimensional (1d) decision boundary: $f(x;\theta) = \begin{cases} \text{Setosa} & \text{if petal length} < 2.45 \\ \text{Versicolor or Virginica} & \text{otherwise} \end{cases}$
decision tree: a more sophisticated decision rule involving a 2d decision surface
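A minimal sketch of such a decision-tree classifier on the Iris petal features, assuming scikit-learn is available (the library and this usage are not part of the original slides):

```python
# Minimal sketch (assumes scikit-learn): fit a depth-2 decision tree to the
# Iris petal features, mirroring the 1d threshold rule above.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
X = iris.data[:, 2:4]   # petal length, petal width
y = iris.target         # 0 = Setosa, 1 = Versicolor, 2 = Virginica

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print(export_text(tree, feature_names=["petal length", "petal width"]))
print("training accuracy:", tree.score(X, y))
```

The first split learned by the tree typically reproduces the petal-length threshold shown above.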
Empirical risk minimization
misclassification rate on the training set: $\mathcal{L}(\theta) = \frac{1}{N}\sum_{n=1}^{N} \mathbb{I}(y_n \neq f(x_n;\theta))$
loss function: $\ell(y, \hat{y})$
empirical risk: the average loss of the predictor on the training set, $\mathcal{L}(\theta) = \frac{1}{N}\sum_{n=1}^{N} \ell(y_n, f(x_n;\theta))$
model fitting / training via empirical risk minimization: $\hat{\theta} = \operatorname{argmin}_{\theta} \mathcal{L}(\theta) = \operatorname{argmin}_{\theta} \frac{1}{N}\sum_{n=1}^{N} \ell(y_n, f(x_n;\theta))$
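A minimal numpy sketch of empirical risk minimization with the 0-1 loss: sweep a hypothetical 1d petal-length threshold and keep the value with the lowest misclassification rate on a made-up training set.

```python
import numpy as np

petal_length = np.array([1.4, 1.7, 4.5, 5.1, 1.3, 4.9])  # hypothetical inputs x_n
is_setosa    = np.array([1,   1,   0,   0,   1,   0])     # labels y_n

def empirical_risk(threshold):
    y_hat = (petal_length < threshold).astype(int)   # f(x_n; theta)
    return np.mean(y_hat != is_setosa)                # average 0-1 loss

thresholds = np.linspace(1.0, 6.0, 101)
risks = [empirical_risk(t) for t in thresholds]
best = thresholds[int(np.argmin(risks))]
print("theta_hat =", best, ", training risk =", min(risks))
```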
Uncertainty
We must avoid false confidence bred from an ignorance of the probabilistic nature of the world, from a desire to see black and white where we should rightly see gray.
--- Immanuel Kant, as paraphrased by Maria Konnikova
Two types of uncertainties
epistemic uncertainty or model uncertainty: due to
lack of knowledge of the input-output mapping
aleatoric uncertainty or data uncertainty: due to intrinsic (irreducible) stochasticity in the mapping
We can capture our uncertainty using the following conditional probability distribution: $p(y = c \mid x; \theta) = f_c(x;\theta)$
Maximum likelihood estimation
likelihood function: p(y∣f(x;θ))
log loss (negative log likelihood of a single example): $\ell(y, f(x;\theta)) = -\log p(y \mid f(x;\theta))$
Negative Log Likelihood: the average negative log probability of the training set, $\mathrm{NLL}(\theta) = -\frac{1}{N}\sum_{n=1}^{N} \log p(y_n \mid f(x_n;\theta))$
the maximum likelihood estimate (MLE): $\hat{\theta}_{\mathrm{mle}} = \operatorname{argmin}_{\theta} \mathrm{NLL}(\theta)$
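A minimal numpy sketch of the NLL and the MLE for a Bernoulli model p(y=1) = θ on made-up binary labels; the grid-search minimizer matches the closed-form MLE (the empirical frequency).

```python
import numpy as np

y = np.array([1, 0, 1, 1, 0, 1, 1, 1])          # hypothetical training labels

def nll(theta):
    p = np.where(y == 1, theta, 1.0 - theta)    # p(y_n | theta)
    return -np.mean(np.log(p))                  # average negative log likelihood

thetas = np.linspace(0.01, 0.99, 99)
theta_grid = thetas[int(np.argmin([nll(t) for t in thetas]))]
theta_mle = y.mean()                             # closed-form MLE
print(theta_grid, theta_mle, nll(theta_mle))
```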
Regression
regression problem
the output space: y∈Y=R.
loss function: quadratic loss, or $\ell_2$ loss: $\ell_2(y, \hat{y}) = (y - \hat{y})^2$
mean squared error or MSE: $\mathrm{MSE}(\theta) = \frac{1}{N}\sum_{n=1}^{N} (y_n - f(x_n;\theta))^2$
An Example
Uncertainty: Gaussian / Normal distribution $\mathcal{N}(y \mid \mu, \sigma^2) \triangleq \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{1}{2\sigma^2}(y-\mu)^2}$
feature preprocessing, or feature engineering: $\phi(x) = [1, x, x^2, \ldots, x^D]$
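A minimal numpy sketch of the polynomial feature map φ(x) = [1, x, ..., x^D] followed by least-squares fitting (i.e. minimizing the MSE) on synthetic data; all numbers are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 30)
y = np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(x.size)  # noisy targets

D = 3
Phi = np.vander(x, D + 1, increasing=True)     # columns 1, x, x^2, ..., x^D
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)    # minimize the MSE
print("weights:", w)
print("training MSE:", np.mean((Phi @ w - y) ** 2))
```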
Deep neural networks
deep neural networks (DNN): a stack of L nested functions: $f(x;\theta) = f_L(f_{L-1}(\cdots f_1(x) \cdots))$
the function at layer ℓ: $f_\ell(x;\theta_\ell)$
the final layer: $f_L(x) = w^\top f_{1:L-1}(x)$
the learned feature extractor: $f_{1:L-1}(x)$
Overfitting and generalization
Underfitting means the model does not capture the relationships in the data.
Overfitting means the model begins to incorporate noise coming from quirks or spurious correlations
it mistakes randomness for patterns and relationships
memorized the data, rather than learned from it
Overfitting results from high noise levels in the data combined with too much complexity in the model.
complexity refers to the number of features, terms, or branches in the model and to whether the model is linear or non-linear (non-linear is more complex).
test risk: $\mathcal{L}(\theta; \mathcal{D}_{\text{test}}) = \frac{1}{|\mathcal{D}_{\text{test}}|}\sum_{(x,y)\in\mathcal{D}_{\text{test}}} \ell(y, f(x;\theta))$
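A minimal numpy sketch contrasting training and test risk for polynomial models of increasing degree on synthetic data; the training risk keeps falling with complexity while the test risk eventually rises again (overfitting).

```python
import numpy as np

rng = np.random.default_rng(1)

def make_data(n):
    x = rng.uniform(0, 1, n)
    return x, np.sin(2 * np.pi * x) + 0.3 * rng.standard_normal(n)

x_tr, y_tr = make_data(20)     # small training set
x_te, y_te = make_data(200)    # held-out test set

for degree in [1, 3, 9]:
    w = np.polyfit(x_tr, y_tr, degree)
    train_risk = np.mean((np.polyval(w, x_tr) - y_tr) ** 2)
    test_risk = np.mean((np.polyval(w, x_te) - y_te) ** 2)
    print(f"degree {degree}: train risk {train_risk:.3f}, test risk {test_risk:.3f}")
```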
Evaluating ML Algorithm Performance Errors & Overfitting
Data scientists decompose the total out-of-sample error into three sources:
Bias error, or the degree to which a model fits the training data. Algorithms with erroneous assumptions produce high bias with poor approximation, causing underfitting and high in-sample error.
Variance error, or how much the model's results change in response to new data from validation and test samples. Unstable models pick up noise and produce high variance, causing overfitting
Base error due to randomness in the data.
learning curve
A learning curve plots the accuracy rate (= 1 – error rate) in the validation or test samples (i.e., out-of-sample) against the amount of data in the training sample
fitting curve
A fitting curve, which shows in-sample and out-of-sample error rates (Ein and Eout) on the y-axis plotted against model complexity on the x-axis
Evaluating ML Algorithm Performance: Preventing Overfitting in Supervised ML
Two common guiding principles and two methods are used to reduce overfitting:
preventing the algorithm from getting too complex during selection and training (regularization)
proper data sampling achieved by using cross-validation
K-fold cross-validation
data (excluding test sample and fresh data) are shuffled randomly and then are divided into k equal sub-samples
k−1 samples used as training samples and one sample used as a validation sample
k is typically set at 5 or 10
This process is repeated k times. The average of the k validation errors (mean Eval) is taken as a reasonable estimate of the model's out-of-sample error (Eout)
Leave-one-out cross-validation: k=N
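A minimal numpy sketch of k-fold cross-validation (k = 5) for polynomial regression, following the procedure above on synthetic data.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 60)
y = np.sin(2 * np.pi * x) + 0.3 * rng.standard_normal(x.size)

def kfold_mse(degree, k=5):
    idx = rng.permutation(x.size)          # shuffle, then split into k folds
    folds = np.array_split(idx, k)
    errors = []
    for i in range(k):
        valid = folds[i]                   # one fold held out for validation
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        w = np.polyfit(x[train], y[train], degree)
        errors.append(np.mean((np.polyval(w, x[valid]) - y[valid]) ** 2))
    return np.mean(errors)                 # mean E_val, estimate of E_out

for degree in [1, 3, 9]:
    print(degree, kfold_mse(degree))
```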
No free lunch theorem
All models are wrong, but some models are useful.
--- George Box
No free lunch theorem: There is no single best model that works optimally for all kinds of problems
pick a suitable model
based on domain knowledge
trial and error
cross-validation
Bayesian model selection techniques
Unsupervised learning
unsupervised learning: “inputs” D={xn:n=1:N} without any corresponding “outputs” yn.
the task: fitting an unconditional model of the form p(x)
When we’re learning to see, nobody’s telling us what the right answers are — we just look. Every so often, your mother says “that’s a dog”, but that’s very little information. You’d be lucky if you got a few bits of information — even one bit per second — that way. The brain’s visual system has 10^14 neural connections. And you only live for 10^9 seconds. So it’s no use learning one bit per second. You need more like 10^5 bits per second. And there’s only one place you can get that much information: from the input itself.
--- Geoffrey Hinton, 1996
Clustering
finding clusters in data: partition the input into regions that contain “similar” points.
Discovering latent “factors of variation”
Assume that each observed high-dimensional output $x_n \in \mathbb{R}^D$ was generated by a set of hidden or unobserved low-dimensional latent factors $z_n \in \mathbb{R}^K$.
Reinforcement learning (RL): the system or agent has to learn how to interact with its environment
RL is closely related to the Markov Decision Process (MDP)
Markov Decision Process
The MDP is the sequence of random variables $(X_n)$ which describes the stochastic evolution of the system states. Of course the distribution of $(X_n)$ depends on the chosen actions.
E denotes the state space of the system. A state x ∈ E is the information which is available for the controller at time n. Given this information an action has to be selected.
A denotes the action space. Given a specific state x ∈ E at time n, only a certain subclass $D_n(x) \subset A$ of actions may be admissible.
$Q_n(B \mid x, a)$ is a stochastic transition kernel which gives the probability that the next state at time n+1 is in the set B if the current state is x and action a is taken at time n.
$r_n(x, a)$ gives the (discounted) one-stage reward of the system at time n if the current state is x and action a is taken.
$g_N(x)$ gives the (discounted) terminal reward of the system at the end of the planning horizon.
A control π is a sequence of decision rules $(f_n)$ with $f_n: E \to A$, where $f_n(x) \in D_n(x)$ determines for each possible state $x \in E$ the next action $f_n(x)$ at time n. Such a sequence $\pi = (f_n)$ is called a policy or strategy.
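A minimal numpy sketch of a small finite-horizon tabular MDP solved by backward induction; the state space E, action space A, (time-homogeneous) kernel Q, reward r and terminal reward g below are toy values chosen purely for illustration.

```python
import numpy as np

n_states, n_actions, N = 3, 2, 5
rng = np.random.default_rng(0)

Q = rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))  # Q[a, x, x']
r = rng.uniform(0, 1, size=(n_states, n_actions))                 # r(x, a)
g = np.zeros(n_states)                                            # g_N(x)

V = g.copy()                              # value function at the horizon
policy = np.zeros((N, n_states), dtype=int)
for n in reversed(range(N)):              # backward induction over stages
    # q_values[x, a] = r(x, a) + E[V(X_{n+1}) | x, a]
    q_values = r + np.einsum("axy,y->xa", Q, V)
    policy[n] = np.argmax(q_values, axis=1)   # decision rule f_n(x)
    V = q_values.max(axis=1)

print("optimal first-stage decision rule f_0:", policy[0])
print("optimal value function V_0:", V)
```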
complete state observation vs. partial state observation
problems with constraints vs. without constraints
total (discounted) cost criterion vs. average cost criterion
Research questions:
Does an optimal policy exist?
Has it a particular form?
Can an optimal policy be computed efficiently?
Is it possible to derive properties of the optimal value function analytically?
Applications of MDP: Consumption Problem
Suppose there is an investor with given initial capital. At the beginning of each of N periods she can decide how much of the capital she consumes and how much she invests into a risky asset. Both the amount she consumes and the terminal wealth are evaluated by a utility function U. The remaining capital is invested into a risky asset, where we assume that the investor is small and thus not able to influence the asset price and that the asset is liquid. How should she consume/invest in order to maximize the sum of her expected discounted utility?
Applications of MDP: Cash Balance or Inventory Problem
Imagine a company which tries to find the optimal level of cash over a finite number of N periods. We assume that there is a random stochastic change in the cash reserve each period (due to withdrawals or earnings). Since the firm does not earn interest from the cash position, there are holding costs for the cash reserve if it is positive, but also interest costs in case it is negative. The cash reserve can be increased or decreased by the management at each decision epoch, which implies transfer costs. What is the optimal cash balance policy?
Applications of MDP: Mean-Variance Problem
Consider a small investor who acts on a given financial market. Her aim is to choose among all portfolios which yield at least a certain expected return (benchmark) after N periods, the one with smallest portfolio variance. What is the optimal investment strategy?
Applications of MDP: Dividend Problem in Risk Theory
Imagine we consider the risk reserve of an insurance company which earns some premia on the one hand but has to pay out possible claims on the other hand. At the beginning of each period the insurer can decide upon paying a dividend. A dividend can only be paid when the risk reserve at that time point is positive. Once the risk reserve becomes negative, we say that the company is ruined and has to stop its business. Which dividend pay-out policy maximizes the expected discounted dividends until ruin?
Applications of MDP: Bandit Problem
Suppose we have two slot machines with unknown success probabilities $\theta_1$ and $\theta_2$. At each stage we have to choose one of the arms. We receive one Euro if the arm wins, else no cash flow appears. How should we play in order to maximize our expected total reward over N trials?
Applications of MDP: Pricing of American Options
In order to find the fair price of an American option and its optimal exercise time, one has to solve an optimal stopping problem. In contrast to a European option the buyer of an American option can choose to exercise any time up to and including the expiration time. Such an optimal stopping problem can be solved in the framework of Markov Decision Processes.
Discussion
Statistical inference vs. Supervised machine learning
Property | Statistical inference | Supervised machine learning
Goal | Causal models with explanatory power | Prediction performance, often with limited explanatory power
Bayesian decision theory
The decision maker, or agent, has a set of possible actions, $\mathcal{A}$, to choose from. Each of these actions has costs and benefits, which will depend on the underlying state of nature $h \in \mathcal{H}$. We can encode this information into a loss function $\ell(h, a)$, which specifies the loss we incur if we take action $a \in \mathcal{A}$ when the state of nature is $h \in \mathcal{H}$.
The posterior expected loss or risk for each possible action:
$R(a \mid x) = \mathbb{E}_{p(h \mid x)}[\ell(h, a)] = \sum_{h \in \mathcal{H}} \ell(h, a)\, p(h \mid x)$
The optimal policy (also called the Bayes estimator) specifies what action to take for each possible observation so as to minimize the risk:
$\pi^*(x) = \operatorname{argmin}_{a \in \mathcal{A}} \mathbb{E}_{p(h \mid x)}[\ell(h, a)]$
Let $U(h, a) = -\ell(h, a)$ be the utility function; then the optimal policy is as follows (maximum expected utility principle):
$\pi^*(x) = \operatorname{argmax}_{a \in \mathcal{A}} \mathbb{E}_{p(h \mid x)}[U(h, a)]$
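A minimal numpy sketch of computing the posterior expected loss R(a|x) for each action and picking the Bayes-optimal one; the posterior and loss table are made up.

```python
import numpy as np

posterior = np.array([0.7, 0.2, 0.1])   # p(h | x) over 3 states of nature
loss = np.array([[0.0, 5.0],            # l(h, a): rows = states of nature,
                 [1.0, 0.0],            #          columns = actions
                 [1.0, 0.0]])

risk = posterior @ loss                  # R(a|x) = sum_h l(h, a) p(h | x)
best_action = int(np.argmin(risk))       # pi*(x)
print("posterior expected loss per action:", risk)
print("Bayes-optimal action:", best_action)
```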
Classification problems
We use Bayesian decision theory to decide the optimal class label to predict given an observed input x∈X.
Zero-one loss
Suppose the states of nature correspond to class labels, so $\mathcal{H} = \mathcal{Y} = \{1, \ldots, C\}$. Furthermore, suppose the actions also correspond to class labels, so $\mathcal{A} = \mathcal{Y}$. In this setting, a very commonly used loss function is
the zero-one loss $\ell_{01}(y^*, \hat{y}) = \mathbb{I}(y^* \neq \hat{y})$, defined as follows:
 | ŷ = 0 | ŷ = 1
y* = 0 | 0 | 1
y* = 1 | 1 | 0
the posterior expected loss
$R(\hat{y} \mid x) = p(\hat{y} \neq y^* \mid x) = 1 - p(\hat{y} = y^* \mid x)$
the optimal policy
$\pi(x) = \operatorname{argmax}_{y \in \mathcal{Y}} p(y \mid x)$
It corresponds to the mode of the posterior distribution, also known as the maximum a posteriori or MAP estimate
ROC curves
Class confusion matrices
For any fixed threshold τ, we consider the following decision rule:
$\hat{y}_\tau(x) = \mathbb{I}(p(y=1 \mid x) \geq 1 - \tau)$
The empirical number of false positives (FP) that arise from using this policy on a set of N labeled examples:
$FP_\tau = \sum_{n=1}^{N} \mathbb{I}(\hat{y}_\tau(x_n) = 1, y_n = 0)$
The empirical number of false negatives (FN)
The empirical number of true positives (TP)
The empirical number of true negatives (TN)
2×2 class confusion matrix $C$: $C_{ij}$ is the number of times an item with true class label i was (mis)classified as having label j.
the true positive rate (TPR), also known as the sensitivity, recall or hit rate
$\mathrm{TPR}_\tau = p(\hat{y}=1 \mid y=1, \tau) = \frac{TP_\tau}{TP_\tau + FN_\tau}$
the false positive rate (FPR), also called the false alarm rate, or the type I error rate
$\mathrm{FPR}_\tau = p(\hat{y}=1 \mid y=0, \tau) = \frac{FP_\tau}{FP_\tau + TN_\tau}$
We can now plot the TPR vs FPR as an implicit function of τ. This is called a receiver operating characteristic or ROC curve.
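A minimal numpy sketch that traces out an ROC curve by sweeping τ in the decision rule above on synthetic labels and scores, and estimates the area under the curve with the trapezoid rule.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)                          # true labels
scores = np.clip(0.3 * y + 0.7 * rng.random(200), 0, 1)   # noisy p(y=1|x)

def roc_point(tau):
    y_hat = (scores >= 1 - tau).astype(int)   # decision rule y_hat_tau(x)
    tp = np.sum((y_hat == 1) & (y == 1))
    fp = np.sum((y_hat == 1) & (y == 0))
    fn = np.sum((y_hat == 0) & (y == 1))
    tn = np.sum((y_hat == 0) & (y == 0))
    return fp / (fp + tn), tp / (tp + fn)     # (FPR_tau, TPR_tau)

points = np.array([roc_point(t) for t in np.linspace(0, 1, 101)])
fpr, tpr = points[:, 0], points[:, 1]
auc = np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2)   # trapezoid rule
print("AUC estimate:", auc)
```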
Summarizing ROC curves as a scalar
Area Under the Curve (AUC)
higher AUC scores are better
the maximum is 1
Equal Error Rate (EER) or cross-over rate
defined as the value which satisfies FPR = FNR
lower EER scores are better
the minimum is 0
Precision-recall curves
Computing precision and recall
the precision:
$\mathcal{P}(\tau) = p(y=1 \mid \hat{y}=1, \tau) = \frac{TP_\tau}{TP_\tau + FP_\tau}$
the recall
$\mathcal{R}(\tau) = p(\hat{y}=1 \mid y=1, \tau) = \frac{TP_\tau}{TP_\tau + FN_\tau}$
If $\hat{y}_n \in \{0,1\}$ is the predicted label, and $y_n \in \{0,1\}$ is the true label, we can estimate precision and recall using
$\mathcal{P}(\tau) = \frac{\sum_n y_n \hat{y}_n}{\sum_n \hat{y}_n}$, $\quad \mathcal{R}(\tau) = \frac{\sum_n y_n \hat{y}_n}{\sum_n y_n}$
Summarizing PR curves as a scalar
the precision at K score: quote the precision for a fixed recall level, such as the precision of the first K = 10 entities recalled
the interpolated precision: at each recall level, the highest precision achieved at that or any higher recall level
the average precision: the average of the interpolated precision, which is equal to the area under the interpolated PR curve
the mean average precision or mAP: the mean of the AP over a set of different PR curves
F-scores
Definition
$\frac{1}{F_\beta} = \frac{1}{1+\beta^2}\frac{1}{\mathcal{P}} + \frac{\beta^2}{1+\beta^2}\frac{1}{\mathcal{R}}$
$F_\beta = (1+\beta^2)\frac{\mathcal{P}\cdot\mathcal{R}}{\beta^2\mathcal{P} + \mathcal{R}} = \frac{(1+\beta^2)\,TP}{(1+\beta^2)\,TP + \beta^2 FN + FP}$
A special case: β = 1 (the F1 score)
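A minimal numpy sketch computing precision, recall and F_β from made-up binary predictions, following the count-based formulas above.

```python
import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 1])
y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 0, 1, 0])

tp = np.sum((y_pred == 1) & (y_true == 1))
fp = np.sum((y_pred == 1) & (y_true == 0))
fn = np.sum((y_pred == 0) & (y_true == 1))

precision = tp / (tp + fp)
recall = tp / (tp + fn)

def f_beta(beta):
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)

print(precision, recall, f_beta(1.0))   # beta = 1 gives the F1 score
```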
Regression problems
We assume the set of actions and states are both equal to the real line, $\mathcal{A} = \mathcal{H} = \mathbb{R}$.
L2 loss
the ℓ2 loss, also called squared error or quadratic loss
ℓ2(h,a)=(h−a)2
the risk function
$R(a \mid x) = \mathbb{E}[(h-a)^2 \mid x] = \mathbb{E}[h^2 \mid x] - 2a\,\mathbb{E}[h \mid x] + a^2$
the minimum mean squared error estimate or MMSE estimate
$\frac{\partial}{\partial a} R(a \mid x) = -2\,\mathbb{E}[h \mid x] + 2a = 0 \;\Rightarrow\; \pi(x) = \mathbb{E}[h \mid x] = \int h\, p(h \mid x)\, dh$
The ℓ2 loss penalizes deviations from the truth quadratically, and thus is sensitive to outliers.
L1 loss
the ℓ1 loss, also called the absolute error loss
ℓ1(h,a)=∣h−a∣
the optimal estimate is the posterior median
Pr(h<a∗∣x)=Pr(h≥a∗∣x)=0.5
Huber loss
Let r=h−a,
$\ell_\delta(h, a) = \begin{cases} r^2/2 & \text{if } |r| \leq \delta \\ \delta |r| - \delta^2/2 & \text{if } |r| > \delta \end{cases}$
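A minimal numpy sketch comparing the ℓ2, ℓ1 and Huber losses on a few residuals r = h − a, showing that the Huber loss is quadratic near zero and linear in the tails (hence less sensitive to outliers than ℓ2).

```python
import numpy as np

def l2(r):
    return r ** 2

def l1(r):
    return np.abs(r)

def huber(r, delta=1.0):
    return np.where(np.abs(r) <= delta,
                    0.5 * r ** 2,                          # quadratic near zero
                    delta * np.abs(r) - 0.5 * delta ** 2)  # linear in the tails

r = np.array([-10.0, -1.0, -0.1, 0.0, 0.5, 2.0, 10.0])
for name, fn in [("l2", l2), ("l1", l1), ("huber", huber)]:
    print(name, fn(r))
```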
Probabilistic prediction problems
We assume the true “state of nature” is a distribution, h=p(Y∣x), the action is another distribution, a=q(Y∣x), and we want to pick q to minimize E[l(p,q)] for a given x.
KL, cross-entropy and log-loss
A common form of loss functions for comparing two distributions is the Kullback Leibler divergence, or KL divergence, which is defined as follows:
$D_{\mathrm{KL}}(p \,\|\, q) = \sum_{y \in \mathcal{Y}} p(y) \log \frac{p(y)}{q(y)}$
The KL is the extra number of bits we need to use to compress the data due to using the incorrect distribution q.
minimize the cross-entropy
$q^*(Y \mid x) = \operatorname{argmin}_q H\big(p(Y \mid x), q(Y \mid x)\big)$
Proper scoring rules
proper scoring rule: a loss function ℓ satisfying the following
l(p,p)≤l(p,q), with equality iff p=q
Brier score:
$\ell(p, q) = \frac{1}{C}\sum_{c=1}^{C}\big(q(y=c \mid x) - p(y=c \mid x)\big)^2$
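A minimal numpy sketch computing the KL divergence, cross entropy and Brier score for two made-up discrete distributions p (truth) and q (prediction).

```python
import numpy as np

p = np.array([0.6, 0.3, 0.1])    # true distribution p(y = c | x)
q = np.array([0.5, 0.4, 0.1])    # predicted distribution q(y = c | x)

kl = np.sum(p * np.log(p / q))             # D_KL(p || q)
cross_entropy = -np.sum(p * np.log(q))     # H(p, q) = H(p) + D_KL(p || q)
entropy = -np.sum(p * np.log(p))           # H(p)
brier = np.mean((q - p) ** 2)              # Brier score

print(kl, cross_entropy - entropy, brier)  # the first two numbers agree
```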
Choosing the "right" model
Bayesian hypothesis testing
two hypotheses / models
null hypothesis: M0
alternative hypothesis: M1
0-1 loss
Bayes factor: the ratio of marginal likelihoods of the two models
$B_{1,0} = \frac{p(\mathcal{D} \mid M_1)}{p(\mathcal{D} \mid M_0)}$
Bayes factor BF(1,0) | Interpretation
BF < 1/100 | Decisive evidence for M0
BF < 1/10 | Strong evidence for M0
1/10 < BF < 1/3 | Moderate evidence for M0
1/3 < BF < 1 | Weak evidence for M0
1 < BF < 3 | Weak evidence for M1
3 < BF < 10 | Moderate evidence for M1
BF > 10 | Strong evidence for M1
BF > 100 | Decisive evidence for M1
Example: Testing if a coin is fair
test if a coin is fair
H0: fair, i.e. θ = 0.5
H1: unfair, i.e. θ∈[0,1]
marginal likelihood under
$M_0$: $p(\mathcal{D} \mid M_0) = \left(\frac{1}{2}\right)^N$
$M_1$ with a Beta prior: $p(\mathcal{D} \mid M_1) = \int p(\mathcal{D} \mid \theta)\, p(\theta)\, d\theta = \frac{B(\alpha_1 + N_1, \alpha_0 + N_0)}{B(\alpha_1, \alpha_0)}$
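A minimal sketch of this coin example, assuming scipy is available for the log-Beta function; the head/tail counts and the Beta(1, 1) prior are made-up choices.

```python
import numpy as np
from scipy.special import betaln   # log of the Beta function, for numerical stability

N1, N0 = 47, 53                    # hypothetical numbers of heads / tails
alpha1, alpha0 = 1.0, 1.0          # uniform Beta(1, 1) prior under M1

log_m0 = (N1 + N0) * np.log(0.5)                                    # log p(D | M0)
log_m1 = betaln(alpha1 + N1, alpha0 + N0) - betaln(alpha1, alpha0)  # log p(D | M1)

bayes_factor_10 = np.exp(log_m1 - log_m0)   # B_{1,0} = p(D | M1) / p(D | M0)
print("BF(1,0) =", bayes_factor_10)         # < 1 here, i.e. evidence for the fair coin
```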
Bayesian model selection
Model Selection: pick the most appropriate model from a set $\mathcal{M}$ of more than 2 models
$\hat{m} = \operatorname{argmax}_{m \in \mathcal{M}} p(m \mid \mathcal{D})$
posterior
$p(m \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid m)\, p(m)}{\sum_{m' \in \mathcal{M}} p(\mathcal{D} \mid m')\, p(m')}$
uniform prior
$p(m) = \frac{1}{|\mathcal{M}|}$
marginal likelihood
p(D∣m)=∫p(D∣θ,m)p(θ∣m)dθ
Example: polynomial regression
Occam's razor
Occam’s razor: the simpler the better (for the same marginal likelihood)
Bayesian Occam’s razor effect: the marginal likelihood will prefer the simpler model.
Connection between cross validation and marginal likelihood
Marginal likelihood is closely related to the leave-one-out cross-validation (LOO-CV) estimate.
the marginal likelihood can be difficult to compute
the result can be quite sensitive to the choice of prior
The Bayesian information criterion (BIC)
The Bayesian information criterion or BIC can be thought of as a simple approximation to the log marginal likelihood.
$\log p(\mathcal{D} \mid m) \approx \log p(\mathcal{D} \mid \hat{\theta}_{\mathrm{map}}) + \log p(\hat{\theta}_{\mathrm{map}}) - \frac{1}{2}\log|\mathbf{H}|$
where $\mathbf{H}$ is the Hessian of the negative log joint, $-\log p(\mathcal{D}, \theta)$, evaluated at the MAP estimate $\hat{\theta}_{\mathrm{map}}$.
Assuming uniform prior, p(θ)∝1, we can drop the prior term, and replace the MAP estimate with the MLE, θ^
$\log p(\mathcal{D} \mid m) \approx \log p(\mathcal{D} \mid \hat{\theta}) - \frac{1}{2}\log|\mathbf{H}|$
the BIC score
$J_{\mathrm{BIC}}(m) = \log p(\mathcal{D} \mid \hat{\theta}) - \frac{D_m}{2}\log N$, where $D_m$ is the number of free parameters of model m
the BIC loss
$\mathcal{L}_{\mathrm{BIC}}(m) = -2\log p(\mathcal{D} \mid \hat{\theta}) + D_m \log N$
Akaike information criterion
$\mathcal{L}_{\mathrm{AIC}}(m) = -2\log p(\mathcal{D} \mid \hat{\theta}, m) + 2D$
This penalizes complex models less heavily than BIC, since the regularization term is independent of N.
This estimator can be derived from a frequentist perspective.
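A minimal numpy sketch computing the BIC and AIC losses for polynomial regression models of increasing degree under a Gaussian likelihood; the data are synthetic and the parameter count D_m (weights plus noise variance) is an assumption of this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 50
x = rng.uniform(0, 1, N)
y = np.sin(2 * np.pi * x) + 0.3 * rng.standard_normal(N)

for degree in [1, 3, 9]:
    w = np.polyfit(x, y, degree)
    resid = y - np.polyval(w, x)
    sigma2 = np.mean(resid ** 2)                           # MLE of the noise variance
    log_lik = -0.5 * N * (np.log(2 * np.pi * sigma2) + 1)  # log p(D | theta_hat)
    D_m = degree + 2                                       # weights + noise variance (assumed)
    bic = -2 * log_lik + D_m * np.log(N)
    aic = -2 * log_lik + 2 * D_m
    print(f"degree {degree}: BIC {bic:.1f}, AIC {aic:.1f}")
```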
Minimum description length (MDL)
$\mathcal{L}_{\mathrm{MDL}}(m) = -\log p(\mathcal{D} \mid \hat{\theta}, m) + C(m)$
Frequentist decision theory
Computing the risk of an estimator
We define the frequentist risk of an estimator π given an unknown state of nature θ to be the expected loss when applying that estimator to data x sampled from the likelihood function p(x∣θ): $R(\pi, \theta) \triangleq \mathbb{E}_{p(x \mid \theta)}[\ell(\theta, \pi(x))]$
$f^{**} = \operatorname{argmin}_f R(f)$: the function that achieves the minimal possible population risk
$f^* = \operatorname{argmin}_{f \in \mathcal{H}} R(f)$: the best function in the hypothesis space
$f_N^* = \operatorname{argmin}_{f \in \mathcal{H}} R(f, \mathcal{D})$: the prediction function that minimizes the empirical risk in the hypothesis space
Error decomposition into approximation error and estimation error (generalization error): $\mathbb{E}_{p^*}[R(f_N^*) - R(f^{**})] = \underbrace{R(f^*) - R(f^{**})}_{\mathcal{E}_{\mathrm{app}}(\mathcal{H})} + \underbrace{\mathbb{E}_{p^*}[R(f_N^*) - R(f^*)]}_{\mathcal{E}_{\mathrm{est}}(\mathcal{H}, N)}$
generalization gap: $\mathbb{E}_{p^*}[R(f_N^*) - R(f^*)] \approx \mathbb{E}_{p_{\mathrm{te}}}[\ell(y, f_N^*(x))] - \mathbb{E}_{p_{\mathrm{tr}}}[\ell(y, f_N^*(x))]$, the difference between the risk on test data and on training data
Regularized risk
regularized empirical risk Rλ(f,D)=R(f,D)+λC(f)
C(f) measures the complexity of the prediction function
λ≥0 is known as a hyperparameter
parametric function form Rλ(θ,D)=R(θ,D)+λC(θ)
log loss with a negative log prior regularizer: $R_\lambda(\theta, \mathcal{D}) = -\frac{1}{N}\sum_{n=1}^{N}\log p(y_n \mid x_n, \theta) - \lambda \log p(\theta)$
Minimizing this is equivalent to MAP estimation.
Structural risk
can we pick λ by minimizing the regularized empirical risk, $\hat{\lambda} = \operatorname{argmin}_{\lambda}\min_{\theta} R_\lambda(\theta, \mathcal{D})$?
It does not work (optimism of the training error): since $R_\lambda(\theta, \mathcal{D}) = R(\theta, \mathcal{D}) + \lambda C(\theta)$, the training objective is always minimized by choosing $\lambda = 0$.
structural risk minimization: If we knew the regularized population risk Rλ(θ), instead of the regularized empirical risk Rλ(θ,D), we could use it to pick a model of the right complexity (e.g., value of λ)
two methods to estimate the population risk for a given model (value of λ)
cross-validation
statistical learning theory
Cross-validation
partition the dataset into two
training set $\mathcal{D}_{\mathrm{train}}$: to fit/train the model
validation set $\mathcal{D}_{\mathrm{valid}}$, or holdout set: to assess the risk
the empirical risk on the dataset: $R_\lambda(\theta, \mathcal{D}) = \frac{1}{|\mathcal{D}|}\sum_{(x,y)\in\mathcal{D}} \ell(y, f(x;\theta)) + \lambda C(\theta)$
$\hat{\theta}_\lambda(\mathcal{D}_{\mathrm{train}}) = \operatorname{argmin}_\theta R_\lambda(\theta, \mathcal{D}_{\mathrm{train}})$
validation risk: use the unregularized empirical risk on the validation set as an estimate of the population risk, $R_\lambda^{\mathrm{val}} = R_0(\hat{\theta}_\lambda(\mathcal{D}_{\mathrm{train}}), \mathcal{D}_{\mathrm{valid}})$
K-folds cross validation (CV):
split the training data into K folds
for each fold k∈{1,…,K}, we train on all the folds but the k’th
Entropy
The entropy of a probability distribution can be interpreted as a measure of uncertainty, or lack of predictability, associated with a random variable drawn from a given distribution. We can also use entropy to define the information content of a data source.
Entropy of p | Hard/Easy to predict Xn | Information content of D
high | hard | high
low | easy | low
Entropy for discrete random variables
The entropy of a discrete random variable X with distribution p over K states is defined by
$H(X) = -\sum_{k=1}^{K} p(X=k)\log_2 p(X=k) = -\mathbb{E}_X[\log p(X)]$
When we use log base 2, the units are called bits (short for binary digits).
When we use log base e, the units are called nats.
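A minimal numpy sketch computing the entropy of a discrete distribution in bits and in nats.

```python
import numpy as np

p = np.array([0.5, 0.25, 0.125, 0.125])

H_bits = -np.sum(p * np.log2(p))    # log base 2: bits
H_nats = -np.sum(p * np.log(p))     # log base e: nats
print(H_bits, H_nats, H_nats / np.log(2))   # last value equals H_bits
```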
Cross entropy
The cross entropy between distribution p and q is defined by
$H(p, q) = -\sum_{k=1}^{K} p_k \log q_k$
The cross entropy is the expected number of bits needed to compress some data samples drawn from distribution p using a code based on distribution q. This can be minimized by setting q=p, in which case the expected number of bits of the optimal code is H(p,p)=H(p) — this is known as Shannon’s source coding theorem.
Joint entropy
The joint entropy of two random variables X and Y is defined as
$H(X, Y) = -\sum_{x,y} p(x, y)\log_2 p(x, y)$
For example, consider choosing an integer from 1 to 8, n∈{1,...,8}. Let X(n)=1 if n is even, and Y(n)=1 if n is prime:
In general, the inequality $H(X, Y) \leq H(X) + H(Y)$ always holds. If X and Y are independent, then $H(X, Y) = H(X) + H(Y)$, so the bound is tight. This makes intuitive sense: when the parts are correlated in some way, it reduces the “degrees of freedom” of the system, and hence reduces the overall entropy.
Another observation is that
$H(X, Y) \geq \max\{H(X), H(Y)\} \geq 0$
this says combining variables together does not make the entropy go down: you cannot reduce uncertainty merely by adding more unknowns to the problem, you need to observe some data.
Conditional entropy
The conditional entropy of Y given X is the uncertainty we have in Y after seeing X, averaged over possible values for X:
$H(Y \mid X) = \mathbb{E}_{p(X)}\big[H(p(Y \mid X))\big] = H(X, Y) - H(X)$
It is straightforward to verify that:
H(Y∣X)=H(X,Y)−H(X)≤H(X)+H(Y)−H(X)=H(Y)
H(X1,X2,…,Xn)=∑i=1nH(Xi∣X1,…,Xi−1)
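A minimal numpy sketch for the even/prime example above: it builds the joint distribution of X and Y over n ∈ {1,...,8}, then checks H(Y|X) = H(X,Y) − H(X) ≤ H(Y).

```python
import numpy as np

ns = np.arange(1, 9)
X = (ns % 2 == 0).astype(int)              # X(n) = 1 if n is even
Y = np.isin(ns, [2, 3, 5, 7]).astype(int)  # Y(n) = 1 if n is prime

joint = np.zeros((2, 2))                   # p(x, y) over the 8 equally likely n
for x, y in zip(X, Y):
    joint[x, y] += 1 / len(ns)

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

H_XY = entropy(joint.ravel())
H_X = entropy(joint.sum(axis=1))
H_Y = entropy(joint.sum(axis=0))
print("H(X,Y) =", H_XY, " H(X) =", H_X, " H(Y) =", H_Y)
print("H(Y|X) =", H_XY - H_X, "<=", H_Y)
```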
Perplexity
The perplexity of a discrete probability distribution p is defined as
$\mathrm{perplexity}(p) = 2^{H(p)}$
It is often interpreted as a measure of predictability. Suppose we have an empirical distribution based on data D:
$p_{\mathcal{D}}(x \mid \mathcal{D}) = \frac{1}{N}\sum_{n=1}^{N}\delta_{x_n}(x)$
We can measure how well p predicts D by computing
$\mathrm{perplexity}(p_{\mathcal{D}}, p) = 2^{H(p_{\mathcal{D}}, p)}$
Differential entropy for continuous random variables *
If X is a continuous random variable with pdf p(x), we define the differential entropy as
$h(X) = -\int_{\mathcal{X}} p(x)\log p(x)\, dx$
Differential entropy can be negative since pdf’s can be bigger than 1.
The directional derivative measures how much the function $f: \mathbb{R}^n \to \mathbb{R}$ changes along a direction v in space. It is defined as follows
$D_v f(x) = \lim_{h \to 0}\frac{f(x + hv) - f(x)}{h}$
Note that the directional derivative along v is the scalar product of the gradient g and the vector v:
Dvf(x)=∇f(x)⋅v
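A minimal numpy sketch checking this identity numerically for a made-up function f, using a central finite difference for the directional derivative.

```python
import numpy as np

def f(x):
    return x[0] ** 2 + 3 * x[0] * x[1]          # f(x0, x1) = x0^2 + 3*x0*x1

def grad_f(x):
    return np.array([2 * x[0] + 3 * x[1], 3 * x[0]])

x = np.array([1.0, 2.0])
v = np.array([0.6, 0.8])
h = 1e-6

finite_diff = (f(x + h * v) - f(x - h * v)) / (2 * h)   # directional derivative
analytic = grad_f(x) @ v                                # gradient dot v
print(finite_diff, analytic)                            # both approximately 7.2
```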
Total derivative
Suppose that some of the arguments to the function depend on each other. Concretely, suppose the function has the form f(t,x(t),y(t)). We define the total derivative of f wrt t as follows:
$\frac{df}{dt} = \frac{\partial f}{\partial t} + \frac{\partial f}{\partial x}\frac{dx}{dt} + \frac{\partial f}{\partial y}\frac{dy}{dt}$
If we multiply both sides by the differential dt, we get the total differential
$df = \frac{\partial f}{\partial t}dt + \frac{\partial f}{\partial x}dx + \frac{\partial f}{\partial y}dy$
This measures how much f changes when we change t, both via the direct effect of t on f, but also indirectly, via the effects of t on x and y.
Jacobian
Consider a function that maps a vector to another vector, $f: \mathbb{R}^n \to \mathbb{R}^m$. The Jacobian matrix of this function is an $m \times n$ matrix of partial derivatives, with entries $(J_f(x))_{ij} = \frac{\partial f_i}{\partial x_j}$.