polynomial transform (in 1d): $\phi(x) = [1, x, x^2, x^3, \ldots]$, so $f(x;\theta) = W\phi(x) + b$
deep neural networks or DNNs
to endow the feature extractor with its own parameters, $\theta_2$: $f(x;\theta) = W\phi(x;\theta_2) + b$
$\theta = (\theta_1, \theta_2)$, where $\theta_1 = (W, b)$
repeat this process recursively, to create more and more complex functions: $f(x;\theta) = f_L(f_{L-1}(\cdots(f_1(x))\cdots))$
where $f_\ell(x) = f(x;\theta_\ell)$ is the function at layer $\ell$
The term "DNN" actually encompasses a larger family of models, in which we compose differentiable functions into any kind of DAG (directed acyclic graph, 有向无环图), mapping input to output.
feedforward neural network (FFNN) or multilayer perceptron (MLP): the DAG is a chain.
An MLP is for "structured data" or "tabular data": $x \in \mathbb{R}^D$
each column (feature) has a specific meaning
DNNs for “unstructured data”
“unstructured data”: images, text
the input data is variable sized
each individual element (e.g., pixel or word) is often meaningless on its own
DNNs
convolutional neural networks (CNN)→ images
recurrent neural networks (RNN) and transformers→ sequences
graph neural networks (GNN)→ graphs.
Multilayer perceptrons (MLPs)
perceptron
is a deterministic version of logistic regression
the functional form: $f(x;\theta) = \mathbb{I}(w^\top x + b \ge 0) = H(w^\top x + b)$
$H(a)$: the Heaviside step function, also known as a linear threshold function
perceptrons are very limited in what they can represent due to their linear decision boundaries
The XOR problem
to learn a function that computes the exclusive OR (XOR) of its two binary inputs
The truth table for the function:
x1  x2  y
0   0   0
0   1   1
1   0   1
1   1   0
It is clear that the data is not linearly separable, so a perceptron cannot represent this mapping.
this problem can be overcome by stacking multiple perceptrons on top of each other: multilayer perceptron (MLP)
the first hidden unit (AND operation) computes $h_1 = x_1 \wedge x_2$
the second hidden unit (OR operation) computes $h_2 = x_1 \vee x_2$
the third unit computes the output $y = \bar{h}_1 \wedge h_2$
where $\bar{h} = \neg h$ denotes the NOT (logical negation) operation
the output value: $y = f(x_1, x_2) = \overline{(x_1 \wedge x_2)} \wedge (x_1 \vee x_2)$
This is equivalent to the XOR function
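As a sanity check, here is a minimal numpy sketch of this construction; the particular weights and thresholds are illustrative choices (any values implementing AND, OR, and NOT-AND work):

```python
# A hand-wired MLP of Heaviside perceptrons that computes XOR.
import numpy as np

def heaviside(a):
    return (a >= 0).astype(float)

def perceptron(x, w, b):
    return heaviside(np.dot(w, x) + b)

def xor_mlp(x1, x2):
    x = np.array([x1, x2], dtype=float)
    h1 = perceptron(x, np.array([1.0, 1.0]), -1.5)  # AND: fires only if x1 + x2 >= 1.5
    h2 = perceptron(x, np.array([1.0, 1.0]), -0.5)  # OR:  fires if x1 + x2 >= 0.5
    # output unit: (NOT h1) AND h2, via a negative weight on h1
    return perceptron(np.array([h1, h2]), np.array([-1.0, 1.0]), -0.5)

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x1, x2, int(xor_mlp(x1, x2)))  # reproduces the truth table above
```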
An MLP can represent any logical function. However, we obviously want to avoid having to specify the weights and biases by hand. In the rest of this chapter, we discuss ways to learn these parameters from data.
Differentiable MLPs
MLPs as defined above are difficult to train
the Heaviside function is non-differentiable
A natural solution: replace the Heaviside function with a differentiable one (i.e., an activation function $\varphi: \mathbb{R} \to \mathbb{R}$)
the big idea:
define the hidden units $z_\ell$ at each layer $\ell$ to be a linear transformation of the hidden units at the previous layer, passed elementwise through this activation function
functional form: $z_\ell = f_\ell(z_{\ell-1}) = \varphi_\ell(b_\ell + W_\ell z_{\ell-1})$
in scalar form: $z_{k\ell} = \varphi_\ell\left(b_{k\ell} + \sum_{j=1}^{K_{\ell-1}} w_{jk\ell}\, z_{j,\ell-1}\right)$
pre-activations: the quantity that is passed to the activation function, $a_\ell = b_\ell + W_\ell z_{\ell-1}$
so $z_\ell = \varphi_\ell(a_\ell)$
backpropagation:
compose L of these functions together
compute the gradient of the output wrt the parameters in each layer using the chain rule
pass the gradient to an optimizer to minimize some training objective
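A minimal numpy sketch of this layered forward pass (layer sizes, weights, and the tanh nonlinearity are illustrative; in practice an autodiff framework computes the backward pass):

```python
# Forward pass of an MLP: z_l = phi_l(b_l + W_l z_{l-1}).
import numpy as np

def mlp_forward(x, params, phi=np.tanh):
    """params is a list of (W_l, b_l) pairs; returns the final layer's activations."""
    z = x
    for W, b in params:
        a = b + W @ z   # pre-activations a_l = b_l + W_l z_{l-1}
        z = phi(a)      # activations z_l = phi_l(a_l), applied elementwise
    return z

rng = np.random.default_rng(0)
sizes = [2, 4, 4, 1]  # 2-d input, two hidden layers of 4 units, scalar output
params = [(0.5 * rng.normal(size=(m, n)), np.zeros(m))
          for n, m in zip(sizes[:-1], sizes[1:])]
print(mlp_forward(np.array([0.5, -1.0]), params))
```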
Activation functions
a linear activation function, $\varphi_\ell(a) = c_\ell a$, makes the whole model reduce to a regular linear model: $f(x;\theta) = W_L c_L (W_{L-1} c_{L-1}(\cdots(W_1 x)\cdots)) \propto W_L W_{L-1} \cdots W_1 x = W' x$
hence we must use nonlinear activation functions.
sigmoid (logistic) function
a smooth approximation to the Heaviside function used in a perceptron: $\sigma(a) = \frac{1}{1 + e^{-a}}$
it saturates at 1 for large positive inputs, and at 0 for large negative inputs
Because of these properties of the logistic function, neurons equipped with a logistic activation have two useful properties:
their output can be interpreted directly as a probability, which makes it easy to combine neural networks with statistical models
they can act as a soft gate, controlling how much information from other neurons is passed through
the tanh function: $\tanh(x) = \frac{\exp(x) - \exp(-x)}{\exp(x) + \exp(-x)}$
an MLP with two hidden layers applied to a 2d input vector (Figure 13.3)
the model:
$p(y|x;\theta) = \mathrm{Ber}(y \mid \sigma(a_3))$
$a_3 = w_3^\top z_2 + b_3$
$z_2 = \varphi(W_2 z_1 + b_2)$
$z_1 = \varphi(W_1 x + b_1)$
$a_3 = w_3^\top z_2 + b_3$ is the final logit score, which is converted to a probability via the sigmoid (logistic) function
layer 2 is computed by taking a nonlinear combination of the 4 hidden units in layer 1, using $z_2 = \varphi(W_2 z_1 + b_2)$
layer 1 is computed by taking a nonlinear combination of the 2 input units, using $z_1 = \varphi(W_1 x + b_1)$
By adjusting the parameters, $\theta = (W_1, b_1, W_2, b_2, w_3, b_3)$, to minimize the negative log likelihood, we can fit the training data very well
MLP for image classification
"flatten" the 2 d input into 1 d vector: 28×28=784 dimensional
NN structure: use 2 hidden layers with 128 units each, followed by a final 10 way softmax layer
training results
train for just two "epochs" (passes over the dataset)
test set accuracy of 97.1%
the errors seem sensible, e.g., a 9 is mistaken for a 3
Training for more epochs can further improve test accuracy
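A minimal sketch of this setup, assuming Keras/TensorFlow (the optimizer and preprocessing are illustrative defaults; the exact accuracy will vary):

```python
# Two hidden layers of 128 units, 10-way softmax, trained for two epochs on MNIST.
import tensorflow as tf

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0  # scale pixels to [0, 1]

model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),    # 28x28 image -> 784-d vector
    tf.keras.layers.Dense(128, activation="relu"),    # hidden layer 1
    tf.keras.layers.Dense(128, activation="relu"),    # hidden layer 2
    tf.keras.layers.Dense(10, activation="softmax"),  # 10-way softmax output
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x_train, y_train, epochs=2)   # two passes over the dataset
model.evaluate(x_test, y_test)          # test accuracy around 0.97
```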
MLP for text classification
convert the variable-length sequence of words $v_1, \ldots, v_T$ into a fixed-dimensional vector $x$
each $v_t$ is a one-hot vector of length $V$
$V$ is the vocabulary size
the method
treat the input as an unordered bag of words $\{v_t\}$
the first layer of the model is an $E \times V$ embedding matrix $W_1$, which converts each sparse $V$-dimensional vector to a dense $E$-dimensional embedding, $e_t = W_1 v_t$
convert this set of $T$ $E$-dimensional embeddings into a fixed-sized vector using global average pooling, $\bar{e} = \frac{1}{T}\sum_{t=1}^T e_t$
example:
a single hidden layer
a logistic output (for binary classification); we get:
$p(y|x;\theta) = \mathrm{Ber}(y \mid \sigma(w_3^\top h + b_3))$
$h = \varphi(W_2 \bar{e} + b_2)$
$\bar{e} = \frac{1}{T}\sum_{t=1}^T e_t$
$e_t = W_1 v_t$
NN setting & the training result
vocabulary size: $V = 1000$
embedding size: $E = 16$
hidden layer of size 16
we get 86% on the validation set.
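A minimal Keras sketch of this model, assuming the text has already been tokenized into padded sequences of integer word ids (the tokenizer is not shown):

```python
# Embedding -> global average pooling -> small MLP -> sigmoid output.
import numpy as np
import tensorflow as tf

V, E = 1000, 16  # vocabulary size and embedding size, as above
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(V, E),                  # e_t = W1 v_t (lookup)
    tf.keras.layers.GlobalAveragePooling1D(),         # e_bar = (1/T) sum_t e_t
    tf.keras.layers.Dense(16, activation="relu"),     # hidden layer h
    tf.keras.layers.Dense(1, activation="sigmoid"),   # Ber(y | sigmoid(w3^T h + b3))
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

dummy = np.array([[4, 25, 3, 0]])  # one illustrative "sentence" of word ids
print(model.predict(dummy))        # a probability in (0, 1)
```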
MLP for heteroskedastic regression
a model for heteroskedastic nonlinear regression
outputs:
$f_\mu(x) = \mathbb{E}[y \mid x, \theta]$
$f_\sigma(x) = \mathbb{V}[y \mid x, \theta]$
The two heads:
for the $\mu$ head, we use a linear activation, $\varphi(a) = a$
for the $\sigma$ head, we use a softplus activation, $\varphi(a) = \sigma_+(a) = \log(1 + e^a)$
using linear heads and a nonlinear backbone, the overall model is given by $p(y|x,\theta) = \mathcal{N}\left(y \mid w_\mu^\top f(x; w_{\text{shared}}),\ \sigma_+(w_\sigma^\top f(x; w_{\text{shared}}))\right)$
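A minimal numpy sketch of this two-headed forward pass (the backbone width and tanh nonlinearity are illustrative):

```python
# Shared backbone, linear mu head, softplus sigma head.
import numpy as np

def softplus(a):
    return np.log1p(np.exp(a))  # sigma_+(a) = log(1 + e^a), always positive

def hetero_forward(x, W_shared, b_shared, w_mu, w_sigma):
    f = np.tanh(W_shared @ x + b_shared)  # shared backbone f(x; w_shared)
    mu = w_mu @ f                         # linear activation for the mu head
    sigma = softplus(w_sigma @ f)         # softplus keeps the sigma head positive
    return mu, sigma                      # the parameters of N(y | mu, sigma)

rng = np.random.default_rng(1)
W, b = rng.normal(size=(8, 1)), np.zeros(8)
print(hetero_forward(np.array([0.3]), W, b, rng.normal(size=8), rng.normal(size=8)))
```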
stochastic volatility model
properties
the mean grows linearly over time
seasonal oscillations
the variance increases quadratically
applications
financial data
global temperature of the earth
The importance of depth
an MLP with one hidden layer is a universal function approximator
deep networks work better than shallow ones
the benefit of learning in a compositional or hierarchical way
Example:
classify DNA strings
the positive class is associated with the regular expression AA??CGCG??AA
it will be easier to learn if the model first learns to detect the AA and CG “motifs” using the hidden units in layer 1
then uses these features to define a simple linear classifier in layer 2
The "deep learning revolution"
some successful stories about DNNs
automatic speech recognition (ASR)
ImageNet image classification benchmark: reducing the error rate from 26% to 16% in a single year
The “explosion” in the usage of DNNs
the availability of cheap GPUs (graphics processing units)
the growth in large labeled datasets
high quality open-source software libraries for DNNs
McCulloch-Pitts model of the neuron (1943): $h_k(x) = H(w_k^\top x - b_k)$, where $H(a) = \mathbb{I}(a > 0)$
the inputs $x \in \mathbb{R}^D$
the strength of the incoming connections $w_k \in \mathbb{R}^D$
the weighted sum of the inputs (dendrites), $a_k = w_k^\top x$
the threshold (action potential), $b_k$
$h_k = 1$ → the neuron fires
We can combine multiple such neurons together to make an artificial neural network (ANN)
ANNs differ from biological brains in many ways, including the following:
Most ANNs use backpropagation to modify the strength of their connections while real brains do not use backprop
there is no way to send information backwards along an axon
they use local update rules for adjusting synaptic strengths
Most ANNs are strictly feedforward, but real brains have many feedback connections
It is believed that this feedback acts like a prior
Most ANNs use simplified neurons consisting of a weighted sum passed through a nonlinearity, but real biological neurons have complex dendritic tree structures (see Figure 13.8), with complex spatio-temporal dynamics.
Most ANNs are smaller in size and number of connections than biological brains
Most ANNs are designed to model a single function while biological brains are very complex systems that implement different kinds of functions or behaviors
Backpropagation
backpropagation
simple linear chain of stacked layers: repeated applications of the chain rule of calculus
arbitrary directed acyclic graphs (DAGs): automatic differentiation or autodiff.
Forward vs reverse mode differentiation
Consider a mapping of the form $o = f(x)$
$x \in \mathbb{R}^n$ and $o \in \mathbb{R}^m$
$f$ is defined as a composition of functions: $f = f_4 \circ f_3 \circ f_2 \circ f_1$
$f_1: \mathbb{R}^n \to \mathbb{R}^{m_1}$, $f_2: \mathbb{R}^{m_1} \to \mathbb{R}^{m_2}$, $f_3: \mathbb{R}^{m_2} \to \mathbb{R}^{m_3}$, and $f_4: \mathbb{R}^{m_3} \to \mathbb{R}^m$
The intermediate steps needed to compute $o = f(x)$ are $x_2 = f_1(x)$, $x_3 = f_2(x_2)$, $x_4 = f_3(x_3)$, and $o = f_4(x_4)$
We can compute the Jacobian $J_f(x) = \frac{\partial o}{\partial x^\top} \in \mathbb{R}^{m \times n}$ using the chain rule: $\frac{\partial o}{\partial x} = \frac{\partial o}{\partial x_4} \frac{\partial x_4}{\partial x_3} \frac{\partial x_3}{\partial x_2} \frac{\partial x_2}{\partial x} = J_{f_4}(x_4)\, J_{f_3}(x_3)\, J_{f_2}(x_2)\, J_{f_1}(x)$
we only need to consider how to compute the Jacobian efficiently
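A small JAX sketch contrasting the two modes on an illustrative composition: forward mode builds the Jacobian one column (one JVP) at a time, so it is cheap when $n$ is small; reverse mode builds it one row (one VJP) at a time, so it is cheap when $m$ is small, which is the usual case for scalar losses:

```python
# Both modes compute the same Jacobian; only the cost profile differs.
import jax
import jax.numpy as jnp

f1 = lambda x: jnp.sin(x)                      # R^3 -> R^3, elementwise
f2 = lambda x: jnp.tanh(x @ jnp.ones((3, 2)))  # R^3 -> R^2
f = lambda x: f2(f1(x))                        # the composition

x = jnp.array([0.1, 0.2, 0.3])
J_fwd = jax.jacfwd(f)(x)           # forward mode: n = 3 JVPs
J_rev = jax.jacrev(f)(x)           # reverse mode: m = 2 VJPs
print(jnp.allclose(J_fwd, J_rev))  # True: the same 2x3 Jacobian
```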
Computation graphs
Modern DNNs can combine differentiable components in much more complex ways, to create a computation graph, analogous to how programmers combine elementary functions to make more complex ones.
The only restriction is that the resulting computation graph corresponds to a directed acyclic graph (DAG), where each node is a differentiable function of all its inputs.
example: $f(x_1, x_2) = x_2 e^{x_1} \sqrt{x_1 + x_2 e^{x_1}}$
We can compute this using the DAG in Figure 13.11, with the following intermediate functions:
$x_3 = f_3(x_1) = e^{x_1}$
$x_4 = f_4(x_2, x_3) = x_2 x_3$
$x_5 = f_5(x_1, x_4) = x_1 + x_4$
$x_6 = f_6(x_5) = \sqrt{x_5}$
$x_7 = f_7(x_4, x_6) = x_4 x_6$
we have numbered the nodes in topological order (parents before children)
During the backward pass, since the graph is no longer a chain, we may need to sum gradients along multiple paths. For example, since $x_4$ influences $x_5$ and $x_7$, we have $\frac{\partial o}{\partial x_4} = \frac{\partial o}{\partial x_5}\frac{\partial x_5}{\partial x_4} + \frac{\partial o}{\partial x_7}\frac{\partial x_7}{\partial x_4}$
We can avoid repeated computation by working in reverse topological order. For example:
$\frac{\partial o}{\partial x_7} = \frac{\partial x_7}{\partial x_7} = I_m$
$\frac{\partial o}{\partial x_6} = \frac{\partial o}{\partial x_7}\frac{\partial x_7}{\partial x_6}$
$\frac{\partial o}{\partial x_5} = \frac{\partial o}{\partial x_6}\frac{\partial x_6}{\partial x_5}$
$\frac{\partial o}{\partial x_4} = \frac{\partial o}{\partial x_5}\frac{\partial x_5}{\partial x_4} + \frac{\partial o}{\partial x_7}\frac{\partial x_7}{\partial x_4}$
In general, we use $\frac{\partial o}{\partial x_j} = \sum_{k \in \mathrm{children}(j)} \frac{\partial o}{\partial x_k}\frac{\partial x_k}{\partial x_j}$
where the sum is over all children $k$ of node $j$, as shown in Figure 13.12. The gradient vector $\frac{\partial o}{\partial x_k}$ has already been computed for each child $k$; this quantity is called the adjoint. It gets multiplied by the Jacobian $\frac{\partial x_k}{\partial x_j}$ of each child.
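A minimal numpy sketch of this reverse pass on the example DAG above, accumulating adjoints in reverse topological order; note how the adjoint of $x_4$ sums contributions from its two children, $x_5$ and $x_7$:

```python
# Manual reverse-mode differentiation of f(x1, x2) = x2 e^{x1} sqrt(x1 + x2 e^{x1}).
import numpy as np

def f_and_grad(x1, x2):
    # forward pass, in topological order
    x3 = np.exp(x1)
    x4 = x2 * x3
    x5 = x1 + x4
    x6 = np.sqrt(x5)
    x7 = x4 * x6                     # the output o

    # backward pass, in reverse topological order (d_j = do/dx_j, the adjoints)
    d7 = 1.0
    d6 = d7 * x4                     # x7 = x4 * x6
    d5 = d6 * 0.5 / np.sqrt(x5)      # x6 = sqrt(x5)
    d4 = d5 * 1.0 + d7 * x6          # x4 feeds both x5 and x7: sum over children
    d3 = d4 * x2                     # x4 = x2 * x3
    d2 = d4 * x3
    d1 = d5 * 1.0 + d3 * np.exp(x1)  # x1 feeds both x5 and x3: sum over children
    return x7, (d1, d2)

print(f_and_grad(1.0, 2.0))  # value and gradient; can be checked by finite differences
```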
Training neural networks
fit DNNs to data
The standard approach is to use maximum likelihood estimation, by minimizing the NLL: $\mathcal{L}(\theta) = -\log p(\mathcal{D}|\theta) = -\sum_{n=1}^N \log p(y_n | x_n; \theta)$
Tuning the learning rate
It is important to tune the learning rate (step size), to ensure convergence to a good solution. (Section 8.4.3.)
Vanishing and exploding gradients
vanishing gradient problem: when training very deep models, the gradients can become very small
exploding gradient problem: when training very deep models, the gradients can become very large
consider the gradient of the loss wrt a node at layer $l$: $\frac{\partial \mathcal{L}}{\partial z_l} = \frac{\partial \mathcal{L}}{\partial z_{l+1}} \frac{\partial z_{l+1}}{\partial z_l} = J_l\, g_{l+1}$
$J_l = \frac{\partial z_{l+1}}{\partial z_l}$ is the Jacobian matrix
$g_{l+1} = \frac{\partial \mathcal{L}}{\partial z_{l+1}}$ is the gradient at the next layer. If $J_l$ is constant across layers, it is clear that the contribution of the gradient from the final layer, $g_L$, to layer $l$ will be $J^{L-l} g_L$. Thus the behavior of the system depends on the eigenvalues of $J$.
The exploding gradient problem can be ameliorated by gradient clipping, in which we cap the magnitude of the gradient if it becomes too large, i.e., we use $g' = \min\left(1, \frac{c}{\|g\|}\right) g$
This way, the norm of g′ can never exceed c, but the vector is always in the same direction as g.
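A minimal numpy sketch of this rule:

```python
# Clip the gradient by global norm: rescale when ||g|| exceeds the cap c.
import numpy as np

def clip_gradient(g, c):
    norm = np.linalg.norm(g)
    return min(1.0, c / norm) * g if norm > 0 else g

g = np.array([3.0, 4.0])       # norm 5
print(clip_gradient(g, 1.0))   # rescaled to norm 1: [0.6, 0.8]
print(clip_gradient(g, 10.0))  # unchanged: already below the cap
```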
the vanishing gradient problem is more difficult to solve
Modify the activation functions at each layer to prevent the gradient from becoming too large or too small
Modify the architecture so that the updates are additive rather than multiplicative
Modify the architecture to standardize the activations at each layer, so that the distribution of activations over the dataset remains constant during training
Carefully choose the initial values of the parameters
Non-saturating activation functions
the reason for the vanishing gradient problem
setting: $z = \sigma(Wx)$, where $\varphi(a) = \sigma(a) = \frac{1}{1 + \exp(-a)}$
for saturating activation functions: large weights → $a = Wx$ becomes large → $z$ saturates
the derivative of the sigmoid is $\varphi'(a) = \sigma(a)(1 - \sigma(a))$, so:
the gradient of the loss wrt the inputs $x$ (from an earlier layer) is $\frac{\partial \mathcal{L}}{\partial x} = W^\top \delta = W^\top z(1-z)$
the gradient of the loss wrt the parameters is $\frac{\partial \mathcal{L}}{\partial W} = \delta x^\top = z(1-z)x^\top$
if $z$ is near 0 or 1, the gradients will go to 0
One of the keys to being able to train very deep models is to use non-saturating activation functions.
The most common is the rectified linear unit or ReLU: $\mathrm{ReLU}(a) = \max(a, 0) = a\,\mathbb{I}(a > 0)$
The gradient has the following form: $\mathrm{ReLU}'(a) = \mathbb{I}(a > 0)$
the gradient will not vanish, as long as $z$ is positive
suppose we use this in a layer to compute $z = \mathrm{ReLU}(Wx)$
the gradient wrt the inputs has the form $\frac{\partial \mathcal{L}}{\partial x} = W^\top \mathbb{I}(z > 0)$
the gradient wrt the parameters has the form $\frac{\partial \mathcal{L}}{\partial W} = \mathbb{I}(z > 0)\, x^\top$
the “dead ReLU” problem:
if the weights are initialized to be large and negative, then it becomes very easy for (some components of) $a = Wx$ to take on large negative values, and hence for $z$ to go to 0
This will cause the gradient for the weights to go to 0
The algorithm will never be able to escape this situation: the hidden units (components of $z$) will stay permanently off.
Non-saturating ReLU
the leaky ReLU: $\mathrm{LReLU}(a; \alpha) = \max(\alpha a, a)$, where $0 < \alpha < 1$
The slope of this function is 1 for positive inputs, and $\alpha$ for negative inputs, thus ensuring there is some signal passed back to earlier layers, even when the input is negative.
If we allow the parameter α to be learned, rather than fixed, the leaky ReLU is called parametric ReLU
the Exponential Linear Unit (ELU): $\mathrm{ELU}(a; \alpha) = \begin{cases} \alpha(e^a - 1) & \text{if } a \le 0 \\ a & \text{if } a > 0 \end{cases}$
This has the advantage over leaky ReLU of being a smooth function.
SELU (self-normalizing ELU): a slight variant of ELU: $\mathrm{SELU}(a; \alpha, \lambda) = \lambda\, \mathrm{ELU}(a; \alpha)$
by setting α and λ to carefully chosen values, this activation function is guaranteed to ensure that the output of each layer is standardized (provided the input is also standardized)
This can help with model fitting.
The softplus function [Dugas et al., 2001] can be seen as a smooth version of the rectifier, defined as:
$\mathrm{Softplus}(x) = \log(1 + \exp(x))$
Its derivative is the sigmoid: $\mathrm{Softplus}'(x) = \sigma(x)$
Other choices
swish (does well on some image classification benchmarks): $\mathrm{swish}(a; \beta) = a\, \sigma(\beta a)$
also called SiLU (for Sigmoid Linear Unit)
$\sigma(\cdot) \in (0, 1)$ can be viewed as a soft gating mechanism:
when $\sigma(\beta x)$ is close to 1, the gate is "open" and the output is approximately $x$ itself
when $\sigma(\beta x)$ is close to 0, the gate is "closed" and the output is approximately 0
Maxout units
The maxout unit [Goodfellow et al., 2013] is also a piecewise-linear function. Whereas other activation functions take as input the net input $z$ of a neuron in the previous layer, a maxout unit takes the full raw input of the previous layer, $x = [x_1, x_2, \ldots, x_d]$. Each maxout unit has $K$ weight vectors $w_k \in \mathbb{R}^d$ and biases $b_k$ ($1 \le k \le K$): $z_k = w_k^\top x + b_k$
The maxout nonlinearity is defined as:
$\mathrm{maxout}(x) = \max_{k \in [1, K]} z_k$
Gaussian Error Linear Unit (GELU): $\mathrm{GELU}(a) = a\, \Phi(a)$
where $\Phi(a)$ is the cdf of a standard normal: $\Phi(a) = \Pr(\mathcal{N}(0,1) \le a) = \frac{1}{2}\left(1 + \mathrm{erf}(a / \sqrt{2})\right)$
We can think of GELU as a "soft" version of ReLU, since it replaces the step function I(a>0) with the Gaussian cdf, Φ(a).
the GELU can be motivated as an adaptive version of dropout, where we multiply the input by a binary scalar mask, $m \sim \mathrm{Ber}(\Phi(a))$, where the probability of being dropped is given by $1 - \Phi(a)$. Thus the expected output is $\mathbb{E}[a\,m] = \Phi(a) \cdot a + (1 - \Phi(a)) \cdot 0 = a\, \Phi(a)$
We can approximate GELU using swish with a particular parameter setting, namely $\mathrm{GELU}(a) \approx a\, \sigma(1.702 a)$
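Minimal numpy sketches of the activations above, plus a numerical check of the swish approximation to GELU; the leaky-ReLU slope is an illustrative choice, and the SELU constants are the commonly quoted values $\alpha \approx 1.67326$, $\lambda \approx 1.05070$:

```python
import numpy as np
from scipy.special import erf  # for the standard normal cdf

def sigmoid(a):           return 1.0 / (1.0 + np.exp(-a))
def relu(a):              return np.maximum(a, 0.0)
def lrelu(a, alpha=0.1):  return np.maximum(alpha * a, a)
def elu(a, alpha=1.0):    return np.where(a > 0, a, alpha * (np.exp(a) - 1.0))
def selu(a):              return 1.05070 * elu(a, alpha=1.67326)
def softplus(a):          return np.log1p(np.exp(a))
def swish(a, beta=1.0):   return a * sigmoid(beta * a)
def gelu(a):              return a * 0.5 * (1.0 + erf(a / np.sqrt(2.0)))

a = np.linspace(-3.0, 3.0, 601)
print(np.max(np.abs(gelu(a) - swish(a, beta=1.702))))  # small approximation error
```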
Residual connections
residual network or ResNet
One solution to the vanishing gradient problem for DNNs
this is a feedforward model in which each layer has the form of a residual block, defined by $F_l'(x) = F_l(x) + x$
$F_l$ is a standard shallow nonlinear mapping (e.g., linear-activation-linear)
The inner Fl function computes the residual term or delta that needs to be added to the input x to generate the desired output
it is often easier to learn to generate a small perturbation to the input than to directly predict the output.
A model with residual connections has the same number of parameters as a model without residual connections, but it is easier to train
gradients can flow directly from the output to earlier layers (Figure 13.15b)
the activations at the output layer can be derived in terms of any previous layer $l$ using $z_L = z_l + \sum_{i=l}^{L-1} F_i(z_i; \theta_i)$
the gradient of the loss wrt the parameters of the $l$'th layer: $\frac{\partial \mathcal{L}}{\partial \theta_l} = \frac{\partial z_l}{\partial \theta_l} \frac{\partial \mathcal{L}}{\partial z_l} = \frac{\partial z_l}{\partial \theta_l} \frac{\partial \mathcal{L}}{\partial z_L} \frac{\partial z_L}{\partial z_l} = \frac{\partial z_l}{\partial \theta_l} \frac{\partial \mathcal{L}}{\partial z_L} \left(1 + \sum_{i=l}^{L-1} \frac{\partial F_i(z_i; \theta_i)}{\partial z_l}\right) = \frac{\partial z_l}{\partial \theta_l} \frac{\partial \mathcal{L}}{\partial z_L} + \text{other terms}$
Thus we see that the gradient at layer l depends directly on the gradient at layer L in a way that is independent of the depth of the network.
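A minimal numpy sketch of one residual block, with $F_l$ a shallow linear-activation-linear mapping (the input and output dimensions must match so the skip connection can be added):

```python
# One residual block: z_out = F(z) + z.
import numpy as np

def residual_block(z, W1, b1, W2, b2, phi=np.tanh):
    return W2 @ phi(W1 @ z + b1) + b2 + z  # F(z) + z

rng = np.random.default_rng(2)
d = 4
z = rng.normal(size=d)
print(residual_block(z, 0.1 * rng.normal(size=(d, d)), np.zeros(d),
                        0.1 * rng.normal(size=(d, d)), np.zeros(d)))
```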
Regularization
Early stopping
the heuristic of stopping the training procedure when the error on the validation set starts to increase
This method works because we are restricting the ability of the optimization algorithm to transfer information from the training examples to the parameters
Weight decay
impose a prior on the parameters, and then use MAP estimation.
It is standard to use a Gaussian prior for the weights, $\mathcal{N}(w \mid 0, \alpha^2 I)$, and biases, $\mathcal{N}(b \mid 0, \beta^2 I)$.
This is equivalent to ℓ2 regularization of the objective.
this is called weight decay, since it encourages small weights, and hence simpler models, as in ridge regression
Dropout
randomly (on a per-example basis) turn off all the outgoing connections from each neuron with probability p
Dropout can dramatically reduce overfitting and is very widely used.
it prevents complex co-adaptation of the hidden units.
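A minimal numpy sketch of the common "inverted dropout" formulation, which rescales the surviving units at training time so nothing needs to change at test time:

```python
# Training-time dropout: zero each unit with probability p, rescale the rest.
import numpy as np

def dropout(z, p, rng):
    mask = rng.random(z.shape) >= p  # keep each unit with probability 1 - p
    return mask * z / (1.0 - p)      # rescale so the expected output equals z

rng = np.random.default_rng(3)
print(dropout(np.ones(10), p=0.5, rng=rng))  # roughly half the units zeroed
```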
Bayesian neural networks
Modern DNNs are usually trained using a (penalized) maximum likelihood objective to find a single setting of parameters.
with large models, there are often many more parameters than data points
there may be multiple possible models which fit the training data equally well, yet which generalize in different ways.
It is often useful to capture the induced uncertainty in the posterior predictive distribution: $p(y|x, \mathcal{D}) = \int p(y|x, \theta)\, p(\theta|\mathcal{D})\, d\theta$
Bayesian neural network or BNN.
It can be thought of as an infinite ensemble of differently weighted neural networks.
By marginalizing out the parameters, we can avoid overfitting.