L05a Neural Networks for Structured Data

Materials are adapted from Murphy, Kevin P., Probabilistic Machine Learning: An Introduction (MIT Press, 2022). This handout is for teaching only. DO NOT DISTRIBUTE.

Introduction

So far, we have learned:


Multilayer perceptrons (MLPs)


The XOR problem

$x_1$ $x_2$ $y$
0 0 0
0 1 1
1 0 1
1 1 0

An MLP can represent any logical function. However, we obviously want to avoid having to specify the weights and biases by hand. In the rest of this chapter, we discuss ways to learn these parameters from data.
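As a concrete illustration, below is a minimal NumPy sketch of an MLP that computes XOR with hand-chosen weights and biases; the specific weight values and the Heaviside step activation are illustrative assumptions, one of many possible choices.

```python
import numpy as np

def heaviside(a):
    # Step activation: outputs 1 where a >= 0, else 0.
    return (a >= 0).astype(float)

def xor_mlp(x):
    # Hidden layer: one OR-like unit and one NAND-like unit.
    W1 = np.array([[1.0, 1.0],     # OR unit
                   [-1.0, -1.0]])  # NAND unit
    b1 = np.array([-0.5, 1.5])
    h = heaviside(W1 @ x + b1)
    # Output layer: AND of the two hidden units yields XOR.
    w2 = np.array([1.0, 1.0])
    b2 = -1.5
    return heaviside(w2 @ h + b2)

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, int(xor_mlp(np.array(x, dtype=float))))  # 0, 1, 1, 0
```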


Differentiable MLPs


Activation functions


Example models

MLPs can be used to perform classification and regression for many kinds of data. We give some examples below.

Try it for yourself via: https://playground.tensorflow.org

MLP for classifying 2d data into 2 categories

An MLP with two hidden layers applied to a 2d input vector (Figure 13.3).
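The sketch below shows the forward pass of such a network in NumPy; the layer widths, random weights, and activations are illustrative assumptions, not the configuration shown in Figure 13.3.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(a):
    return np.maximum(a, 0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Layer sizes: 2-d input, two hidden layers, one output unit (illustrative).
sizes = [2, 4, 4, 1]
params = [(rng.normal(size=(m, n)), np.zeros(m))
          for n, m in zip(sizes[:-1], sizes[1:])]

def mlp_predict(x):
    # Hidden layers use ReLU; the sigmoid output can be read as p(y = 1 | x).
    h = x
    for W, b in params[:-1]:
        h = relu(W @ h + b)
    W, b = params[-1]
    return sigmoid(W @ h + b)

print(mlp_predict(np.array([0.5, -1.0])))
```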

MLP for image classification

MLP for text classification

MLP for heteroskedastic regression


The importance of depth


The "deep learning revolution"


Connections with biology


Backpropagation


Forward vs reverse mode differentiation


Computation graphs

$\frac{\partial \boldsymbol{o}}{\partial \boldsymbol{x}_{j}}=\sum_{k \in \operatorname{children}(j)} \frac{\partial \boldsymbol{o}}{\partial \boldsymbol{x}_{k}} \frac{\partial \boldsymbol{x}_{k}}{\partial \boldsymbol{x}_{j}}$

where the sum is over all children $k$ of node $j$, as shown in Figure 13.12. The gradient vector $\frac{\partial \boldsymbol{o}}{\partial \boldsymbol{x}_{k}}$ has already been computed for each child $k$; this quantity is called the adjoint. This gets multiplied by the Jacobian $\frac{\partial \boldsymbol{x}_{k}}{\partial \boldsymbol{x}_{j}}$ of each child.
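To make the recursion concrete, here is a NumPy sketch that applies the adjoint-times-Jacobian rule by hand to a small chain graph; the particular graph ($\boldsymbol{x}_1 = \boldsymbol{W}\boldsymbol{x}_0$, $\boldsymbol{x}_2 = \boldsymbol{x}_1^2$ elementwise, $o = \sum_i x_{2,i}$) is an illustrative assumption.

```python
import numpy as np

# Forward pass through a small chain: x1 = W @ x0, x2 = x1**2, o = sum(x2).
W = np.array([[1.0, 2.0],
              [3.0, -1.0]])
x0 = np.array([0.5, -2.0])
x1 = W @ x0
x2 = x1 ** 2
o = x2.sum()

# Reverse pass: each adjoint do/dx_j is the child's adjoint times the
# Jacobian dx_child/dx_j, exactly the recursion described above.
adj_x2 = np.ones_like(x2)            # do/dx2 for o = sum(x2)
adj_x1 = adj_x2 @ np.diag(2 * x1)    # Jacobian of x2 = x1**2 is diag(2 x1)
adj_x0 = adj_x1 @ W                  # Jacobian of x1 = W @ x0 is W

# Check against the closed-form gradient do/dx0 = 2 W^T (W x0).
print(adj_x0, 2 * W.T @ (W @ x0))
```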


Training neural networks


Tuning the learning rate

It is important to tune the learning rate (step size) to ensure convergence to a good solution (see Section 8.4.3).
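As a toy illustration of the effect of the step size, the sketch below runs gradient descent on the quadratic $f(\theta)=\theta^2$ with three learning rates (the loss and the rates are made-up examples): too small is slow, moderate converges, too large diverges.

```python
def gradient_descent(lr, steps=20, theta=5.0):
    # Gradient descent on f(theta) = theta**2, whose gradient is 2 * theta.
    for _ in range(steps):
        theta = theta - lr * 2 * theta
    return theta

for lr in [0.01, 0.1, 1.1]:
    # 0.01: barely moves; 0.1: converges near 0; 1.1: oscillates and diverges.
    print(lr, gradient_descent(lr))
```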


Vanishing and exploding gradients


Non-saturating activation functions

Reason for the vanishing gradient problem

Name Definition Range Reference
Sigmoid $\sigma(a)=\frac{1}{1+e^{-a}}$ $[0,1]$
Hyperbolic tangent $\tanh(a)=2\sigma(2a)-1$ $[-1,1]$
Softplus $\sigma_{+}(a)=\log(1+e^{a})$ $[0,\infty]$ [GBB11]
Rectified linear unit $\operatorname{ReLU}(a)=\max(a,0)$ $[0,\infty]$ [GBB11; KSH12]
Leaky ReLU $\max(a,0)+\alpha\min(a,0)$ $[-\infty,\infty]$ [MHN13]
Exponential linear unit $\max(a,0)+\min(\alpha(e^{a}-1),0)$ $[-\infty,\infty]$ [CUH16]
Swish $a\sigma(a)$ $[-\infty,\infty]$ [RZL17]
GELU $a\Phi(a)$ $[-\infty,\infty]$ [HG16]
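The NumPy sketch below contrasts the saturating sigmoid with ReLU to illustrate the point above: backpropagation multiplies one local derivative per layer, and sigmoid derivatives are at most 0.25 (and near zero when the unit saturates), so the product shrinks geometrically with depth. The depth and inputs are illustrative.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def sigmoid_grad(a):
    s = sigmoid(a)
    return s * (1 - s)            # at most 0.25; near 0 when |a| is large

def relu_grad(a):
    return (a > 0).astype(float)  # exactly 1 on the active side

a = np.array([-5.0, 0.0, 5.0])
print(sigmoid_grad(a))            # tiny in the saturated regions
print(relu_grad(a))

# Even in the best case (a = 0) the sigmoid shrinks the gradient by 0.25 per
# layer; an active ReLU unit passes it through unchanged.
depth = 20
print(sigmoid_grad(np.array([0.0])) ** depth,
      relu_grad(np.array([1.0])) ** depth)
```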

ReLU

Non-saturating ReLU

$\text{Softplus}(x)=\log(1+\exp(x))$

$\text{Softplus}'(x)=\sigma(x)$
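A quick numeric check (not from the handout) that the derivative of softplus equals the sigmoid, comparing a central finite difference against $\sigma(x)$:

```python
import numpy as np

def softplus(x):
    return np.log1p(np.exp(x))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-4, 4, 9)
eps = 1e-5
finite_diff = (softplus(x + eps) - softplus(x - eps)) / (2 * eps)
# The finite-difference derivative should closely match sigmoid(x).
print(np.max(np.abs(finite_diff - sigmoid(x))))
```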

Other choices

$\text{maxout}(x)=\max_{k\in[1,K]}(z_k)$
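In a maxout unit each piece is affine, $z_k=\boldsymbol{w}_k^{\top}\boldsymbol{x}+b_k$, and the unit returns the largest of the $K$ pieces. A minimal NumPy sketch with illustrative random weights:

```python
import numpy as np

def maxout(x, W, b):
    # z_k = W[k] @ x + b[k]; the unit outputs the largest of the K pieces.
    z = W @ x + b
    return z.max()

rng = np.random.default_rng(0)
K, d = 3, 2                      # K affine pieces over a 2-d input
W = rng.normal(size=(K, d))
b = rng.normal(size=K)
print(maxout(np.array([1.0, -0.5]), W, b))
```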


Residual connections


Regularization


Early stopping


Weight decay


Dropout


Bayesian neural networks