The term "DNN" actually encompasses a larger family of models, in which we compose differentiable functions into any kind of DAG (directed acyclic graph, 有向无环图), mapping input to output.
Consider the XOR function, whose truth table is:

| $x_1$ | $x_2$ | $y$ |
|---|---|---|
| 0 | 0 | 0 |
| 0 | 1 | 1 |
| 1 | 0 | 1 |
| 1 | 1 | 0 |
It is clear that the data is not linearly separable, so a perceptron cannot represent this mapping.
This problem can be overcome by stacking multiple perceptrons on top of each other, yielding a multilayer perceptron (MLP):
the first hidden unit (AND operation) computes $h_1 = x_1 \wedge x_2$
the second hidden unit (OR operation) computes $h_2 = x_1 \vee x_2$
the third unit computes the output $y = \overline{h_1} \wedge h_2$, which is exactly $\mathrm{XOR}(x_1, x_2)$ (see the sketch below)
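To make this construction concrete, here is a minimal NumPy sketch using hand-chosen weights and a Heaviside step activation; the specific weight and bias values are illustrative assumptions consistent with the AND/OR/NOT decomposition above, not taken from the text.

```python
import numpy as np

def heaviside(a):
    """Step activation: 1 if a >= 0, else 0."""
    return (a >= 0).astype(float)

def xor_mlp(x1, x2):
    """XOR via a two-layer perceptron with hand-chosen weights.

    Hidden unit h1 implements AND, h2 implements OR, and the output
    unit fires only when OR is true but AND is false (i.e. XOR).
    """
    x = np.array([x1, x2], dtype=float)
    # Hidden layer: h1 = AND(x1, x2), h2 = OR(x1, x2)
    W = np.array([[1.0, 1.0],    # weights for h1
                  [1.0, 1.0]])   # weights for h2
    b = np.array([-1.5, -0.5])   # AND needs both inputs on, OR needs one
    h = heaviside(W @ x + b)
    # Output layer: y = (NOT h1) AND h2
    v = np.array([-1.0, 1.0])
    c = -0.5
    return int(heaviside(v @ h + c))

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x1, x2, xor_mlp(x1, x2))   # reproduces the XOR truth table
```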
An MLP can represent any logical function. However, we obviously want to avoid having to specify the weights and biases by hand. In the rest of this chapter, we discuss ways to learn these parameters from data.
Example models
MLPs can be used to perform classification and regression for many kinds of data. We give some examples below. Try it for yourself via: https://playground.tensorflow.org
One example is an MLP with two hidden layers applied to a 2d input vector.
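As a rough illustration (not from the original text), here is a minimal sketch using scikit-learn; the two-moons dataset and the layer sizes are arbitrary stand-ins for playground-style 2d data.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic 2d binary classification data, similar in spirit to the playground demo.
X, y = make_moons(n_samples=500, noise=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# An MLP with two hidden layers (sizes chosen arbitrarily for illustration).
clf = MLPClassifier(hidden_layer_sizes=(8, 8),
                    activation="relu",
                    max_iter=2000,
                    random_state=0)
clf.fit(X_tr, y_tr)
print("test accuracy:", clf.score(X_te, y_te))
```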
McCulloch-Pitts model of the neuron (1943): the neuron fires if a weighted sum of its inputs exceeds a threshold, $y = \mathbb{I}\left(\sum_i w_i x_i \ge b\right)$.
We can combine multiple such neurons together to make artificial neural networks (ANNs).
ANNs differ from biological brains in many ways, including the following:
- Most ANNs use backpropagation to modify the strength of their connections, whereas real brains do not use backprop.
- Most ANNs are strictly feedforward, but real brains have many feedback connections.
- Most ANNs use simplified neurons consisting of a weighted sum passed through a nonlinearity, but real biological neurons have complex dendritic tree structures (see Figure 13.8), with complex spatio-temporal dynamics.
- Most ANNs are much smaller in size and number of connections than biological brains.
- Most ANNs are designed to model a single function, whereas biological brains are very complex systems that implement many different kinds of functions and behaviors.
Consider a composition $o = f(\mathbf{x}) = f_L(f_{L-1}(\cdots f_1(\mathbf{x})))$. The intermediate steps needed to compute $o$ are $\mathbf{x}_2 = f_1(\mathbf{x}_1)$ (with $\mathbf{x}_1 = \mathbf{x}$), $\mathbf{x}_3 = f_2(\mathbf{x}_2)$, up to $o = f_L(\mathbf{x}_L)$.
We can compute the Jacobian $\mathbf{J}_f(\mathbf{x}) = \partial o / \partial \mathbf{x}$ using the chain rule:
$$\frac{\partial o}{\partial \mathbf{x}} = \frac{\partial o}{\partial \mathbf{x}_L}\,\frac{\partial \mathbf{x}_L}{\partial \mathbf{x}_{L-1}}\cdots\frac{\partial \mathbf{x}_2}{\partial \mathbf{x}_1}.$$
More generally, for a computation graph (DAG), the gradient of the output wrt an intermediate node $x_j$ is
$$\frac{\partial o}{\partial x_j} = \sum_{k \in \mathrm{children}(j)} \frac{\partial o}{\partial x_k}\,\frac{\partial x_k}{\partial x_j},$$
where the sum is over all children $k$ of node $j$; reverse-mode differentiation (backpropagation) evaluates these products from the output backwards.
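To illustrate how these products are chained in practice, here is a minimal reverse-mode sketch for a three-layer composition, checked against finite differences. The layer shapes, tanh nonlinearities, and function names are illustrative assumptions, not a specific library API.

```python
import numpy as np

# Forward pass stores the intermediates; backward pass applies the local
# Jacobians (as vector-Jacobian products) in reverse order.
rng = np.random.default_rng(0)
W1, W2, w3 = rng.normal(size=(4, 3)), rng.normal(size=(4, 4)), rng.normal(size=4)

def forward(x):
    z1 = np.tanh(W1 @ x)        # x2 = f1(x)
    z2 = np.tanh(W2 @ z1)       # x3 = f2(x2)
    o = w3 @ z2                 # o  = f3(x3), a scalar
    return z1, z2, o

def grad(x):
    z1, z2, o = forward(x)
    g2 = w3 * (1 - z2 ** 2)             # do/d(pre-activation of layer 2)
    g1 = (W2.T @ g2) * (1 - z1 ** 2)    # chain rule back through layer 2
    return W1.T @ g1                    # chain rule back through layer 1

x = rng.normal(size=3)
# Sanity check against central finite differences.
eps = 1e-6
fd = np.array([(forward(x + eps * e)[2] - forward(x - eps * e)[2]) / (2 * eps)
               for e in np.eye(3)])
print(np.allclose(grad(x), fd, atol=1e-5))   # True
```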
It is important to tune the learning rate (step size) to ensure convergence to a good solution (see Section 8.4.3).
- Vanishing gradient problem: when training very deep models, the gradients become very small, so learning in the early layers is very slow.
- Exploding gradient problem: when training very deep models, the gradients become very large, making training unstable.
To see why, consider the gradient of the loss wrt a node at layer $l$: by the chain rule, $\frac{\partial \mathcal{L}}{\partial \mathbf{z}_l} = \frac{\partial \mathcal{L}}{\partial \mathbf{z}_{l+1}}\,\frac{\partial \mathbf{z}_{l+1}}{\partial \mathbf{z}_l}$, so the gradient is multiplied by one layer Jacobian $\mathbf{J}_l = \partial \mathbf{z}_{l+1} / \partial \mathbf{z}_l$ per layer.
The reason the gradient can vanish (or explode) is that its magnitude therefore shrinks or grows exponentially with depth, depending on the spectrum of these Jacobians; saturating activation functions such as the sigmoid, whose derivative is near zero over most of its range, make the vanishing problem worse. This motivates the alternative activation functions listed below.
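The effect is easy to see numerically: repeatedly multiplying a gradient vector by a random per-layer Jacobian makes its norm shrink or blow up exponentially with depth, depending on how the Jacobians are scaled. The scales and sizes below are illustrative choices, not from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
depth, width = 50, 64

for scale in [0.9, 1.1]:
    g = rng.normal(size=width)                 # gradient at the top layer
    for _ in range(depth):
        # Random layer Jacobian; entries have std scale/sqrt(width), so each
        # backward step multiplies the gradient norm by roughly `scale`.
        J = scale * rng.normal(size=(width, width)) / np.sqrt(width)
        g = J.T @ g                            # one backward step
    print(f"scale={scale}: ||g|| = {np.linalg.norm(g):.3e}")
```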
| Name | Definition | Range | Reference |
|---|---|---|---|
| Sigmoid | $\sigma(a) = \frac{1}{1 + e^{-a}}$ | $[0, 1]$ | |
| Hyperbolic tangent | $\tanh(a) = 2\sigma(2a) - 1$ | $[-1, 1]$ | |
| Softplus | $\sigma_+(a) = \log(1 + e^{a})$ | $[0, \infty)$ | [GBB11] |
| Rectified linear unit | $\mathrm{ReLU}(a) = \max(a, 0)$ | $[0, \infty)$ | [GBB11; KSH12] |
| Leaky ReLU | $\max(a, 0) + \alpha \min(a, 0)$ | $(-\infty, \infty)$ | [MHN13] |
| Exponential linear unit | $\max(a, 0) + \min(\alpha(e^{a} - 1), 0)$ | $(-\infty, \infty)$ | [CUH16] |
| Swish | $a\,\sigma(a)$ | $(-\infty, \infty)$ | [RZL17] |
| GELU | $a\,\Phi(a)$ | $(-\infty, \infty)$ | [HG16] |
The leaky ReLU, $\max(a, 0) + \alpha \min(a, 0)$, uses a small positive slope $\alpha$ (e.g. 0.01) for negative inputs, so the gradient never becomes exactly zero; if $\alpha$ is learned, it is called the parametric ReLU.
The exponential linear unit (ELU), $\max(a, 0) + \min(\alpha(e^{a} - 1), 0)$, replaces the negative part with an exponential that saturates at $-\alpha$, which has the advantage of being smooth.
SELU (self-normalizing ELU): a slight variant of ELU, $\mathrm{SELU}(a) = \lambda\,\mathrm{ELU}(a)$, where the constants $\lambda$ and $\alpha$ are chosen so that layer activations remain approximately standardized during training.
The softplus function [Dugas et al., 2001] can be seen as a smooth approximation of the ReLU: $\zeta(a) = \log(1 + e^{a})$.
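For reference, here are NumPy implementations of the activation functions in the table and discussion above. This is only a sketch; the $\alpha$ defaults are illustrative, and SciPy is assumed to be available for the exact GELU via the error function.

```python
import numpy as np
from scipy.special import erf

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def softplus(a):
    return np.logaddexp(0.0, a)          # log(1 + e^a), numerically stable

def relu(a):
    return np.maximum(a, 0.0)

def leaky_relu(a, alpha=0.01):
    return np.maximum(a, 0.0) + alpha * np.minimum(a, 0.0)

def elu(a, alpha=1.0):
    return np.maximum(a, 0.0) + np.minimum(alpha * (np.exp(a) - 1.0), 0.0)

def swish(a):
    return a * sigmoid(a)                # also known as SiLU

def gelu(a):
    return a * 0.5 * (1.0 + erf(a / np.sqrt(2.0)))   # a * Phi(a)

a = np.linspace(-3, 3, 7)
for f in (sigmoid, np.tanh, softplus, relu, leaky_relu, elu, swish, gelu):
    print(f"{f.__name__:>11}: {np.round(f(a), 2)}")
```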
The maxout unit [Goodfellow et al., 2013] is also a piecewise-linear function. Whereas the activation functions above take as input the scalar net input of a single neuron, a maxout unit takes as input the entire output vector of the previous layer, $\mathbf{x} = [x_1, \ldots, x_D]$.
The maxout nonlinearity is defined as
$$\mathrm{maxout}(\mathbf{x}) = \max_{k \in [1, K]} z_k, \qquad z_k = \mathbf{w}_k^{\top}\mathbf{x} + b_k,$$
where each maxout unit has $K$ weight vectors $\mathbf{w}_k$ and biases $b_k$.
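A minimal sketch of a single maxout unit follows; the weights, biases, and sizes are illustrative random values.

```python
import numpy as np

rng = np.random.default_rng(0)
D, K = 4, 3                     # input dimension and number of affine pieces
W = rng.normal(size=(K, D))
b = rng.normal(size=K)

def maxout(x):
    z = W @ x + b               # z_k = w_k^T x + b_k for k = 1..K
    return np.max(z)            # piecewise-linear and convex in x

x = rng.normal(size=D)
print(maxout(x))
```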
Dropout: during training, each hidden unit is randomly dropped (set to zero) with some probability $p$, which prevents units from co-adapting and acts as a regularizer.
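A minimal sketch of "inverted" dropout applied to a vector of activations; the inverted-scaling variant (rescaling by $1/(1-p)$ during training so no rescaling is needed at test time) is an assumption about a common implementation, not taken from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(h, p=0.5, training=True):
    """Zero each activation with probability p and rescale the survivors."""
    if not training or p == 0.0:
        return h                            # identity at test time
    mask = rng.random(h.shape) >= p         # keep each unit with prob. 1 - p
    return h * mask / (1.0 - p)

h = np.ones(10)
print(dropout(h, p=0.5))                    # roughly half zeroed, rest = 2.0
print(dropout(h, p=0.5, training=False))    # unchanged
```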
## We will
- Algorithms
- xxx
- Coding with Python
- Financial applications