classification problem
Iris Flowers Classification
Image Classification
We must avoid false confidence bred from an ignorance of the probabilistic nature of the world, from a desire to see black and white where we should rightly see gray.
--- Immanuel Kant, as paraphrased by Maria Konnikova
likelihood function: $p(\mathcal{D} \mid \boldsymbol{\theta})$, the probability of the observed data viewed as a function of the parameters $\boldsymbol{\theta}$
log likelihood function: $\ell(\boldsymbol{\theta}) = \log p(\mathcal{D} \mid \boldsymbol{\theta})$
regression problem
generalization gap: the difference between the population (test) risk and the empirical (training) risk
test risk
Two common guiding principles and two methods are used to reduce overfitting:
K-fold cross-validation: split the data into $K$ folds; for each fold, train on the other $K-1$ folds and evaluate on the held-out fold, then average the $K$ scores.
Leave-one-out cross-validation: the special case $K = N$, in which each fold contains a single example.
For example, loading the Iris data gives a feature matrix `X` of shape (150, 4) and a label vector `y` of shape (150,); a 60/40 train/test split then yields training arrays of shapes (90, 4) and (90,) and test arrays of shapes (60, 4) and (60,).
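A minimal sketch of that split, assuming scikit-learn is installed (the `random_state` value is arbitrary):

```python
# Load Iris and make a 60/40 train/test split (shapes match those quoted above).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
print(X.shape, y.shape)              # (150, 4) (150,)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=0)
print(X_train.shape, y_train.shape)  # (90, 4) (90,)
print(X_test.shape, y_test.shape)    # (60, 4) (60,)
```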
Running 5-fold cross-validation of a linear SVM on Iris gives per-fold accuracies of roughly [0.96..., 1.0, 0.96..., 0.96..., 1.0], i.e. 0.98 accuracy with a standard deviation of 0.02.
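A minimal sketch of how such scores can be produced with `cross_val_score`, assuming scikit-learn; the linear-SVM settings are illustrative:

```python
# 5-fold cross-validation of a linear SVM on Iris; each fold yields one accuracy score.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn import svm

X, y = load_iris(return_X_y=True)
clf = svm.SVC(kernel="linear", C=1)
scores = cross_val_score(clf, X, y, cv=5)
print(scores)
print(f"{scores.mean():.2f} accuracy with a standard deviation of {scores.std():.2f}")
```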
For time-series data, `TimeSeriesSplit(n_splits=3)` (with `gap=0`, `max_train_size=None`, `test_size=None`) produces expanding-window splits in which the test indices always come after the training indices, e.g. train indices [0 1 2] and test index [3] for the first split.
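A minimal sketch of `TimeSeriesSplit` on a toy array of 6 time-ordered samples (the data here are placeholders), showing that each test fold always follows its training fold:

```python
# Expanding-window splits for time-ordered data: train indices always precede test indices.
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(12).reshape(6, 2)        # 6 time-ordered samples, toy features
tscv = TimeSeriesSplit(n_splits=3)
for train_idx, test_idx in tscv.split(X):
    print(train_idx, test_idx)
# [0 1 2] [3]
# [0 1 2 3] [4]
# [0 1 2 3 4] [5]
```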
All models are wrong, but some models are useful.
--- George Box
When we’re learning to see, nobody’s telling us what the right answers are — we just look. Every so often, your mother says “that’s a dog”, but that’s very little information. You’d be lucky if you got a few bits of information — even one bit per second — that way. The brain’s visual system has $10^{14}$ neural connections. And you only live for $10^9$ seconds. So it’s no use learning one bit per second. You need more like $10^5$ bits per second. And there’s only one place you can get that much information: from the input itself.
--- Geoffrey Hinton, 1996
Assume that each observed high-dimensional output $\boldsymbol{x}_n \in \mathbb{R}^D$ is generated from a low-dimensional latent variable $\boldsymbol{z}_n \in \mathbb{R}^L$.
factor analysis (FA): a linear-Gaussian latent variable model, with $p(\boldsymbol{z}_n) = \mathcal{N}(\boldsymbol{z}_n \mid \boldsymbol{\mu}_0, \boldsymbol{\Sigma}_0)$ and $p(\boldsymbol{x}_n \mid \boldsymbol{z}_n) = \mathcal{N}(\boldsymbol{x}_n \mid \mathbf{W}\boldsymbol{z}_n + \boldsymbol{\mu}, \boldsymbol{\Psi})$
principal components analysis (PCA): the special case of FA with isotropic noise $\boldsymbol{\Psi} = \sigma^2\mathbf{I}$ (probabilistic PCA); classical PCA is recovered in the noise-free limit $\sigma^2 \to 0$
nonlinear models: neural networks
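For the linear case, a minimal sketch of dimensionality reduction with scikit-learn's PCA on the Iris features (the choice of 2 components is illustrative):

```python
# Project the 4-d Iris features onto their first 2 principal components.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, y = load_iris(return_X_y=True)
pca = PCA(n_components=2)
Z = pca.fit_transform(X)                # low-dimensional latent representation
print(Z.shape)                          # (150, 2)
print(pca.explained_variance_ratio_)    # variance captured by each component
```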
The MDP is the sequence of random variables $(X_n)$ describing the state of the system over the decision epochs, together with the actions $(A_n)$ chosen at each epoch.
A control (or policy) is a sequence of decision rules that specify, for each decision epoch, which action to take as a function of the current state.
Types of MDP problems:
Research questions:
Suppose there is an investor with given initial capital. At the beginning of each of $N$ periods she can decide how much of her capital she consumes and how much she invests; which consumption-investment strategy maximizes her expected total utility?
Imagine a company which tries to find the optimal level of cash over a finite number of periods.
Consider a small investor who acts on a given financial market. Her aim is to choose, among all portfolios which yield at least a certain expected return (benchmark) after $N$ periods, the one with minimal variance.
Imagine we consider the risk reserve of an insurance company which earns some premia on the one hand but has to pay out possible claims on the other hand. At the beginning of each period the insurer can decide upon paying a dividend. A dividend can only be paid when the risk reserve at that time point is positive. Once the risk reserve becomes negative, we say that the company is ruined and has to stop its business. Which dividend pay-out policy maximizes the expected discounted dividends until ruin?
Suppose we have two slot machines with unknown success probabilities. At each stage we have to choose one of the machines to play; which policy maximizes the expected total reward (a two-armed bandit problem)?
In order to find the fair price of an American option and its optimal exercise time, one has to solve an optimal stopping problem. In contrast to a European option, the buyer of an American option can choose to exercise at any time up to and including the expiration time. Such an optimal stopping problem can be solved in the framework of Markov Decision Processes.
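All of these examples can, in principle, be solved by backward induction on the value function. Below is a minimal sketch for a generic finite-horizon MDP; the state space, action space, transition probabilities and rewards are hypothetical toy values, not taken from any of the problems above:

```python
# Minimal sketch: backward induction (dynamic programming) for a finite-horizon MDP.
# States, actions, transition probabilities and rewards are hypothetical toy values.
import numpy as np

N = 5                                   # horizon (number of decision epochs)
n_states, n_actions = 3, 2
rng = np.random.default_rng(0)

# P[a, s, s'] = transition probability, r[s, a] = one-stage reward (toy data)
P = rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))
r = rng.uniform(0, 1, size=(n_states, n_actions))

V = np.zeros(n_states)                  # terminal value V_N(s) = 0
policy = np.zeros((N, n_states), dtype=int)
for n in reversed(range(N)):            # n = N-1, ..., 0
    Q = np.array([r[:, a] + P[a] @ V for a in range(n_actions)]).T  # (states, actions)
    policy[n] = Q.argmax(axis=1)        # maximizing action at stage n
    V = Q.max(axis=1)                   # value function V_n

print("optimal values at stage 0:", V)
print("optimal policy (stage x state):\n", policy)
```

Each of the problems above plugs its own state dynamics and reward structure into this recursion.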
Property | Statistical inference | Supervised machine learning
---|---|---
Goal | Causal models with explanatory power | Prediction performance, often with limited explanatory power
Data | The data is generated by a model | The data-generation process is unknown
Framework | Probabilistic | Algorithmic and probabilistic
Expressibility | Typically linear | Nonlinear
Model selection | Based on information criteria | Numerical optimization
Scalability | Limited to lower-dimensional data | Scales to high-dimensional input data
Robustness | Prone to overfitting | Designed for out-of-sample performance
Diagnostics | Extensive | Limited
Dixon, M. F., Halperin, I., & Bilokon, P. (2020). Machine Learning in Finance. Springer International Publishing.
The decision maker, or agent, has a set of possible actions, $\mathcal{A}$, to choose from.
We use Bayesian decision theory to decide the optimal class label to predict given an observed input $\boldsymbol{x}$.
Suppose the states of nature correspond to class labels, so $\mathcal{H} = \mathcal{Y} = \{1, \dots, C\}$.
It corresponds to the mode of the posterior distribution, also known as the maximum a posteriori or MAP estimate
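A minimal sketch of this decision rule under 0-1 loss; the posterior probabilities are illustrative, not the output of any fitted model:

```python
# Bayesian decision theory for classification: pick the action with minimal expected loss.
import numpy as np

posterior = np.array([0.1, 0.7, 0.2])   # illustrative p(y = c | x) over 3 classes

# Zero-one loss: L[a, c] = 0 if the predicted label a equals the true class c, else 1.
L = 1.0 - np.eye(3)
expected_loss = L @ posterior            # rho(a | x) = sum_c L[a, c] p(y = c | x)
a_star = expected_loss.argmin()

print(a_star)                            # 1, the posterior mode
print(a_star == posterior.argmax())      # True: 0-1 loss reduces to the MAP estimate
```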
For any fixed threshold
We assume the set of actions and states are both equal to the real line, $\mathcal{A} = \mathcal{H} = \mathbb{R}$.
Let
We assume the true “state of nature” is a distribution, $p^{*}(Y \mid \boldsymbol{x})$.
A common form of loss function for comparing two distributions is the Kullback-Leibler divergence, or KL divergence, which is defined as follows: $D_{\mathbb{KL}}(p \,\|\, q) = \sum_{y} p(y) \log \frac{p(y)}{q(y)}$
The KL divergence is the extra number of bits we need to use to compress the data due to using the incorrect distribution $q$ instead of the true distribution $p$.
Minimizing the KL divergence with respect to $q$ is equivalent to minimizing the cross-entropy $\mathbb{H}(p, q) = -\sum_y p(y)\log q(y)$, since the entropy $\mathbb{H}(p)$ does not depend on $q$.
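A minimal numerical sketch of these quantities for two illustrative discrete distributions:

```python
# Entropy, cross-entropy, and KL divergence for discrete distributions (in bits).
import numpy as np

p = np.array([0.5, 0.25, 0.25])     # "true" distribution (illustrative)
q = np.array([0.6, 0.2, 0.2])       # model distribution (illustrative)

H_p  = -np.sum(p * np.log2(p))      # entropy H(p)
H_pq = -np.sum(p * np.log2(q))      # cross-entropy H(p, q)
kl   = np.sum(p * np.log2(p / q))   # KL(p || q)

print(H_p, H_pq, kl)
print(np.isclose(kl, H_pq - H_p))   # True: KL is the "extra bits" H(p, q) - H(p)
```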
Bayes factor $BF(1,0)$ | Interpretation
---|---
$BF < \frac{1}{100}$ | Decisive evidence for $M_0$
$BF < \frac{1}{10}$ | Strong evidence for $M_0$
$\frac{1}{10} < BF < \frac{1}{3}$ | Moderate evidence for $M_0$
$\frac{1}{3} < BF < 1$ | Weak evidence for $M_0$
$1 < BF < 3$ | Weak evidence for $M_1$
$3 < BF < 10$ | Moderate evidence for $M_1$
$BF > 10$ | Strong evidence for $M_1$
$BF > 100$ | Decisive evidence for $M_1$
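A minimal sketch of computing a Bayes factor for coin-flip data, where $M_0$ is a fair coin and $M_1$ places a uniform Beta(1, 1) prior on the success probability; the data counts are illustrative and SciPy is assumed:

```python
# Bayes factor BF(1,0) for N coin flips with k heads.
# M0: fair coin (theta = 0.5).  M1: theta ~ Beta(1, 1) (uniform prior).
import numpy as np
from scipy.special import betaln

N, k = 10, 9                          # illustrative data: 9 heads in 10 flips

log_m0 = N * np.log(0.5)              # log p(D | M0) = log 0.5^N
log_m1 = betaln(k + 1, N - k + 1)     # log p(D | M1) = log B(k+1, N-k+1) under the uniform prior

BF_10 = np.exp(log_m1 - log_m0)
print(BF_10)                          # values greater than 1 favour M1 (a biased coin)
```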
Occam’s razor: among models with the same marginal likelihood, prefer the simpler one.
Bayesian Occam’s razor effect: the marginal likelihood will prefer the simpler model.
Marginal likelihood is closely related to the leave-one-out cross-validation (LOO-CV) estimate.
where
Suppose we use a plugin approximation to the above distribution to get
Then we get
where H is the Hessian of the negative log joint $\log p(\mathcal{D}, \boldsymbol{\theta})$ evaluated at the MAP estimate $\hat{\boldsymbol{\theta}}_{map}$
We define the frequentist risk of an estimator $\pi$ given an unknown state of nature $\boldsymbol{\theta}$ as the expected loss when applying $\pi$ to data $\boldsymbol{x}$ sampled from the likelihood $p(\boldsymbol{x} \mid \boldsymbol{\theta})$: $R(\boldsymbol{\theta}, \pi) = \mathbb{E}_{p(\boldsymbol{x} \mid \boldsymbol{\theta})}\big[\ell(\boldsymbol{\theta}, \pi(\boldsymbol{x}))\big]$
Minimizing this is equivalent to MAP estimation.
structural risk minimization: if we knew the regularized population risk (instead of the empirical risk), we could use it to pick a model of the right complexity, e.g. the value of the regularization parameter $\lambda$.
There are two methods to estimate the population risk for a given model (value of $\lambda$): cross-validation and statistical learning theory.
The entropy of a probability distribution can be interpreted as a measure of uncertainty, or lack of predictability, associated with a random variable drawn from a given distribution. We can also use entropy to define the information content of a data source.
Entropy of $p(X)$ | Hard or easy to predict $X$ | Information content of samples from $p(X)$
---|---|---
high | hard | high
low | easy | low
The entropy of a discrete random variable $X$ with distribution $p$ over $K$ states is defined by $\mathbb{H}(X) = -\sum_{k=1}^{K} p(X = k)\log_2 p(X = k)$
The cross entropy between distributions $p$ and $q$ is defined by $\mathbb{H}(p, q) = -\sum_{k=1}^{K} p_k \log_2 q_k$
The cross entropy is the expected number of bits needed to compress some data samples drawn from distribution $p$ using a code designed for distribution $q$.
The joint entropy of two random variables $X$ and $Y$ is defined as $\mathbb{H}(X, Y) = -\sum_{x, y} p(x, y)\log_2 p(x, y)$
For example, consider choosing an integer from 1 to 8, $n \in \{1, \dots, 8\}$, and let $X(n) = 1$ if $n$ is even and $Y(n) = 1$ if $n$ is prime:
$n$ | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8
---|---|---|---|---|---|---|---|---
$X$ | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1
$Y$ | 0 | 1 | 1 | 0 | 1 | 0 | 1 | 0
The joint distribution is

$p(X, Y)$ | $Y = 0$ | $Y = 1$
---|---|---
$X = 0$ | 1/8 | 3/8
$X = 1$ | 3/8 | 1/8

so the joint entropy is given by $\mathbb{H}(X, Y) = -\left[\frac{1}{8}\log_2\frac{1}{8} + \frac{3}{8}\log_2\frac{3}{8} + \frac{3}{8}\log_2\frac{3}{8} + \frac{1}{8}\log_2\frac{1}{8}\right] = 1.81$ bits.
Consider the marginal entropies: $X$ and $Y$ are each uniform on $\{0, 1\}$, so $\mathbb{H}(X) = \mathbb{H}(Y) = 1$ bit.
We observe that $\mathbb{H}(X, Y) = 1.81$ bits $< \mathbb{H}(X) + \mathbb{H}(Y) = 2$ bits.
In general, the inequality $\mathbb{H}(X, Y) \le \mathbb{H}(X) + \mathbb{H}(Y)$ is always valid; if $X$ and $Y$ are independent, then $\mathbb{H}(X, Y) = \mathbb{H}(X) + \mathbb{H}(Y)$, so the bound is tight.
Another observation is that $\mathbb{H}(X, Y) \ge \max\{\mathbb{H}(X), \mathbb{H}(Y)\} \ge 0$:
this says that combining variables together does not make the entropy go down; you cannot reduce uncertainty merely by adding more unknowns to the problem, you need to observe some data.
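A minimal sketch that reproduces these numbers directly from the table above:

```python
# Entropies for the example above (n = 1..8, X = "n is even", Y = "n is prime").
import numpy as np
from collections import Counter

n = np.arange(1, 9)
X = (n % 2 == 0).astype(int)                   # 0 1 0 1 0 1 0 1
Y = np.isin(n, [2, 3, 5, 7]).astype(int)       # 0 1 1 0 1 0 1 0

def entropy(counts):
    p = np.array(list(counts.values()), dtype=float)
    p /= p.sum()
    return -np.sum(p * np.log2(p))

print(entropy(Counter(X)))                     # H(X)   = 1.0 bit
print(entropy(Counter(Y)))                     # H(Y)   = 1.0 bit
print(entropy(Counter(zip(X, Y))))             # H(X,Y) ~ 1.81 bits < H(X) + H(Y)
```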
The conditional entropy of $Y$ given $X$ is the uncertainty we have in $Y$ after seeing $X$, averaged over values of $X$: $\mathbb{H}(Y \mid X) = \mathbb{E}_{p(X)}\big[\mathbb{H}(p(Y \mid X))\big]$
It is straightforward to verify that $\mathbb{H}(Y \mid X) = \mathbb{H}(X, Y) - \mathbb{H}(X)$.
The perplexity of a discrete probability distribution $p$ is defined as $\text{perplexity}(p) = 2^{\mathbb{H}(p)}$
It is often interpreted as a measure of predictability. Suppose we have an empirical distribution based on data $\mathcal{D}$: $p_{\mathcal{D}}(x) = \frac{1}{N}\sum_{n=1}^{N}\delta_{x_n}(x)$
We can measure how well a model $q$ predicts the data by computing $\text{perplexity}(p_{\mathcal{D}}, q) = 2^{\mathbb{H}(p_{\mathcal{D}}, q)}$
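A minimal sketch with an illustrative sample and model distribution:

```python
# Perplexity of a model q measured on an empirical distribution p_D.
import numpy as np

data = np.array([0, 0, 1, 2, 0, 1, 0, 2])      # illustrative samples over 3 symbols
counts = np.bincount(data, minlength=3)
p_emp = counts / counts.sum()                  # empirical distribution p_D

q = np.array([0.5, 0.3, 0.2])                  # model distribution (illustrative)

cross_ent = -np.sum(p_emp * np.log2(q))        # H(p_D, q) in bits
print(2 ** cross_ent)                          # perplexity(p_D, q) = 2^H(p_D, q)
```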
If $X$ is a continuous random variable with pdf $p(x)$, the differential entropy is defined as $h(X) = -\int_{\mathcal{X}} p(x)\log p(x)\,dx$
Differential entropy can be negative since pdfs can be bigger than 1.
The entropy of a $d$-dimensional Gaussian is $h(\mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})) = \frac{1}{2}\ln|2\pi e \boldsymbol{\Sigma}| = \frac{d}{2} + \frac{d}{2}\ln(2\pi) + \frac{1}{2}\ln|\boldsymbol{\Sigma}|$
In the 1d case, this becomes $h(\mathcal{N}(\mu, \sigma^2)) = \frac{1}{2}\ln\big[2\pi e \sigma^2\big]$
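A minimal check of the 1d formula against SciPy's built-in entropy for a normal distribution (SciPy assumed; the value of $\sigma$ is arbitrary):

```python
# Check the 1-d Gaussian differential entropy formula (in nats) against scipy.
import numpy as np
from scipy.stats import norm

sigma = 2.0
analytic = 0.5 * np.log(2 * np.pi * np.e * sigma**2)   # (1/2) ln(2 pi e sigma^2)
numeric = norm(loc=0.0, scale=sigma).entropy()          # differential entropy from scipy

print(analytic, numeric)        # the two values agree
```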
The directional derivative measures how much the function $f : \mathbb{R}^n \to \mathbb{R}$ changes along a direction $\boldsymbol{v}$ in space: $D_{\boldsymbol{v}} f(\boldsymbol{x}) = \lim_{h \to 0}\frac{f(\boldsymbol{x} + h\boldsymbol{v}) - f(\boldsymbol{x})}{h}$
Note that the directional derivative along $\boldsymbol{v}$ is the scalar product of the gradient and $\boldsymbol{v}$: $D_{\boldsymbol{v}} f(\boldsymbol{x}) = \nabla f(\boldsymbol{x}) \cdot \boldsymbol{v}$
Suppose that some of the arguments to the function depend on each other. Concretely, suppose the function has the form $f(t, x(t), y(t))$, so that the total derivative with respect to $t$ is $\frac{df}{dt} = \frac{\partial f}{\partial t} + \frac{\partial f}{\partial x}\frac{dx}{dt} + \frac{\partial f}{\partial y}\frac{dy}{dt}$
If we multiply both sides by the differential $dt$, we get the total differential $df = \frac{\partial f}{\partial t}\,dt + \frac{\partial f}{\partial x}\,dx + \frac{\partial f}{\partial y}\,dy$
This measures how much $f$ changes when we change $t$, both via the direct effect of $t$ on $f$ and indirectly via the effect of $t$ on $x$ and $y$.
Consider a function that maps a vector to another vector, $\boldsymbol{f} : \mathbb{R}^n \to \mathbb{R}^m$; its Jacobian is the $m \times n$ matrix of partial derivatives $J_{ij} = \frac{\partial f_i}{\partial x_j}$.
The Jacobian vector product or JVP is defined to be the operation that corresponds to right-multiplying the Jacobian matrix $\mathbf{J} \in \mathbb{R}^{m \times n}$ by a vector $\boldsymbol{v} \in \mathbb{R}^n$, giving $\mathbf{J}(\boldsymbol{x})\boldsymbol{v} = \lim_{\epsilon \to 0}\frac{\boldsymbol{f}(\boldsymbol{x} + \epsilon\boldsymbol{v}) - \boldsymbol{f}(\boldsymbol{x})}{\epsilon}$
So we can see that we can approximate this numerically using just 2 calls to $\boldsymbol{f}$: $\mathbf{J}(\boldsymbol{x})\boldsymbol{v} \approx \frac{\boldsymbol{f}(\boldsymbol{x} + \epsilon\boldsymbol{v}) - \boldsymbol{f}(\boldsymbol{x})}{\epsilon}$ for small $\epsilon$.
The vector Jacobian product or VJP is defined to be the operation that corresponds to left-multiplying the Jacobian matrix $\mathbf{J} \in \mathbb{R}^{m \times n}$ by a vector $\boldsymbol{u} \in \mathbb{R}^m$, giving the row vector $\boldsymbol{u}^{\top}\mathbf{J} \in \mathbb{R}^{1 \times n}$.
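A minimal numerical sketch of a JVP via finite differences and a VJP via the explicit Jacobian, for an illustrative function $\boldsymbol{f} : \mathbb{R}^3 \to \mathbb{R}^2$:

```python
# JVP via finite differences (2 calls to f) and VJP via the explicit Jacobian.
import numpy as np

def f(x):
    return np.array([x[0] * x[1], np.sin(x[2])])

def jacobian(x):                        # analytic 2 x 3 Jacobian of f
    return np.array([[x[1], x[0], 0.0],
                     [0.0, 0.0, np.cos(x[2])]])

x = np.array([1.0, 2.0, 0.5])
v = np.array([0.1, -0.2, 0.3])          # direction in input space
u = np.array([1.0, -1.0])               # vector in output space

eps = 1e-6
jvp_fd = (f(x + eps * v) - f(x)) / eps  # J(x) v with just 2 calls to f
print(jvp_fd, jacobian(x) @ v)          # finite-difference vs analytic JVP

print(u @ jacobian(x))                  # VJP: u^T J(x), a vector in R^3
```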
Let
For a function $f : \mathbb{R}^n \to \mathbb{R}$ that is twice differentiable, we define the Hessian matrix as the (symmetric) $n \times n$ matrix of second partial derivatives, $(\mathbf{H}_f)_{ij} = \frac{\partial^2 f}{\partial x_i \partial x_j}$.
The Hessian is the Jacobian of the gradient.
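A minimal check of this fact by finite-differencing the gradient of an illustrative function $f(x_0, x_1) = x_0^2 x_1 + \sin(x_1)$:

```python
# The Hessian as the Jacobian of the gradient, checked by finite differences.
import numpy as np

def grad(x):                                  # analytic gradient of f
    return np.array([2 * x[0] * x[1],
                     x[0] ** 2 + np.cos(x[1])])

def hessian(x):                               # analytic (symmetric) Hessian of f
    return np.array([[2 * x[1],      2 * x[0]],
                     [2 * x[0], -np.sin(x[1])]])

x = np.array([1.0, 0.5])
eps = 1e-6
# Jacobian of the gradient, column by column, via finite differences
H_fd = np.column_stack([(grad(x + eps * e) - grad(x)) / eps for e in np.eye(2)])

print(H_fd)
print(hessian(x))                             # the two matrices agree up to O(eps)
```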