L01 Introduction to Machine Learning

What is Machine Learning

  • Definition of ML

A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.

    --Tom Mitchell

  • The probabilistic approach
    • treat all unknown quantities as random variables
    • it is the optimal approach to decision making under uncertainty

    Almost all of machine learning can be viewed in probabilistic terms, making probabilistic thinking fundamental. It is, of course, not the only view. But it is through this view that we can connect what we do in machine learning to every other computational science, whether that be in stochastic optimisation, control theory, operations research, econometrics, information theory, statistical physics or bio-statistics. For this reason alone, mastery of probabilistic thinking is essential.

    ---Shakir Mohamed, DeepMind

  • Machine Learning vs. Statistical Approaches
    • Statistical approaches rely on foundational assumptions and explicit models of structure, such as observed samples that are assumed to be drawn from a specified underlying probability distribution.
    • Machine learning seeks to extract knowledge from large amounts of data with no such restrictions -- “find the pattern, apply the pattern.”
  • Supervised Learning vs. Unsupervised Learning
    • Supervised learning involves ML algorithms that infer patterns between a set of inputs (the X's) and the desired output (Y) with a labeled data set.
    • Unsupervised learning is machine learning that does not make use of labeled data. In unsupervised learning, inputs (X's) are used for analysis without any target (Y) being supplied. The algorithm seeks to discover structure within the data themselves. Two important types of problems in unsupervised learning are dimension reduction and clustering.
  • Deep Learning and Reinforcement Learning
    • In deep learning, sophisticated algorithms address highly complex tasks, such as image classification, face recognition, speech recognition, and natural language processing.
    • In reinforcement learning, a computer learns from interacting with itself (or with data generated by the same algorithm).
    • Neural networks (NNs, also called artificial neural networks, or ANNs) include highly flexible ML algorithms that have been successfully applied to a variety of tasks characterized by non-linearities and interactions among features.
    • Besides being commonly used for classification and regression, neural networks are also the foundation for deep learning and reinforcement learning, which can be either supervised or unsupervised.

Supervised Learning

  • Definition
    • The task is to learn a mapping f from inputs x ∈ X to outputs y ∈ Y
    • The inputs x are also called the features, covariates, or predictors
    • The outputs y are also called the label, target, or response
    • The experience is the training set

Classification

classification problem

  • the output space is a set of C unordered and mutually exclusive labels known as classes, Y = {1, 2, ..., C}.
  • The problem is also called pattern recognition.
  • binary classification: just two classes, often denoted by y ∈ {0, 1}

Example: classifying Iris flowers

Iris Flowers Classification

Image Classification

Exploratory data analysis

  • exploratory data analysis: see if there are any obvious patterns
  • tabular data with a small number of features: pair plot
  • higher-dimensional data: apply dimension reduction first, then visualize the data in 2d or 3d (see the sketch below)
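As a quick illustration, the following sketch (assuming seaborn, matplotlib, and scikit-learn are available; the plotting choices are illustrative) draws a pair plot of the Iris features and a 2d PCA projection:

# Sketch: exploratory data analysis on the Iris data
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

iris = load_iris(as_frame=True)
df = iris.frame  # four features plus the integer 'target' column

# pair plot: all pairwise scatter plots, colored by class (works well for few features)
sns.pairplot(df, hue="target", diag_kind="hist")

# for higher-dimensional data: reduce to 2d first, then visualize
X2 = PCA(n_components=2).fit_transform(iris.data)
plt.figure()
plt.scatter(X2[:, 0], X2[:, 1], c=iris.target)
plt.xlabel("PC 1"); plt.ylabel("PC 2")
plt.show()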

Learning a classifier

  • decision rule via a one-dimensional (1d) decision boundary

  • decision tree: a more sophisticated decision rule involving a 2d decision surface

Empirical risk minimization

  • misclassification rate on the training set: (1/N) Σ_n I(f(x_n; θ) ≠ y_n)

  • loss function: ℓ(y, ŷ), e.g. the 0-1 loss ℓ01(y, ŷ) = I(y ≠ ŷ)
  • empirical risk: the average loss of the predictor on the training set, L(θ) = (1/N) Σ_n ℓ(y_n, f(x_n; θ))

  • model fitting / training via empirical risk minimization: choose θ to minimize the empirical risk L(θ) (see the sketch below)
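As a minimal illustration (with hypothetical labels), the empirical risk under the 0-1 loss is just the misclassification rate:

# Sketch: empirical risk with the 0-1 loss (misclassification rate) on toy labels
import numpy as np

y_true = np.array([0, 1, 1, 0, 1])   # hypothetical training labels
y_pred = np.array([0, 1, 0, 0, 1])   # hypothetical predictions f(x_n; theta)

zero_one_loss = (y_pred != y_true).astype(float)  # 0-1 loss per example
empirical_risk = zero_one_loss.mean()             # average loss on the training set
print(empirical_risk)                             # 0.2 misclassification rate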

Uncertainty

[We must avoid] false confidence bred from an ignorance of the probabilistic nature of the world, from a desire to see black and white where we should rightly see gray.

--- Immanuel Kant, as paraphrased by Maria Konnikova

  • Two types of uncertainties
    • epistemic uncertainty or model uncertainty: due to
      lack of knowledge of the input-output mapping
    • aleatoric uncertainty or data uncertainty: due to intrinsic (irreducible) stochasticity in the mapping
  • We can capture our uncertainty using the following conditional probability distribution:

Maximum likelihood estimation

  • likelihood function: p(D | θ) = ∏_n p(y_n | x_n, θ)

  • log likelihood function: ℓ(θ) = Σ_n log p(y_n | x_n, θ)

  • Negative Log Likelihood (NLL): the average negative log probability of the training set, NLL(θ) = −(1/N) Σ_n log p(y_n | x_n, θ)

  • the maximum likelihood estimate (MLE): θ_MLE = argmax_θ ℓ(θ) = argmin_θ NLL(θ) (see the sketch below)
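For a concrete toy case, the sketch below computes the NLL of a Bernoulli coin model on hypothetical flips and recovers the MLE by a simple grid search; the closed-form MLE is the sample mean:

# Sketch: maximum likelihood for a Bernoulli model via the NLL
import numpy as np

data = np.array([1, 0, 1, 1, 0, 1, 1, 1])  # hypothetical coin flips

def nll(theta, y):
    # average negative log probability of the training set
    return -np.mean(y * np.log(theta) + (1 - y) * np.log(1 - theta))

thetas = np.linspace(0.01, 0.99, 99)
theta_mle = thetas[np.argmin([nll(t, data) for t in thetas])]
print(theta_mle, data.mean())  # grid-search MLE vs. the closed-form sample mean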

Regression

regression problem

  • the output space: the real line, y ∈ R.
  • loss function: quadratic loss, or L2 loss: ℓ2(y, ŷ) = (y − ŷ)²

  • mean squared error or MSE: the average quadratic loss over the training set, (1/N) Σ_n (y_n − f(x_n; θ))²

  • An Example
    • Uncertainty: Gaussian / Normal

  • the conditional distribution: p(y | x, θ) = N(y | f(x; θ), σ²)

  • NLL (see the sketch below)
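A small numerical check (with hypothetical values): under a Gaussian noise model with fixed variance, the NLL differs from the MSE only by constants, so both are minimized by the same predictor.

# Sketch: MSE and Gaussian NLL on hypothetical predictions
import numpy as np

y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.1, 1.9, 3.2])
sigma = 1.0  # assumed fixed noise scale

mse = np.mean((y_true - y_pred) ** 2)
nll = 0.5 * np.log(2 * np.pi * sigma**2) + mse / (2 * sigma**2)
print(mse, nll)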

(simple) Linear regression

  • functional form of model:

  • parameters:
  • least square estimator:
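A minimal sketch of the least squares estimator on synthetic data (the true intercept 2 and slope 3 are assumptions of the example):

# Sketch: least squares estimate for simple linear regression
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 50)
y = 2.0 + 3.0 * x + 0.1 * rng.standard_normal(50)

X = np.column_stack([np.ones_like(x), x])      # design matrix with a bias column
w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)  # least squares estimator
print(w_hat)                                   # approximately [2, 3]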

Polynomial regression

  • functional form of model:

  • feature preprocessing, or feature engineering
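A sketch of polynomial regression as feature engineering followed by linear regression (the degree and the synthetic data are illustrative):

# Sketch: polynomial features + linear regression with scikit-learn
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 100).reshape(-1, 1)
y = 1.0 - 2.0 * x.ravel() + 0.5 * x.ravel()**2 + 0.1 * rng.standard_normal(100)

model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(x, y)
print(model.score(x, y))  # R^2 on the training data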

Deep neural networks

  • deep neural networks (DNN): a stack of L nested functions:

  • the function at layer :
    • the final layer:

    • the learned feature extractor:

Overfitting and generalization

  • Underfitting means the model does not capture the relationships in the data.
  • Overfitting means the model begins to incorporate noise coming from quirks or spurious correlations
    • it mistakes randomness for patterns and relationships
    • memorized the data, rather than learned from it
    • high noise levels in the data and too much complexity in the model
    • complexity refers to the number of features, terms, or branches in the model and to whether the model is linear or non-linear (non-linear is more complex).
  • empirical risk

  • population risk

  • generalization gap:

  • test risk

Evaluating ML Algorithm Performance: Errors & Overfitting

  • Data scientists decompose the total out-of-sample error into three sources:
    • Bias error, or the degree to which a model fits the training data. Algorithms with erroneous assumptions produce high bias with poor approximation, causing underfitting and high in-sample error.
    • Variance error, or how much the model's results change in response to new data from validation and test samples. Unstable models pick up noise and produce high variance, causing overfitting
    • Base error due to randomness in the data.

learning curve

  • A learning curve plots the accuracy rate (= 1 – error rate) in the validation or test samples (i.e., out-of-sample) against the amount of data in the training sample
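One possible way to produce such a curve with scikit-learn (the estimator and training sizes are illustrative choices):

# Sketch: learning curve (out-of-sample accuracy vs. training-set size)
import numpy as np
from sklearn.model_selection import learning_curve
from sklearn.datasets import load_iris
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
sizes, train_scores, val_scores = learning_curve(
    SVC(kernel='linear', C=1), X, y, cv=5,
    train_sizes=np.linspace(0.2, 1.0, 5))
print(sizes)
print(val_scores.mean(axis=1))  # cross-validated accuracy at each training size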

fitting curve

  • A fitting curve shows in-sample and out-of-sample error rates on the y-axis plotted against model complexity on the x-axis

Evaluating ML Algorithm Performance: Preventing Overfitting in Supervised ML

  • Two common guiding principles and two methods are used to reduce overfitting:

    • preventing the algorithm from getting too complex during selection and training (regularization)
    • proper data sampling achieved by using cross-validation
  • K-fold cross-validation

    • data (excluding the test sample and fresh data) are shuffled randomly and then divided into k equal sub-samples
    • k − 1 samples are used as training samples and one sample is used as a validation sample
    • k is typically set at 5 or 10
    • This process is repeated k times. The average of the k validation errors (mean CV error) is taken as a reasonable estimate of the model's out-of-sample error
  • Leave-one-out cross-validation: the special case k = N, where each observation serves once as the validation sample

Cross-validation using scikit-learn: train_test_split

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import datasets
from sklearn import svm
X, y = datasets.load_iris(return_X_y=True)
X.shape, y.shape

((150, 4), (150,))

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0)
X_train.shape, y_train.shape

((90, 4), (90,))

X_test.shape, y_test.shape

((60, 4), (60,))

clf = svm.SVC(kernel='linear', C=1).fit(X_train, y_train)
clf.score(X_test, y_test)

Cross-validation using scikit-learn: computing cross-validated metrics

from sklearn.model_selection import cross_val_score
clf = svm.SVC(kernel='linear', C=1, random_state=42)
scores = cross_val_score(clf, X, y, cv=5)
scores

array([0.96..., 1. , 0.96..., 0.96..., 1. ])

print("%0.2f accuracy with a standard deviation 
    of %0.2f" % (scores.mean(), scores.std()))

0.98 accuracy with a standard deviation of 0.02

from sklearn import metrics
scores = cross_val_score(
    clf, X, y, cv=5, scoring='f1_macro')
scores

array([0.96..., 1. ..., 0.96..., 0.96..., 1. ])

Time Series Split

from sklearn.model_selection import TimeSeriesSplit
X = np.array([[1, 2], [3, 4], [1, 2], [3, 4],
              [1, 2], [3, 4]])
y = np.array([1, 2, 3, 4, 5, 6])
tscv = TimeSeriesSplit(n_splits=3)
print(tscv)

TimeSeriesSplit(gap=0, max_train_size=None, n_splits=3, test_size=None)

for train, test in tscv.split(X):
    print("%s %s" % (train, test))

[0 1 2] [3]
[0 1 2 3] [4]
[0 1 2 3 4] [5]

No free lunch theorem

All models are wrong, but some models are useful.

--- George Box

  • No free lunch theorem: There is no single best model that works optimally for all kinds of problems
  • pick a suitable model
    • based on domain knowledge
    • trial and error
      • cross-validation
      • Bayesian model selection techniques

Unsupervised learning

  • unsupervised learning: “inputs” x without any corresponding “outputs” y.
  • the task: fitting an unconditional model of the form p(x)

When we’re learning to see, nobody’s telling us what the right answers are — we just look. Every so often, your mother says “that’s a dog”, but that’s very little information. You’d be lucky if you got a few bits of information — even one bit per second — that way. The brain’s visual system has 10^14 neural connections. And you only live for 10^9 seconds. So it’s no use learning one bit per second. You need more like 10^5 bits per second. And there’s only one place you can get that much information: from the input itself.

--- Geoffrey Hinton, 1996

Clustering

  • finding clusters in data: partition the input into regions that contain “similar” points.
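A minimal clustering sketch on the Iris inputs, ignoring the labels (the number of clusters is an assumption):

# Sketch: k-means clustering without using the labels
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans

X, _ = load_iris(return_X_y=True)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print(labels[:10])  # cluster index assigned to the first 10 flowers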

Discovering latent “factors of variation”

  • Assume that each observed high-dimensional output x was generated by a set of hidden or unobserved low-dimensional latent factors z.

  • factor analysis (FA)

  • principal components analysis (PCA):

  • nonlinear models: neural networks
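For example, PCA in scikit-learn recovers low-dimensional latent factors for the Iris inputs (the choice of two components is illustrative):

# Sketch: low-dimensional latent factors via PCA
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)
pca = PCA(n_components=2).fit(X)
Z = pca.transform(X)                  # latent factors z_n for each observation
print(pca.explained_variance_ratio_)  # share of variance captured by each factor
print(Z.shape)                        # (150, 2)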

Reinforcement learning

  • online / dynamic version of machine learning
    • the system or agent has to learn how to interact
      with its environment
    • RL is closely related to the Markov Decision Process (MDP)

Markov Decision Process

The MDP is the sequence of random variables (X_t) which describes the stochastic evolution of the system states. Of course the distribution of (X_t) depends on the chosen actions.

  • The state space of the system: a state is the information available to the controller at time t. Given this information, an action has to be selected.

  • The action space: given a specific state at time t, only a certain subclass of actions may be admissible.

  • A stochastic transition kernel, which gives the probability that the next state at time t + 1 lies in a set B if the current state is x and action a is taken at time t.

  • The (discounted) one-stage reward of the system at time t if the current state is x and action a is taken.

  • The (discounted) terminal reward of the system at the end of the planning horizon.

A control is a sequence of decision rules, where each decision rule determines for each possible state the next action at time t. Such a sequence is called a policy or strategy. Formally, the Markov Decision Problem is to maximize the expected total (discounted) reward over all policies.

  • Types of MDP problems:

    • finite horizon (N < ∞) vs. infinite horizon (N = ∞)
    • complete state observation vs. partial state observation
    • problems with constraints vs. without constraints
    • total (discounted) cost criterion vs. average cost criterion
  • Research questions:

    • Does an optimal policy exist?
    • Does it have a particular form?
    • Can an optimal policy be computed efficiently?
    • Is it possible to derive properties of the optimal value function analytically?

Applications of MDP: Consumption Problem

Suppose there is an investor with given initial capital. At the beginning of each of N periods she can decide how much of the capital she consumes and how much she invests in a risky asset. The amount she consumes is evaluated by a utility function, as is the terminal wealth. The remaining capital is invested in a risky asset, where we assume that the investor is small and thus not able to influence the asset price, and that the asset is liquid. How should she consume and invest in order to maximize the sum of her expected discounted utilities?

Applications of MDP: Cash Balance or Inventory Problem

Imagine a company which tries to find the optimal level of cash over a finite number of periods. We assume that there is a random stochastic change in the cash reserve each period (due to withdrawals or earnings). Since the firm does not earn interest from the cash position, there are holding costs for the cash reserve if it is positive, but also interest (cost) in case it is negative. The cash reserve can be increased or decreased by the management at each decision epoch, which implies transfer costs. What is the optimal cash balance policy?

Applications of MDP: Mean-Variance Problem

Consider a small investor who acts on a given financial market. Her aim is to choose, among all portfolios which yield at least a certain expected return (benchmark) after N periods, the one with the smallest portfolio variance. What is the optimal investment strategy?

Applications of MDP: Dividend Problem in Risk Theory

Imagine we consider the risk reserve of an insurance company which earns premia on the one hand but has to pay out possible claims on the other hand. At the beginning of each period the insurer can decide whether to pay a dividend. A dividend can only be paid when the risk reserve at that time point is positive. Once the risk reserve becomes negative, we say that the company is ruined and has to stop its business. Which dividend pay-out policy maximizes the expected discounted dividends until ruin?

Applications of MDP: Bandit Problem

Suppose we have two slot machines with unknown success probabilities. At each stage we have to choose one of the arms. We receive one Euro if the arm wins; otherwise no cash flow appears. How should we play in order to maximize our expected total reward over N trials?

Applications of MDP: Pricing of American Options

In order to find the fair price of an American option and its optimal exercise time, one has to solve an optimal stopping problem. In contrast to a European option the buyer of an American option can choose to exercise any time up to and including the expiration time. Such an optimal stopping problem can be solved in the framework of Markov Decision Processes.

Discussion

Statistical inference vs. Supervised machine learning

Property | Statistical inference | Supervised machine learning
Goal | Causal models with explanatory power | Prediction performance, often with limited explanatory power
Data | The data is generated by a model | The data generation process is unknown
Framework | Probabilistic | Algorithmic and probabilistic
Expressibility | Typically linear | Non-linear
Model selection | Based on information criteria | Numerical optimization
Scalability | Limited to lower-dimensional data | Scales to high-dimensional input data
Robustness | Prone to over-fitting | Designed for out-of-sample performance
Diagnostics | Extensive | Limited

Financial Econometrics and Machine Learning

Dixon, M. F., Halperin, I., & Bilokon, P. (2020). Machine Learning in Finance. Springer International Publishing.

ML Algorithm Types

Selecting ML Algorithms

Useful Python Libraries

Math Libraries

Statistical Libraries

ML and Deep Learning

Decision Theory

Bayesian decision theory

Basics

The decision maker, or agent, has a set of possible actions, A, to choose from. Each of these actions has costs and benefits, which depend on the underlying state of nature h ∈ H. We can encode this information into a loss function ℓ(h, a), which specifies the loss we incur if we take action a ∈ A when the state of nature is h.

  • The posterior expected loss or risk for each possible action:

  • The optimal policy (also called the Bayes estimator) specifies what action to take for each possible observation so as to minimize the risk:

  • Let be the utility function, then the optimal policy is as follows (maximum expected utility principle):

Classification problems

We use Bayesian decision theory to decide the optimal class label to predict given an observed input .

Zero-one loss

Suppose the states of nature correspond to class labels, so H = Y = {1, ..., C}. Furthermore, suppose the actions also correspond to class labels, so A = Y. In this setting, a very commonly used loss function is the zero-one loss:

  • the zero-one loss ℓ01(y*, ŷ) = I(y* ≠ ŷ): 0 if the predicted label equals the true label, 1 otherwise
  • the posterior expected loss

  • the optimal policy

It corresponds to the mode of the posterior distribution, also known as the maximum a posteriori or MAP estimate

ROC curves

Class confusion matrices

For any fixed threshold τ, we consider the following decision rule:

  • The empirical number of false positives (FP) that arise from using this policy on a set of N labeled examples:

  • The empirical number of false negatives (FN)

  • The empirical number of true positives (TP)

  • The empirical number of true negatives (TN)

  • class confusion matrix C: C_ij is the number of times an item with true class label i was (mis)classified as having label j.
  • the true positive rate (TPR), also known as the sensitivity, recall or hit rate

  • the false positive rate (FPR), also called the false alarm rate, or the type I error rate

  • We can now plot the TPR vs FPR as an implicit function of τ. This is called a receiver operating characteristic (ROC) curve.
Summarizing ROC curves as a scalar
  • Area Under the Curve (AUC)
    • higher AUC scores are better
    • the maximum is 1
  • Equal Error Rate (EER) or cross-over rate
    • defined as the value which satisfies FPR = FNR
    • lower EER scores are better
    • the minimum is 0
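A sketch of computing an ROC curve and its AUC with scikit-learn (the data set and classifier are illustrative choices):

# Sketch: ROC curve and AUC for a binary classifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.metrics import roc_curve, roc_auc_score

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = make_pipeline(StandardScaler(), LogisticRegression()).fit(X_tr, y_tr)

scores = clf.predict_proba(X_te)[:, 1]          # p(y = 1 | x)
fpr, tpr, thresholds = roc_curve(y_te, scores)  # TPR vs FPR as the threshold varies
print(roc_auc_score(y_te, scores))              # area under the ROC curve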

Precision-recall curves

Computing precision and recall
  • the precision: TP / (TP + FP)

  • the recall: TP / (TP + FN)

  • If ŷ is the predicted label and y is the true label, we can estimate precision and recall using their empirical counts on a labeled data set

Summarizing PR curves as a scalar
  • the precision at K score: quote the precision for a fixed recall level, such as the precision of the first K = 10 entities recalled
  • the interpolated precision: compute the area under the PR curve
  • the average precision: the average of the interpolated precision, which is equal to the area under the interpolated PR curve
  • the mean average precision or mAP: the mean of the AP over a set of different PR curves
F-scores
  • Definition

  • A special case: the F1 score (β = 1), the harmonic mean of precision and recall (see the sketch below)
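A minimal check of precision, recall, and F1 on hypothetical labels:

# Sketch: precision, recall, and F1 from predicted vs. true labels
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

print(precision_score(y_true, y_pred))  # TP / (TP + FP)
print(recall_score(y_true, y_pred))     # TP / (TP + FN)
print(f1_score(y_true, y_pred))         # harmonic mean of precision and recall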

Regression problems

We assume the set of actions and states are both equal to the real line, A = H = R.

L2 loss
  • the L2 loss, also called squared error or quadratic loss: ℓ2(h, a) = (h − a)²

  • the risk function

  • the minimum mean squared error estimate or MMSE estimate: the posterior mean

  • The L2 loss penalizes deviations from the truth quadratically, and thus is sensitive to outliers.
L1 loss
  • the L1 loss, also called absolute error loss: ℓ1(h, a) = |h − a|

  • the optimal estimate is the posterior median

Huber loss

Let r = h − a. The Huber loss is r²/2 if |r| ≤ δ, and δ|r| − δ²/2 otherwise (see the sketch below).
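The following sketch compares the three losses on a few residual values (the threshold delta = 1 is an assumption):

# Sketch: L2, L1, and Huber losses as functions of the residual r = h - a
import numpy as np

def huber(r, delta=1.0):
    # quadratic near zero, linear in the tails -> less sensitive to outliers than L2
    return np.where(np.abs(r) <= delta,
                    0.5 * r**2,
                    delta * np.abs(r) - 0.5 * delta**2)

r = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(0.5 * r**2)           # (scaled) L2 loss
print(np.abs(r))            # L1 loss
print(huber(r, delta=1.0))  # Huber loss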

Probabilistic prediction problems

We assume the true “state of nature” is a distribution, h = p, the action is another distribution, a = q, and we want to pick q to minimize ℓ(p, q) for a given p.

KL, cross-entropy and log-loss

A common form of loss function for comparing two distributions is the Kullback-Leibler divergence, or KL divergence, defined as KL(p ‖ q) = Σ_y p(y) log( p(y) / q(y) ).

  • The KL divergence is the extra number of bits we need to use to compress the data due to using the incorrect distribution q.

  • minimizing the KL is equivalent to minimizing the cross-entropy (see the sketch below)
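A small numerical sketch with two hypothetical discrete distributions, verifying that KL(p||q) equals the cross entropy minus the entropy:

# Sketch: entropy, cross-entropy, and KL divergence for discrete distributions
import numpy as np

p = np.array([0.5, 0.3, 0.2])   # "true" distribution (hypothetical)
q = np.array([0.4, 0.4, 0.2])   # model distribution (hypothetical)

entropy = -np.sum(p * np.log(p))
cross_entropy = -np.sum(p * np.log(q))
kl = np.sum(p * np.log(p / q))
print(kl, cross_entropy - entropy)  # equal, and >= 0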

Proper scoring rules
  • proper scoring rule: a loss function that satisfies the following

  • Brier score:

Choosing the "right" model

Bayesian hypothesis testing

  • two hypotheses / models
    • null hypothesis: M0
    • alternative hypothesis: M1
    • 0-1 loss
    • Bayes factor: the ratio of marginal likelihoods of the two models, B_{1,0} = p(D | M1) / p(D | M0)

Bayes factor B_{1,0} | Interpretation
B_{1,0} > 100 | Decisive evidence for M1
10 < B_{1,0} < 100 | Strong evidence for M1
3 < B_{1,0} < 10 | Moderate evidence for M1
1 < B_{1,0} < 3 | Weak evidence for M1
1/3 < B_{1,0} < 1 | Weak evidence for M0
1/10 < B_{1,0} < 1/3 | Moderate evidence for M0
1/100 < B_{1,0} < 1/10 | Strong evidence for M0
B_{1,0} < 1/100 | Decisive evidence for M0
Example: Testing if a coin is fair
  • test if a coin is fair
    • M0: fair, i.e. θ = 0.5
    • M1: unfair, i.e. θ ≠ 0.5
  • marginal likelihood under M0
    • p(D | M0) = (1/2)^N

  • marginal likelihood under M1 with a Beta prior:

Bayesian model selection

  • Model selection: pick the most appropriate model from a set of more than two models

  • posterior

  • uniform prior

  • marginal likelihood

Example: polynomial regression

Occam's razor

  • Occam’s razor: the simpler the better (for the same marginal likelihood)

  • Bayesian Occam’s razor effect: the marginal likelihood will prefer the simpler model.

Connection between cross validation and marginal likelihood

Marginal likelihood is closely related to the leave-one-out cross-validation (LOO-CV) estimate.

where

Suppose we use a plugin approximation to the above distribution to get

Then we get

Information criteria

  • the marginal likelihood can be difficult to compute
  • the result can be quite sensitive to the choice of prior
The Bayesian information criterion (BIC)
  • The Bayesian information criterion or BIC can be thought of as a simple approximation to the log marginal likelihood.

where H is the Hessian of the negative log joint evaluated at the MAP estimate .

  • Assuming a uniform prior, we can drop the prior term and replace the MAP estimate with the MLE

  • the BIC score

  • the BIC loss

Akaike information criterion

  • This penalizes complex models less heavily than BIC, since the regularization term is independent of the sample size N.
  • This estimator can be derived from a frequentist perspective.
Minimum description length (MDL)

Frequentist decision theory

Computing the risk of an estimator

We define the frequentist risk of an estimator π, given an unknown state of nature, to be the expected loss when applying that estimator to data sampled from the likelihood function:

Example: estimate a Gaussian mean
  • assume the data is sampled from
  • quadratic loss:
  • risk function: MSE
  • 5 different estimators for computing
    • sample mean:
    • sample median:
    • a fixed value:
    • the posterior mean under a Gaussian prior
      • weak case:
      • strong case:

  • MSE:
    • sample mean:

  • sample median:

  • fixed value:

  • posterior mean:

Bayes risk
  • Bayes risk (integrated risk):

  • Bayes estimator

Maximum risk
  • Maximum risk

Empirical risk minimization

Empirical risk

  • Population risk

  • Empirical risk
    • Empirical distribution

  • Empirical risk

  • Empirical risk minimization (ERM)

Approximation error vs estimation error
  • Notations
    • : function that achieves the minimal possible population risk
    • : the best function in the hypothesis space
    • : the prediction function that minimizes the empirical risk in the hypothesis space
  • Error decomposition: approximation error () vs. estimation error or generalization error ()

  • generalization gap

Regularized risk
  • regularized empirical risk

  • the regularizer measures the complexity of the prediction function
  • its strength λ is known as a hyperparameter
  • parametric function form

  • log loss, negative log prior regularizer

Minimizing this is equivalent to MAP estimation.

Structural risk

  • how to minimize empirical risk?

  • Minimizing the regularized empirical risk does not work for choosing the hyperparameters (optimism of the training error)

  • structural risk minimization: if we knew the regularized population risk, instead of the regularized empirical risk, we could use it to pick a model of the right complexity (e.g., value of λ)

  • two methods to estimate the population risk for a given model (value of λ):

    • cross-validation
    • statistical learning theory

Cross-validation

  • partition the dataset into two
    • training set : to fit/train the model
    • validation set or holdout set: to assess the risk
  • the empirical risk on the dataset

  • validation risk: use the unregularized empirical risk on the validation set as an estimate of the population risk

  • K-folds cross validation (CV):
    • split the training data into K folds
    • for each fold k, we train on all the folds but the k’th
    • test on the k’th
  • cross-validated risk

  • optimal parameters
    • optimal hyperparameters:
    • optimal model parameters:

Information Theory

Entropy

The entropy of a probability distribution can be interpreted as a measure of uncertainty, or lack of predictability, associated with a random variable drawn from a given distribution. We can also use entropy to define the information content of a data source.

Entropy of p(X) | Hard/easy to predict | Information content of data from p(X)
high | hard | high
low | easy | low

Entropy for discrete random variables

The entropy of a discrete random variable X with distribution p over K states is defined by H(X) = − Σ_{k=1}^{K} p(X = k) log p(X = k)

  • When we use log base 2, the units are called bits (short for binary digits).
  • When we use log base e, the units are called nats.
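For instance (a uniform distribution over four states, chosen for illustration):

# Sketch: entropy of a discrete distribution in bits and in nats
import numpy as np

p = np.array([0.25, 0.25, 0.25, 0.25])  # uniform over 4 states
H_bits = -np.sum(p * np.log2(p))        # log base 2 -> bits
H_nats = -np.sum(p * np.log(p))         # natural log -> nats
print(H_bits, H_nats)                   # 2.0 bits, about 1.386 nats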

Cross entropy

The cross entropy between distributions p and q is defined by H(p, q) = − Σ_k p_k log q_k

The cross entropy is the expected number of bits needed to compress some data samples drawn from distribution p using a code based on distribution q. This can be minimized by setting q = p, in which case the expected number of bits of the optimal code is H(p) — this is known as Shannon’s source coding theorem.

Joint entropy

The joint entropy of two random variables X and Y is defined as H(X, Y) = − Σ_{x,y} p(x, y) log p(x, y)

For example, consider choosing an integer n from 1 to 8. Let X = 1 if n is even, and Y = 1 if n is prime:

n | 1 2 3 4 5 6 7 8
X | 0 1 0 1 0 1 0 1
Y | 0 1 1 0 1 0 1 0

The joint distribution is p(X = 0, Y = 0) = 1/8, p(X = 0, Y = 1) = 3/8, p(X = 1, Y = 0) = 3/8, p(X = 1, Y = 1) = 1/8,

so the joint entropy is H(X, Y) = 1.81 bits.

Consider the marginal entropies: H(X) = H(Y) = 1 bit.

We observe that H(X, Y) = 1.81 bits < H(X) + H(Y) = 2 bits.

In general, the inequality H(X, Y) ≤ H(X) + H(Y) is always valid. If X and Y are independent, then H(X, Y) = H(X) + H(Y), so the bound is tight. This makes intuitive sense: when the parts are correlated in some way, it reduces the “degrees of freedom” of the system, and hence reduces the overall entropy.

Another observation is that H(X, Y) ≥ max{H(X), H(Y)} ≥ 0;

this says combining variables together does not make the entropy go down: you cannot reduce uncertainty merely by adding more unknowns to the problem, you need to observe some data.
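The even/prime example above can be checked numerically; the sketch below computes the marginal and joint entropies in bits:

# Sketch: marginal vs. joint entropy for the even/prime example
import numpy as np
from collections import Counter

n = np.arange(1, 9)
X = (n % 2 == 0).astype(int)              # 1 if n is even
Y = np.isin(n, [2, 3, 5, 7]).astype(int)  # 1 if n is prime

def entropy_bits(counts):
    p = np.array(list(counts.values()), dtype=float)
    p /= p.sum()
    return -np.sum(p * np.log2(p))

H_X = entropy_bits(Counter(X))
H_Y = entropy_bits(Counter(Y))
H_XY = entropy_bits(Counter(zip(X, Y)))
print(H_X, H_Y, H_XY)  # 1.0, 1.0, ~1.81: H(X,Y) <= H(X)+H(Y) and H(X,Y) >= max(H(X), H(Y))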

Conditional entropy

The conditional entropy of Y given X is the uncertainty we have in Y after seeing X, averaged over possible values of X:

It is straightforward to verify that H(Y | X) = H(X, Y) − H(X).

Perplexity

The perplexity of a discrete probability distribution p is defined as perplexity(p) = 2^{H(p)}

It is often interpreted as a measure of predictability. Suppose we have an empirical distribution based on data :

We can measure how well predicts by computing

Differential entropy for continuous random variables *

If X is a continuous random variable with pdf p(x), we define the differential entropy as h(X) = − ∫ p(x) log p(x) dx

Differential entropy can be negative since pdf’s can be bigger than 1.

Example: Entropy of a Gaussian

The entropy of a d-dimensional Gaussian N(μ, Σ) is h = (1/2) ln |2πeΣ| = (d/2) ln(2πe) + (1/2) ln |Σ|.

In the 1d case, this becomes h = (1/2) ln(2πeσ²).

Linear Algebra

Matrix calculus

Derivatives

Gradients

  • Partial derivative:

  • Gradient:

  • Gradient evaluated at point :

Directional derivative

The directional derivative measures how much the function changes along a direction v in space. It is defined as follows:

Note that the directional derivative along v is the scalar product of the gradient g and the vector v:

Total derivative

Suppose that some of the arguments to the function depend on each other. Concretely, suppose the function has the form f(t, x(t), y(t)). We define the total derivative of f wrt t as follows:

If we multiply both sides by the differential dt, we get the total differential

This measures how much f changes when we change t, both via the direct effect of t on f and indirectly, via the effects of t on x and y.

Jacobian

Consider a function that maps a vector to another vector, f : R^n → R^m. The Jacobian matrix of this function is an m × n matrix of partial derivatives:

Multiplying Jacobians and vectors

The Jacobian-vector product or JVP is defined to be the operation that corresponds to right-multiplying the Jacobian matrix J by a vector v:

So we can see that we can approximate this numerically using just two calls to f (see the sketch below).
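As a numerical illustration of the two-calls-to-f approximation, here is a minimal finite-difference JVP sketch (the function f is a hypothetical example):

# Sketch: approximating a Jacobian-vector product (JVP) with two calls to f
import numpy as np

def f(x):
    # hypothetical vector-valued function from R^3 to R^2
    return np.array([x[0] * x[1], np.sin(x[2])])

def jvp(f, x, v, eps=1e-6):
    # J(x) @ v  ~  (f(x + eps*v) - f(x)) / eps  (forward finite difference)
    return (f(x + eps * v) - f(x)) / eps

x = np.array([1.0, 2.0, 0.5])
v = np.array([0.1, -0.2, 0.3])
print(jvp(f, x, v))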

The vector-Jacobian product or VJP is defined to be the operation that corresponds to left-multiplying the Jacobian matrix J by a vector u:

Jacobian of a composition

Let h(x) = g(f(x)). By the chain rule of calculus, we have J_h(x) = J_g(f(x)) J_f(x).

Hessian

For a function f : R^n → R that is twice differentiable, we define the Hessian matrix as the (symmetric) n × n matrix of second partial derivatives:

The Hessian is the Jacobian of the gradient.

Gradients of commonly used functions