Combining Satellite Imagery and Machine Learning to Predict Poverty.
Do Weather‑Induced Moods Affect the Processing of Earnings News?
Good Day Sunshine: Stock Returns and the Weather.
Predicting Anomaly Performance with Politics, the Weather, Sunspots and the Stars.
Empirical Asset Pricing via Machine Learning.
Forest through the Trees: Building Cross‑Sections of Stock Returns.
Finance and Big Data
Source: Nagel, S. Machine Learning in Asset Pricing. Princeton University Press, 2021.
“Asset prices reflect collective market expectations of future cash flows.”
— SSRN (2023)
“There are two cultures in the use of statistical modeling:
one assumes the data are generated by a given stochastic model;
the other uses algorithmic models to fit the data.”
— Leo Breiman (2001)
| Dimension | Data‑Modeling Culture | Algorithmic‑Modeling Culture |
|---|---|---|
| Goal | Explain relationships, estimate parameters | Predict outcomes, improve accuracy |
| Approach | Assume a stochastic model | Learn the mapping directly from data |
| Typical Methods | Linear/Logistic Regression, ARIMA, Probit | Decision Trees, Random Forest, Neural Networks |
| Evaluation | Significance tests, confidence intervals | Out‑of‑sample error, cross‑validation performance |
| Assumptions | Small samples, low dimension, strong structure | Large samples, high dimension, weak assumptions |
| Focus | Parameter interpretability (“Why”) | Predictive power (“What”) |
Traditional statistics relies too heavily on model assumptions, often poorly representing real‑world complexity.
The algorithmic culture gives priority to empirical prediction, not theoretical form.
For real applications (speech, vision, finance), algorithmic models outperform classical ones.
Statistical education and research should emphasize prediction, validation, and model performance rather than significance alone.
Prediction and causality are complements, not rivals.
Pattern discovery at different levels of supervision.
classification problem
We must avoid false confidence bred from an ignorance of the probabilistic nature of the world, from a desire to see black and white where we should rightly see gray.
--- Immanuel Kant, as paraphrased by Maria Konnikova
likelihood function:
log likelihood function
generalization gap:
test risk
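The four concepts above have standard definitions; a sketch in the usual notation (i.i.d. observations x_1, …, x_n, parameter θ, loss ℓ, predictor f):

```latex
% Likelihood of theta given the observed sample
L(\theta) = \prod_{i=1}^{n} p(x_i \mid \theta)

% Log-likelihood (sums are numerically stabler than products)
\ell(\theta) = \log L(\theta) = \sum_{i=1}^{n} \log p(x_i \mid \theta)

% Test risk: expected loss on unseen data from the same population
R(f) = \mathbb{E}_{(x,y)}\left[ \ell\bigl(f(x), y\bigr) \right]

% Generalization gap: test risk minus empirical (training) risk
\operatorname{gap}(f) = R(f) - \frac{1}{n} \sum_{i=1}^{n} \ell\bigl(f(x_i), y_i\bigr)
```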
When we’re learning to see, nobody’s telling us what the right answers are — we just look. Every so often, your mother says “that’s a dog”, but that’s very little information. You’d be lucky if you got a few bits of information — even one bit per second — that way. The brain’s visual system has 10^14 neural connections. And you only live for 10^9 seconds. So it’s no use learning one bit per second. You need more like 10^5 bits per second. And there’s only one place you can get that much information: from the input itself.
--- Geoffrey Hinton, 1996
Assume that each observed high-dimensional output is generated by a small number of unobserved latent factors plus idiosyncratic noise.
factor analysis (FA)
principal components analysis (PCA):
nonlinear models: neural networks
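The latent-factor idea can be illustrated with PCA; a minimal sketch on simulated data (dimensions 1000 × 20 with 3 latent factors are illustrative choices, not from the source):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Simulate the latent-factor story: 1000 observations of a 20-dimensional
# output driven by only 3 latent factors plus idiosyncratic noise.
n, p, k = 1000, 20, 3
factors = rng.normal(size=(n, k))      # unobserved latent factors
loadings = rng.normal(size=(k, p))     # factor loadings
X = factors @ loadings + 0.1 * rng.normal(size=(n, p))

# PCA recovers a k-dimensional representation: the first 3 principal
# components should absorb almost all of the variance.
pca = PCA(n_components=k).fit(X)
print(pca.explained_variance_ratio_.sum())
```

Factor analysis adds an explicit noise model per dimension; PCA is the simpler variance-maximizing special case shown here.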
The MDP is the sequence of random variables (X_0, A_0, X_1, A_1, ...) describing states and actions over time.
A control (decision rule) selects an action based on the current state; a policy is a sequence of such rules.
Types of MDP problems:
Research questions:
Suppose there is an investor with given initial capital. At the beginning of each of N periods she decides how to split her wealth between consumption and investment in a financial market. Which policy maximizes her expected utility?
Imagine a company which tries to find the optimal level of cash over a finite number of periods: holding too much cash forgoes returns, while holding too little forces costly adjustments. Which cash-management policy minimizes expected total costs?
Consider a small investor who acts on a given financial market. Her aim is to choose, among all portfolios which yield at least a certain expected return (benchmark) after N periods, the one with minimal variance — a dynamic mean–variance problem.
Imagine we consider the risk reserve of an insurance company which earns some premia on the one hand but has to pay out possible claims on the other hand. At the beginning of each period the insurer can decide upon paying a dividend. A dividend can only be paid when the risk reserve at that time point is positive. Once the risk reserve becomes negative, we say that the company is ruined and has to stop its business. Which dividend pay-out policy maximizes the expected discounted dividends until ruin?
Suppose we have two slot machines with unknown success probabilities. At each stage we choose one machine to play; which strategy maximizes the expected total number of successes (a bandit problem)?
In order to find the fair price of an American option and its optimal exercise time, one has to solve an optimal stopping problem. In contrast to a European option the buyer of an American option can choose to exercise any time up to and including the expiration time. Such an optimal stopping problem can be solved in the framework of Markov Decision Processes.
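The optimal stopping problem above can be sketched by backward induction in a binomial tree for an American put; the parameters (S0, K, r, u, d, n) below are illustrative assumptions, not from the source:

```python
import math

# Backward induction for an American put in a binomial tree: at each node
# the holder compares immediate exercise with the discounted continuation
# value -- the optimal stopping comparison described above.
def american_put(S0=100.0, K=100.0, r=0.02, u=1.1, d=0.9, n=50):
    q = (math.exp(r / n) - d) / (u - d)   # risk-neutral up probability
    disc = math.exp(-r / n)               # one-period discount factor
    # Terminal payoffs max(K - S, 0) at each of the n+1 final nodes.
    values = [max(K - S0 * u**j * d**(n - j), 0.0) for j in range(n + 1)]
    for step in range(n - 1, -1, -1):
        values = [
            max(
                K - S0 * u**j * d**(step - j),                      # exercise now
                disc * (q * values[j + 1] + (1 - q) * values[j]),   # continue
            )
            for j in range(step + 1)
        ]
    return values[0]

print(round(american_put(), 2))
```

The exercise-vs-continue comparison at every node is exactly the Markov Decision Process structure: the stopping decision is the control.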
| Category | Example Algorithms | Finance Applications |
|---|---|---|
| Regression | OLS, LASSO, Partial Least Squares (PLS), Principal Component Regression (PCR), Random Forest | Asset pricing, risk models, factor selection |
| Classification | Logistic Regression, SVM, XGBoost, MLP | Credit scoring, fraud detection |
| Clustering | k-means, hierarchical clustering | Market segmentation, regime detection |
| Dimensionality Reduction | PCA, Autoencoder, PLS, PCR | Factor extraction, latent risk modeling |
| Category | Example Algorithms | Finance Applications |
|---|---|---|
| Advanced Causal / Structural ML | Double Machine Learning (DML) | Treatment effects, policy evaluation, causal inference in finance |
| Representation Learning (Deep) | CNN, RNN, Transformer, GNN | Text & news analytics, time-series forecasting, ESG image analysis |
| Generative Models | GAN, Variational Autoencoder | Scenario simulation, synthetic financial data |
| Reinforcement Learning (RL) | Q-learning, PPO | Portfolio optimization, trading strategies |
source: CFA Curriculum
"Machine learning extracts return‑predictive structure from the
high‑dimensional space of firm characteristics."
Background:
Empirical asset pricing traditionally relies on a few linear factors (e.g., Fama–French 3F/5F).
Yet the literature has discovered hundreds of potential predictors → the “factor zoo.”
Challenges:
Core question:
Can machine learning methods find stable and economically meaningful prediction patterns
in the cross‑section of expected stock returns?
Data
Machine Learning Models
1. Regularized linear models (LASSO, Ridge, Elastic Net)
2. Tree‑based models (Random Forest, Gradient Boosting)
3. Neural networks (Feedforward MLP)
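The first model family can be sketched on synthetic data; a minimal comparison of the regularized linear models, assuming a made-up "factor zoo" design (100 candidate predictors, 5 true signals) that stands in for the paper's actual dataset:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge, ElasticNet
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)

# Synthetic stand-in for the paper's setting: 100 candidate predictors
# ("factor zoo"), only 5 carrying true signal, low signal-to-noise ratio.
n, p = 2000, 100
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:5] = 0.3
y = X @ beta + rng.normal(size=n)

# Chronological-style split: fit on the first half, score on the second.
X_tr, X_te, y_tr, y_te = X[:1000], X[1000:], y[:1000], y[1000:]

results = {}
for name, model in [
    ("LASSO", Lasso(alpha=0.02)),
    ("Ridge", Ridge(alpha=10.0)),
    ("ElasticNet", ElasticNet(alpha=0.02, l1_ratio=0.5)),
]:
    model.fit(X_tr, y_tr)
    results[name] = r2_score(y_te, model.predict(X_te))
    print(name, round(results[name], 3))
```

The penalties keep the many uninformative predictors from contaminating the out-of-sample fit, which is the point of regularization in the high-dimensional cross-section.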
Estimation Design
| Model | Out‑of‑Sample R² | Description |
|---|---|---|
| OLS | ≈ 0 % | Traditional linear benchmark |
| LASSO / Ridge | 0.5 – 0.8 % | Regularized linear structure |
| Elastic Net / NN | ≈ 1.2 % | Best predictive models |
| RF / GBRT | ≈ 1.0 % | Nonlinear tree ensemble |
Portfolio Implications
ML models automatically rediscover and combine known signals
Capture nonlinear interactions and time‑varying relations often ignored by linear models.
The learned prediction function is consistent with no‑arbitrage pricing.
Regularization = statistical discipline; ML = economic flexibility.
Bottom Line:
ML techniques deliver higher predictive accuracy, robust SDF estimates, and new ways to understand risk premia.
Gu, S., Kelly, B., and Xiu, D. (2020).
Empirical Asset Pricing via Machine Learning.
Review of Financial Studies, 33(5): 2223 – 2273.
“The goal is to let a learning agent autonomously adjust portfolio weights to
maximize risk‑adjusted returns through interaction with the market environment.”
Classical portfolio optimization (Markowitz 1952):
One‑shot mean–variance trade‑off given expected returns and covariances.
→ Static optimization; requires ex ante distributional assumptions.
Real investment environments:
Reinforcement Learning (RL):
Learns an optimal policy through trial and error to maximize cumulative reward.
| Element | Financial Interpretation |
|---|---|
| Environment (state s_t) | Market features at time t |
| Agent’s action a_t | Portfolio rebalancing vector (weights or trades) |
| Reward r_t | Risk‑adjusted return (e.g., Sharpe ratio or log‑utility gain) |
| Policy π | Mapping from observed state to allocation decision |
| Goal | Maximize expected discounted cumulative reward |
Training loop (Agent ↔ Environment):
1. Observe market state → 2. Select action (rebalance) → 3. Receive reward → 4. Update policy.
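The four-step loop can be sketched with tabular Q-learning on a toy environment; the two-regime, two-action market below is a hypothetical illustration, far simpler than the deep RL models the section discusses:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy environment: two regimes (0 = calm, 1 = volatile) and two actions
# (0 = hold cash, 1 = hold the risky asset). The risky asset has positive
# expected reward in the calm regime and negative in the volatile one,
# so the optimal policy is regime-dependent.
def step(state, action):
    if action == 1:
        reward = (0.05 if state == 0 else -0.05) + 0.02 * rng.normal()
    else:
        reward = 0.0
    next_state = state if rng.random() < 0.9 else 1 - state  # sticky regimes
    return reward, next_state

Q = np.zeros((2, 2))                 # Q[state, action] value table
alpha, gamma, eps = 0.1, 0.95, 0.1   # learning rate, discount, exploration
state = 0
for _ in range(20000):
    # 1. observe state -> 2. select action (epsilon-greedy) ...
    action = rng.integers(2) if rng.random() < eps else int(Q[state].argmax())
    # ... 3. receive reward -> 4. update the policy (the Q-table).
    reward, next_state = step(state, action)
    Q[state, action] += alpha * (reward + gamma * Q[next_state].max() - Q[state, action])
    state = next_state

print(Q[0].argmax(), Q[1].argmax())  # greedy action in each regime
```

After training, the greedy policy invests in the calm regime and holds cash in the volatile one, which is the regime-dependent allocation the environment rewards.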
Model Type: Deep Reinforcement Learning (DRL)
Key Findings
→ Evidence that a learning agent can continuously optimize asset allocation under dynamic uncertainty.
The loop illustrates continuous interaction:
agent learns optimal allocations as markets evolve.
“Benchmarking modern classification algorithms in credit scoring reveals that
machine learning models consistently outperform traditional statistical approaches.”
Research question:
Which machine learning methods perform best for credit scoring tasks?
| Model | Mean AUC (Avg Across Datasets) | Comments |
|---|---|---|
| Logistic Regression | ≈ 0.75 | Benchmark industry standard |
| Decision Tree | 0.77 – 0.80 | Simple nonlinear model |
| Random Forest | 0.82 – 0.85 | Robust ensemble performance |
| Gradient Boosting (GBM) | 0.84 – 0.88 | Top ranked accuracy and consistency |
| Neural Network | 0.80 – 0.86 | Competitive but data‑sensitive |
| SVM | ≈ 0.82 | Needs careful parameter tuning |
Result Summary
The pipeline illustrates how ML models estimate PD and support risk‑based lending decisions.
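A minimal sketch of such a benchmark, assuming synthetic data (`make_classification` is a stand-in for the study's real credit datasets, and only two of the benchmarked models are shown):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic "credit" data: 20 applicant features, roughly 10% default rate.
X, y = make_classification(
    n_samples=5000, n_features=20, n_informative=8,
    weights=[0.9, 0.1], random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

aucs = {}
for name, clf in [
    ("Logistic Regression", LogisticRegression(max_iter=1000)),
    ("Gradient Boosting", GradientBoostingClassifier(random_state=0)),
]:
    clf.fit(X_tr, y_tr)
    pd_hat = clf.predict_proba(X_te)[:, 1]   # estimated probability of default
    aucs[name] = roc_auc_score(y_te, pd_hat)
    print(name, round(aucs[name], 3))
```

AUC is the ranking metric used throughout the table above: it measures how well the estimated PD separates defaulters from non-defaulters, independent of any score cutoff.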
“The economic content of business news articles provides a real‑time,
data‑driven measure of macroeconomic conditions and expectations.”
Background
Macroeconomic indicators (GDP growth, industrial production) are published infrequently and with lags.
→ Hard to observe the business cycle in real time.
Idea
Financial and business news articles contain massive information
about current economic activity and sentiment.
Research Questions
1. Can we extract quantitative signals from high‑dimensional news text data?
2. Do these “news factors” predict macro and asset‑market fluctuations?
Each factor corresponds to an underlying latent economic narrative that summarizes contemporaneous news coverage.
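Such latent narratives are typically estimated with a topic model (the paper uses LDA); a minimal sketch on a toy corpus, where the four documents and the two-topic choice are illustrative only:

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy business-news corpus; real applications use hundreds of thousands
# of articles and dozens to hundreds of topics.
docs = [
    "factory output rises as industrial production rebounds",
    "manufacturing activity and industrial output expand",
    "bank credit tightens as loan defaults climb",
    "lenders raise standards amid rising credit losses",
]

counts = CountVectorizer(stop_words="english").fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)

# Each document's topic mixture: these weights, aggregated over time,
# play the role of the "news factors" described above.
theta = lda.transform(counts)
print(theta.round(2))
```

Aggregating each topic's weight across all articles published in a month yields a daily- or monthly-frequency factor series that can be compared with macro data.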
| Category | Finding |
|---|---|
| Macro tracking | The first news factor correlates ≈ 0.8 with industrial production growth and NBER recession dates. |
| Forecasting power | News factors predict future GDP growth, unemployment changes, and credit spreads 1–12 months ahead. |
| Asset pricing link | Expected returns and valuation ratios co‑move with news‑inferred cycle states. |
| Real‑time advantage | News data update daily → timely signals vs lagged macro statistics. |
Example:
Sharp declines in the “economic activity” factor precede official recessions by ≈ 3 months.
Text as Macro Sensor
News coverage aggregates dispersed information from firms, markets, and policymakers.
→ Functions as a real‑time proxy for latent business conditions.
Complement to Traditional Signals
Combines qualitative insight with quantitative estimation.
Yields improved forecasting and nowcasting.
Implications
The loop shows continuous feedback between media information, real economy, and financial markets.
Bybee, L., Kelly, B., Manela, A., & Xiu, D. (2024).
Business News and Business Cycles. Journal of Finance, 79(4), 1919 – 1978.
“A computer program is said to learn from experience E, with respect to some tasks T, and performance measure P, if its performance at T, as measured by P, improves with experience E.”
— Tom Mitchell (1997)
| Method | Concept | Financial Meaning |
|---|---|---|
| LASSO | L1 penalty | Feature selection: factor sparsity |
| Ridge | L2 penalty | Robust estimation: prevents overfitting |
| Dropout | Randomly deactivated hidden units | Model generalization |
Two common guiding principles and two methods are used to reduce overfitting:
K-fold cross-validation
Leave-one-out cross-validation:
((150, 4), (150,))
((90, 4), (90,))
((60, 4), (60,))
array([0.96..., 1. , 0.96..., 0.96..., 1. ])
0.98 accuracy with a standard deviation of 0.02
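The shapes and scores shown above match scikit-learn's iris cross-validation example; a sketch that reproduces them:

```python
from sklearn import datasets, svm
from sklearn.model_selection import cross_val_score, train_test_split

# Iris: 150 samples, 4 features.
X, y = datasets.load_iris(return_X_y=True)
print(X.shape, y.shape)

# Hold out 40% of the data: 90 training and 60 test samples.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0)
print(X_train.shape, X_test.shape)

# 5-fold cross-validation of a linear SVM on the full dataset.
clf = svm.SVC(kernel="linear", C=1, random_state=42)
scores = cross_val_score(clf, X, y, cv=5)
print(scores)
print(f"{scores.mean():.2f} accuracy with a standard deviation of {scores.std():.2f}")
```

Reporting the mean and standard deviation across folds, rather than a single split, is the point of K-fold cross-validation.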
TimeSeriesSplit(gap=0, max_train_size=None, n_splits=3, test_size=None)
[0 1 2] [3]
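The split above comes from scikit-learn's `TimeSeriesSplit`; a sketch with six time-ordered observations (training windows grow, test sets always lie in the future, so there is no look-ahead):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(6).reshape(-1, 1)   # six time-ordered observations

tscv = TimeSeriesSplit(n_splits=3)
print(tscv)   # repr as shown above (exact form depends on sklearn version)

# Each training set ends strictly before its test set.
for train_idx, test_idx in tscv.split(X):
    print(train_idx, test_idx)
```

This is the appropriate cross-validation scheme for financial return prediction, where shuffled K-fold splits would leak future information into training.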
| Type | Metrics | Meaning |
|---|---|---|
| Regression | MSE, MAE, R² | Fit quality |
| Classification | Accuracy, Precision, Recall, F1 | Discrimination |
| Ranking | AUC, ROC | Signal quality |
| Portfolio | Sharpe ratio | Risk-adjusted performance |
All models are wrong, but some models are useful.
--- George Box
Shallow → Deep → Reinforcement → BigData → Finance → Causal → LLMs & Agents
| Level | Training Objective |
|---|---|
| Technical | Master common ML algorithms and their evaluation |
| Financial | Build models grounded in financial data and settings |
| Research | Understand the economic meaning behind the methods |
| Lecture | Theme | Focus |
|---|---|---|
| 1 | Introduction | Background and framework |
| 2 | Shallow Learning | Classical statistical learning |
| 3 | Deep Learning | Representation learning |
| 4 | Reinforcement Learning | Decision modeling |
| 5 | Big Data | Text and image analysis |
| 6 | Asset Pricing & ML | Empirical and quantitative applications |
| 7 | Causal Inference | Causality and interpretation |
| 8 | LLMs & Agents | Agents and the AI frontier |
### [Breiman’s Two Cultures](https://projecteuclid.org/journals/statistical-science/volume-16/issue-3/Statistical-Modeling--The-Two-Cultures-with-comments-and-a/10.1214/ss/1009213726.full): Data Modeling vs. Algorithmic Modeling

| Dimension | Econometrics | Machine Learning |
|---|---|---|
| Focus | Parameter inference | Prediction accuracy |
| Model | Structural assumptions | Algorithmic optimization |
| Evaluation | Significance tests | Out-of-sample performance |
| Data | Small, structured | Large, high-dimensional |