Combining Satellite Imagery and Machine Learning to Predict Poverty.
Do Weather‑Induced Moods Affect the Processing of Earnings News?
Good Day Sunshine: Stock Returns and the Weather.
Predicting Anomaly Performance with Politics, the Weather, Sunspots and the Stars.
Empirical Asset Pricing via Machine Learning.
Forest through the Trees: Building Cross‑Sections of Stock Returns.
Finance and Big Data
Source: Nagel, S. Machine Learning in Asset Pricing. Princeton University Press, 2021.
“Asset prices reflect collective market expectations of future cash flows.”
— SSRN (2023)
“There are two cultures in the use of statistical modeling:
one assumes the data are generated by a given stochastic model;
the other uses algorithmic models to fit the data.”
— Leo Breiman (2001)
| Dimension | Data‑Modeling Culture | Algorithmic‑Modeling Culture |
|---|---|---|
| Goal | Explain relationships, estimate parameters | Predict outcomes, improve accuracy |
| Approach | Assume a stochastic model | Learn the mapping directly from data |
| Typical Methods | Linear/Logistic Regression, ARIMA, Probit | Decision Trees, Random Forest, Neural Networks |
| Evaluation | Significance tests, confidence intervals | Out‑of‑sample error, cross‑validation performance |
| Assumptions | Small samples, low dimension, strong structure | Large samples, high dimension, weak assumptions |
| Focus | Parameter interpretability (“Why”) | Predictive power (“What”) |
Traditional statistics relies too heavily on model assumptions, often poorly representing real‑world complexity.
The algorithmic culture gives priority to empirical prediction, not theoretical form.
For real applications (speech, vision, finance), algorithmic models outperform classical ones.
Statistical education and research should emphasize prediction, validation, and model performance rather than significance alone.
Prediction and causality are complements, not rivals.
Pattern discovery at different levels of supervision.
classification problem
We must avoid false confidence bred from an ignorance of the probabilistic nature of the world, from a desire to see black and white where we should rightly see gray.
--- Immanuel Kant, as paraphrased by Maria Konnikova
likelihood function:
log likelihood function
generalization gap:
test risk
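The four concepts above have standard definitions; a sketch in the usual notation (i.i.d. observations x_1, …, x_n, parameter θ, loss ℓ, predictor f):

```latex
% Likelihood of theta given the observed sample
L(\theta) = \prod_{i=1}^{n} p(x_i \mid \theta)

% Log-likelihood (sums are numerically stabler than products)
\ell(\theta) = \log L(\theta) = \sum_{i=1}^{n} \log p(x_i \mid \theta)

% Test risk: expected loss on unseen data from the same population
R(f) = \mathbb{E}_{(x,y)}\left[ \ell\bigl(f(x), y\bigr) \right]

% Generalization gap: test risk minus empirical (training) risk
\operatorname{gap}(f) = R(f) - \frac{1}{n} \sum_{i=1}^{n} \ell\bigl(f(x_i), y_i\bigr)
```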
When we’re learning to see, nobody’s telling us what the right answers are — we just look. Every so often, your mother says “that’s a dog”, but that’s very little information. You’d be lucky if you got a few bits of information — even one bit per second — that way. The brain’s visual system has 10^14 neural connections. And you only live for 10^9 seconds. So it’s no use learning one bit per second. You need more like 10^5 bits per second. And there’s only one place you can get that much information: from the input itself.
--- Geoffrey Hinton, 1996
Assume that each observed high-dimensional output is generated by a small number of unobserved latent factors plus idiosyncratic noise.
factor analysis (FA)
principal components analysis (PCA):
nonlinear models: neural networks
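The latent-factor idea can be illustrated with PCA; a minimal sketch on simulated data (dimensions 1000 × 20 with 3 latent factors are illustrative choices, not from the source):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Simulate the latent-factor story: 1000 observations of a 20-dimensional
# output driven by only 3 latent factors plus idiosyncratic noise.
n, p, k = 1000, 20, 3
factors = rng.normal(size=(n, k))      # unobserved latent factors
loadings = rng.normal(size=(k, p))     # factor loadings
X = factors @ loadings + 0.1 * rng.normal(size=(n, p))

# PCA recovers a k-dimensional representation: the first 3 principal
# components should absorb almost all of the variance.
pca = PCA(n_components=k).fit(X)
print(pca.explained_variance_ratio_.sum())
```

Factor analysis adds an explicit noise model per dimension; PCA is the simpler variance-maximizing special case shown here.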
The MDP is the sequence of random variables (X_0, A_0, X_1, A_1, ...) describing states and actions over time.
A control (decision rule) selects an action based on the current state; a policy is a sequence of such rules.
Types of MDP problems:
Research questions:
Suppose there is an investor with given initial capital. At the beginning of each of N periods she decides how to split her wealth between consumption and investment in a financial market. Which policy maximizes her expected utility?
Imagine a company which tries to find the optimal level of cash over a finite number of periods: holding too much cash forgoes returns, while holding too little forces costly adjustments. Which cash-management policy minimizes expected total costs?
Consider a small investor who acts on a given financial market. Her aim is to choose, among all portfolios which yield at least a certain expected return (benchmark) after N periods, the one with minimal variance — a dynamic mean–variance problem.
Imagine we consider the risk reserve of an insurance company which earns some premia on the one hand but has to pay out possible claims on the other hand. At the beginning of each period the insurer can decide upon paying a dividend. A dividend can only be paid when the risk reserve at that time point is positive. Once the risk reserve becomes negative, we say that the company is ruined and has to stop its business. Which dividend pay-out policy maximizes the expected discounted dividends until ruin?
Suppose we have two slot machines with unknown success probabilities. At each stage we choose one machine to play; which strategy maximizes the expected total number of successes (a bandit problem)?
In order to find the fair price of an American option and its optimal exercise time, one has to solve an optimal stopping problem. In contrast to a European option the buyer of an American option can choose to exercise any time up to and including the expiration time. Such an optimal stopping problem can be solved in the framework of Markov Decision Processes.
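The optimal stopping problem above can be sketched by backward induction in a binomial tree for an American put; the parameters (S0, K, r, u, d, n) below are illustrative assumptions, not from the source:

```python
import math

# Backward induction for an American put in a binomial tree: at each node
# the holder compares immediate exercise with the discounted continuation
# value -- the optimal stopping comparison described above.
def american_put(S0=100.0, K=100.0, r=0.02, u=1.1, d=0.9, n=50):
    q = (math.exp(r / n) - d) / (u - d)   # risk-neutral up probability
    disc = math.exp(-r / n)               # one-period discount factor
    # Terminal payoffs max(K - S, 0) at each of the n+1 final nodes.
    values = [max(K - S0 * u**j * d**(n - j), 0.0) for j in range(n + 1)]
    for step in range(n - 1, -1, -1):
        values = [
            max(
                K - S0 * u**j * d**(step - j),                      # exercise now
                disc * (q * values[j + 1] + (1 - q) * values[j]),   # continue
            )
            for j in range(step + 1)
        ]
    return values[0]

print(round(american_put(), 2))
```

The exercise-vs-continue comparison at every node is exactly the Markov Decision Process structure: the stopping decision is the control.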
| Category | Example Algorithms | Finance Applications |
|---|---|---|
| Regression | OLS, LASSO, Partial Least Squares (PLS), Principal Component Regression (PCR), Random Forest | Asset pricing, risk models, factor selection |
| Classification | Logistic Regression, SVM, XGBoost, MLP | Credit scoring, fraud detection |
| Clustering | k-means, hierarchical clustering | Market segmentation, regime detection |
| Dimensionality Reduction | PCA, Autoencoder, PLS, PCR | Factor extraction, latent risk modeling |
| Category | Example Algorithms | Finance Applications |
|---|---|---|
| Advanced Causal / Structural ML | Double Machine Learning (DML) | Treatment effects, policy evaluation, causal inference in finance |
| Representation Learning (Deep) | CNN, RNN, Transformer, GNN | Text & news analytics, time-series forecasting, ESG image analysis |
| Generative Models | GAN, Variational Autoencoder | Scenario simulation, synthetic financial data |
| Reinforcement Learning (RL) | Q-learning, PPO | Portfolio optimization, trading strategies |
source: CFA Curriculum
"Machine learning extracts return‑predictive structure from the
high‑dimensional space of firm characteristics."
Background:
Empirical asset pricing traditionally relies on a few linear factors (e.g., Fama–French 3F/5F).
Yet the literature has discovered hundreds of potential predictors → the “factor zoo.”
Challenges:
Core question:
Can machine learning methods find stable and economically meaningful prediction patterns
in the cross‑section of expected stock returns?
Data
Machine Learning Models
1. Regularized linear models (LASSO, Ridge, Elastic Net)
2. Tree‑based models (Random Forest, Gradient Boosting)
3. Neural networks (Feedforward MLP)
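The first model family can be sketched on synthetic data; a minimal comparison of the regularized linear models, assuming a made-up "factor zoo" design (100 candidate predictors, 5 true signals) that stands in for the paper's actual dataset:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge, ElasticNet
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)

# Synthetic stand-in for the paper's setting: 100 candidate predictors
# ("factor zoo"), only 5 carrying true signal, low signal-to-noise ratio.
n, p = 2000, 100
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:5] = 0.3
y = X @ beta + rng.normal(size=n)

# Chronological-style split: fit on the first half, score on the second.
X_tr, X_te, y_tr, y_te = X[:1000], X[1000:], y[:1000], y[1000:]

results = {}
for name, model in [
    ("LASSO", Lasso(alpha=0.02)),
    ("Ridge", Ridge(alpha=10.0)),
    ("ElasticNet", ElasticNet(alpha=0.02, l1_ratio=0.5)),
]:
    model.fit(X_tr, y_tr)
    results[name] = r2_score(y_te, model.predict(X_te))
    print(name, round(results[name], 3))
```

The penalties keep the many uninformative predictors from contaminating the out-of-sample fit, which is the point of regularization in the high-dimensional cross-section.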
Estimation Design
| Model | Out‑of‑Sample R² | Description |
|---|---|---|
| OLS | ≈ 0 % | Traditional linear benchmark |
| LASSO / Ridge | 0.5 – 0.8 % | Regularized linear structure |
| Elastic Net / NN | ≈ 1.2 % | Best predictive models |
| RF / GBRT | ≈ 1.0 % | Nonlinear tree ensemble |
Portfolio Implications
ML models automatically rediscover and combine known signals
Capture nonlinear interactions and time‑varying relations often ignored by linear models.
The learned prediction function is consistent with no‑arbitrage pricing.
Regularization = statistical discipline; ML = economic flexibility.
Bottom Line:
ML techniques deliver higher predictive accuracy, robust SDF estimates, and new ways to understand risk premia.
Gu, S., Kelly, B., and Xiu, D. (2020).
Empirical Asset Pricing via Machine Learning.
Review of Financial Studies, 33(5): 2223 – 2273.
“The goal is to let a learning agent autonomously adjust portfolio weights to
maximize risk‑adjusted returns through interaction with the market environment.”
Classical portfolio optimization (Markowitz 1952):
One‑shot mean–variance trade‑off given expected returns and covariances.
→ Static optimization; requires ex ante distributional assumptions.
Real investment environments:
Reinforcement Learning (RL):
Learns an optimal policy through trial and error to maximize cumulative reward.
| Element | Financial Interpretation |
|---|---|
| Environment (state s_t) | Market features at time t |
| Agent’s action a_t | Portfolio rebalancing vector (weights or trades) |
| Reward r_t | Risk‑adjusted return (e.g., Sharpe ratio or log‑utility gain) |
| Policy π | Mapping from observed state to allocation decision |
| Goal | Maximize expected discounted cumulative reward |
Training loop (Agent ↔ Environment):
1. Observe market state → 2. Select action (rebalance) → 3. Receive reward → 4. Update policy.
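The four-step loop can be sketched with tabular Q-learning on a toy environment; the two-regime, two-action market below is a hypothetical illustration, far simpler than the deep RL models the section discusses:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy environment: two regimes (0 = calm, 1 = volatile) and two actions
# (0 = hold cash, 1 = hold the risky asset). The risky asset has positive
# expected reward in the calm regime and negative in the volatile one,
# so the optimal policy is regime-dependent.
def step(state, action):
    if action == 1:
        reward = (0.05 if state == 0 else -0.05) + 0.02 * rng.normal()
    else:
        reward = 0.0
    next_state = state if rng.random() < 0.9 else 1 - state  # sticky regimes
    return reward, next_state

Q = np.zeros((2, 2))                 # Q[state, action] value table
alpha, gamma, eps = 0.1, 0.95, 0.1   # learning rate, discount, exploration
state = 0
for _ in range(20000):
    # 1. observe state -> 2. select action (epsilon-greedy) ...
    action = rng.integers(2) if rng.random() < eps else int(Q[state].argmax())
    # ... 3. receive reward -> 4. update the policy (the Q-table).
    reward, next_state = step(state, action)
    Q[state, action] += alpha * (reward + gamma * Q[next_state].max() - Q[state, action])
    state = next_state

print(Q[0].argmax(), Q[1].argmax())  # greedy action in each regime
```

After training, the greedy policy invests in the calm regime and holds cash in the volatile one, which is the regime-dependent allocation the environment rewards.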
Model Type: Deep Reinforcement Learning (DRL)
Key Findings
→ Evidence that a learning agent can continuously optimize asset allocation under dynamic uncertainty.
The loop illustrates continuous interaction:
agent learns optimal allocations as markets evolve.
“Benchmarking modern classification algorithms in credit scoring reveals that
machine learning models consistently outperform traditional statistical approaches.”
Research question:
Which machine learning methods perform best for credit scoring tasks?
| Model | Mean AUC (Avg Across Datasets) | Comments |
|---|---|---|
| Logistic Regression | ≈ 0.75 | Benchmark industry standard |
| Decision Tree | 0.77 – 0.80 | Simple nonlinear model |
| Random Forest | 0.82 – 0.85 | Robust ensemble performance |
| Gradient Boosting (GBM) | 0.84 – 0.88 | Top ranked accuracy and consistency |
| Neural Network | 0.80 – 0.86 | Competitive but data‑sensitive |
| SVM | ≈ 0.82 | Needs careful parameter tuning |
Result Summary
The pipeline illustrates how ML models estimate PD and support risk‑based lending decisions.
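A minimal sketch of such a benchmark, assuming synthetic data (`make_classification` is a stand-in for the study's real credit datasets, and only two of the benchmarked models are shown):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic "credit" data: 20 applicant features, roughly 10% default rate.
X, y = make_classification(
    n_samples=5000, n_features=20, n_informative=8,
    weights=[0.9, 0.1], random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

aucs = {}
for name, clf in [
    ("Logistic Regression", LogisticRegression(max_iter=1000)),
    ("Gradient Boosting", GradientBoostingClassifier(random_state=0)),
]:
    clf.fit(X_tr, y_tr)
    pd_hat = clf.predict_proba(X_te)[:, 1]   # estimated probability of default
    aucs[name] = roc_auc_score(y_te, pd_hat)
    print(name, round(aucs[name], 3))
```

AUC is the ranking metric used throughout the table above: it measures how well the estimated PD separates defaulters from non-defaulters, independent of any score cutoff.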
“The economic content of business news articles provides a real‑time,
data‑driven measure of macroeconomic conditions and expectations.”
Background
Macroeconomic indicators (GDP growth, industrial production) are published infrequently and with lags.
→ Hard to observe the business cycle in real time.
Idea
Financial and business news articles contain massive information
about current economic activity and sentiment.
Research Questions
1. Can we extract quantitative signals from high‑dimensional news text data?
2. Do these “news factors” predict macro and asset‑market fluctuations?
Each factor corresponds to an underlying latent economic narrative that summarizes contemporaneous news coverage.
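Such latent narratives are typically estimated with a topic model (the paper uses LDA); a minimal sketch on a toy corpus, where the four documents and the two-topic choice are illustrative only:

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy business-news corpus; real applications use hundreds of thousands
# of articles and dozens to hundreds of topics.
docs = [
    "factory output rises as industrial production rebounds",
    "manufacturing activity and industrial output expand",
    "bank credit tightens as loan defaults climb",
    "lenders raise standards amid rising credit losses",
]

counts = CountVectorizer(stop_words="english").fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)

# Each document's topic mixture: these weights, aggregated over time,
# play the role of the "news factors" described above.
theta = lda.transform(counts)
print(theta.round(2))
```

Aggregating each topic's weight across all articles published in a month yields a daily- or monthly-frequency factor series that can be compared with macro data.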
| Category | Finding |
|---|---|
| Macro tracking | The first news factor correlates ≈ 0.8 with industrial production growth and NBER recession dates. |
| Forecasting power | News factors predict future GDP growth, unemployment changes, and credit spreads 1–12 months ahead. |
| Asset pricing link | Expected returns and valuation ratios co‑move with news‑inferred cycle states. |
| Real‑time advantage | News data update daily → timely signals vs lagged macro statistics. |
Example:
Sharp declines in the “economic activity” factor precede official recessions by ≈ 3 months.
Text as Macro Sensor
News coverage aggregates dispersed information from firms, markets, and policymakers.
→ Functions as a real‑time proxy for latent business conditions.
Complement to Traditional Signals
Combines qualitative insight with quantitative estimation.
Yields improved forecasting and nowcasting.
Implications
The loop shows continuous feedback between media information, real economy, and financial markets.
Bybee, L., Kelly, B., Manela, A., & Xiu, D. (2024).
Business News and Business Cycles. Journal of Finance, 79(4), 1919 – 1978.
“A computer program is said to learn from experience E, with respect to some tasks T, and performance measure P, if its performance at T, as measured by P, improves with experience E.”
— Tom Mitchell (1997)
| Method | Concept | Financial Meaning |
|---|---|---|
| LASSO | L1 penalty | Feature selection: factor sparsity |
| Ridge | L2 penalty | Robust estimation: prevents overfitting |
| Dropout | Randomly deactivated hidden units | Model generalization |
Two common guiding principles and two methods are used to reduce overfitting:
K-fold cross-validation
Leave-one-out cross-validation:
((150, 4), (150,))
((90, 4), (90,))
((60, 4), (60,))
array([0.96..., 1. , 0.96..., 0.96..., 1. ])
0.98 accuracy with a standard deviation of 0.02
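The shapes and scores shown above match scikit-learn's iris cross-validation example; a sketch that reproduces them:

```python
from sklearn import datasets, svm
from sklearn.model_selection import cross_val_score, train_test_split

# Iris: 150 samples, 4 features.
X, y = datasets.load_iris(return_X_y=True)
print(X.shape, y.shape)

# Hold out 40% of the data: 90 training and 60 test samples.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0)
print(X_train.shape, X_test.shape)

# 5-fold cross-validation of a linear SVM on the full dataset.
clf = svm.SVC(kernel="linear", C=1, random_state=42)
scores = cross_val_score(clf, X, y, cv=5)
print(scores)
print(f"{scores.mean():.2f} accuracy with a standard deviation of {scores.std():.2f}")
```

Reporting the mean and standard deviation across folds, rather than a single split, is the point of K-fold cross-validation.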
TimeSeriesSplit(gap=0, max_train_size=None, n_splits=3, test_size=None)
[0 1 2] [3]
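The split above comes from scikit-learn's `TimeSeriesSplit`; a sketch with six time-ordered observations (training windows grow, test sets always lie in the future, so there is no look-ahead):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(6).reshape(-1, 1)   # six time-ordered observations

tscv = TimeSeriesSplit(n_splits=3)
print(tscv)   # repr as shown above (exact form depends on sklearn version)

# Each training set ends strictly before its test set.
for train_idx, test_idx in tscv.split(X):
    print(train_idx, test_idx)
```

This is the appropriate cross-validation scheme for financial return prediction, where shuffled K-fold splits would leak future information into training.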
| Type | Metrics | Meaning |
|---|---|---|
| Regression | MSE, MAE, R² | Fit quality |
| Classification | Accuracy, Precision, Recall, F1 | Discrimination |
| Ranking | AUC, ROC | Signal quality |
| Portfolio | Sharpe ratio | Risk-adjusted performance |
All models are wrong, but some models are useful.
--- George Box
Shallow → Deep → Reinforcement → BigData → Finance → Causal → LLMs & Agents
| Level | Training Objective |
|---|---|
| Technical | Master common ML algorithms and their evaluation |
| Financial | Build models grounded in financial data and settings |
| Research | Understand the economic meaning behind the methods |
| Lecture | Theme | Focus |
|---|---|---|
| 1 | Introduction | Background and framework |
| 2 | Shallow Learning | Classical statistical learning |
| 3 | Deep Learning | Representation learning |
| 4 | Reinforcement Learning | Decision modeling |
| 5 | Big Data | Text and image analysis |
| 6 | Asset Pricing & ML | Empirical and quantitative applications |
| 7 | Causal Inference | Causality and interpretation |
| 8 | LLMs & Agents | Agents and the AI frontier |
### [Breiman’s Two Cultures](https://projecteuclid.org/journals/statistical-science/volume-16/issue-3/Statistical-Modeling--The-Two-Cultures-with-comments-and-a/10.1214/ss/1009213726.full): Data Modeling vs. Algorithmic Modeling

| Dimension | Econometrics | Machine Learning |
|---|---|---|
| Focus | Parameter inference | Prediction accuracy |
| Model | Structural assumptions | Algorithmic optimization |
| Evaluation | Significance tests | Out-of-sample performance |
| Data | Small, structured | Large, high-dimensional |