Lecture 01

Introduction to Financial Machine Learning

“Explore how data, algorithms, and learning are transforming finance.”

Financial Machine Learning · Lecture 1

Outline

Part 1 · Why Machine Learning in Finance?

Motivation

  • Finance is shifting from hypothesis-driven to data-driven research
  • “Big Data → Big Compute → Big Finance”
  • Unstructured data is changing how research and practice are done

ML as an engine for decision-making under uncertainty.


The Changing Financial Landscape

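  • Data types:
    • Structured (prices, trades, financial statements)
    • Unstructured (news, audio, images)
  • Examples:
    • Text data
    • Internet and social media data
    • Satellite and image data
    • Weather and environmental data
    • High-dimensional financial data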

Finance becomes an information science.


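Textual Analysis: Loughran & McDonald (2011, JF)

When is a Liability not a Liability? Textual Analysis, Dictionaries, and 10‑Ks.

  • Data (big-data application)
    • More than 50,000 10‑K filings of U.S. listed firms (unstructured language data)
    • Source: SEC EDGAR database
  • Data processing
    • Built a finance-specific sentiment dictionary (the LM Dictionary)
    • Automatic tokenization, stop-word removal, and word-frequency counts
    • Constructed positive, negative, and uncertainty measures
  • Significance
    • Set the standard for textual analysis in finance
    • Turned traditional text into quantifiable big-data measures
    • Had a lasting influence on research on disclosure and market reactions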
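
The dictionary approach (tokenize, drop stop words, count list words) can be sketched in a few lines. This is only an illustration: the tiny word lists, the stop-word set, and the function name `tone_scores` are hypothetical stand-ins, not the actual Loughran–McDonald lists, which are far larger.

```python
import re
from collections import Counter

# Hypothetical mini word lists for illustration only (not the real LM dictionary).
NEGATIVE = {"loss", "impairment", "litigation", "decline"}
POSITIVE = {"gain", "improve", "strong"}
UNCERTAIN = {"may", "approximately", "uncertain"}
STOPWORDS = {"the", "a", "an", "of", "and", "in", "to", "we", "our"}

def tone_scores(text):
    """Return the share of non-stop-word tokens that fall in each word list."""
    tokens = [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]
    counts = Counter(tokens)
    total = len(tokens)

    def share(words):
        return sum(counts[w] for w in words) / total if total else 0.0

    return {
        "positive": share(POSITIVE),
        "negative": share(NEGATIVE),
        "uncertainty": share(UNCERTAIN),
    }

sample = "We may record an impairment loss and litigation costs in the strong segment."
scores = tone_scores(sample)
print(scores)
```

In practice the same counting would run over each filing, and the resulting shares become firm-year tone variables.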

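Textual Analysis: Hassan et al. (2019, QJE)

Firm‑Level Political Risk.

  • Data
    • Quarterly earnings conference call transcripts of more than 7,000 U.S. listed firms (2002–2016)
  • Processing
    • Computational-linguistics matching of word pairs (bigrams) against political and non-political training texts to identify discussion of political risk
    • Constructed a firm-level political risk index (PRisk)
    • Combined with macro policy events and firm-level decision data
  • Significance
    • Showed that text big data can generate new risk measures
    • Advanced the quantitative “text → risk → behavior” pipeline
    • A foundation for research on political and macroeconomic uncertainty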

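Textual Analysis: Hoberg & Phillips (2016, JPE)

Text‑Based Network Industries and Endogenous Product Differentiation.

  • Data
    • Product descriptions from the 10‑K filings of thousands of listed firms
  • Processing
    • Text similarity analysis (cosine similarity, TF‑IDF)
    • Built a firm-to-firm network of textual semantic similarity
    • Defined industry boundaries as an alternative to SIC/NAICS codes
  • Significance
    • Mapped language big data into economic structure
    • Revealed how industry structure and competition evolve over time
    • Made “text + network” a new direction for structural modeling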
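
A minimal sketch of text-based similarity between firms, assuming whitespace tokenization and a simple smoothed TF-IDF weighting. The three product descriptions and the helper names `tfidf_vectors` and `cosine` are made up for illustration; the paper's own construction differs in detail.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Map each document to a sparse {word: tf * idf} dictionary."""
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    df = Counter(w for toks in tokenized for w in set(toks))
    # Smoothed idf (an assumption) so that widely shared words keep some weight.
    idf = {w: math.log(n / df[w]) + 1.0 for w in df}
    return [{w: c * idf[w] for w, c in Counter(toks).items()} for toks in tokenized]

def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(u[w] * v.get(w, 0.0) for w in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Hypothetical product descriptions
descriptions = [
    "cloud software subscriptions for enterprises",
    "enterprise cloud software and analytics",
    "crude oil exploration and drilling",
]
vecs = tfidf_vectors(descriptions)
sim_software = cosine(vecs[0], vecs[1])  # two software firms
sim_cross = cosine(vecs[0], vecs[2])     # software firm vs. oil firm
print(sim_software, sim_cross)
```

Thresholding the pairwise similarity matrix gives a firm-specific set of peers, which is the sense in which text can stand in for fixed industry codes.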

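Internet and Social Media Data: Da, Engelberg & Gao (2011, Journal of Finance)

In Search of Attention.

  • Data
    • Google Search Volume Index (SVI), 2004–2008
    • Search-volume time series for thousands of stocks (over a million observations)
  • Method
    • Uses SVI as a proxy for investor attention
    • Multi-period regressions and cross-sectional analysis
  • Significance
    • Among the first to bring internet user-behavior data into finance
    • Exemplifies “big-data behavioral finance”
    • Established a real-time measure of attention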

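Internet and Social Media Data: Bollen, Mao & Zeng (2011, Journal of Computational Science)

Twitter Mood Predicts the Stock Market.

  • Data
    • More than 9.8 million Twitter messages
    • Text posted by users worldwide (2008)
  • Method
    • Lexicon-based NLP mood analysis (6 mood dimensions)
    • Constructed a “collective mood index”
    • Granger causality tests for predicting DJIA movements
  • Significance
    • An early test of the link between social media mood and market movements
    • Applied real-time unstructured data to financial prediction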

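Internet and Social Media Data: Ke, Kelly & Xiu (2021 WP; 2024 RFS)

Predicting Returns with Text Data.

  • Data
    • About 12 million Dow Jones Newswires articles
    • A high-frequency stream of professional media text
  • Method
    • Proposes SESTM (Sentiment Extraction via Screening and Topic Modeling)
    • A supervised learning model that trains text sentiment on stock return signals
  • Significance
    • Combines machine learning with financial text
    • Outperforms dictionary-based and expert-scored sentiment measures
    • Represents the frontier of algorithmic prediction from financial text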

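Satellite and Image Data (Remote Sensing): Henderson, Storeygard & Weil (2012, AER)

Measuring Economic Growth from Outer Space

  • Data
    • Global nighttime-lights satellite imagery (DMSP‑OLS)
    • Period: 1992–2008, covering more than 180 countries
  • Method
    • Light-intensity correction (noise removal, cross-satellite calibration)
    • Regression of GDP growth on light intensity
    • Panel models estimating the elasticity between lights and output
  • Significance
    • Uses remote-sensing imagery as a proxy for GDP
    • Fills gaps in official statistics and advances “satellite data economics”
    • Pioneered the use of image big data in economic research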

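Satellite and Image Data (Remote Sensing): Jean et al. (2016, Science)

Combining Satellite Imagery and Machine Learning to Predict Poverty.

  • Data
    • Multi-source satellite data (high-resolution daytime imagery + nighttime lights)
    • Covers more than 60,000 villages in several African countries
  • Method
    • Convolutional neural network (CNN) + transfer learning
    • The CNN is first trained on nighttime-light features; the learned features are then used to predict poverty maps
  • Significance
    • Demonstrates the potential of AI + remote-sensing imagery for measuring economic development
    • Generates poverty maps from public big data, offering a new tool for development evaluation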

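Weather and Environmental Data (Weather‑Based Big Data): DeHaan, Madsen & Piotroski (2017, JAR)

Do Weather‑Induced Moods Affect the Processing of Earnings News?

  • Data
    • NOAA GHCN‑Daily: more than 1,000 weather stations, 50 years of data
    • Matched to firm locations and announcement dates
  • Method
    • Constructs an “abnormal weather index”
    • Panel regressions test how weather affects reactions to earnings news
  • Significance
    • Fuses high-dimensional data across locations
    • Reveals a mood-driven information-processing mechanism
    • A representative big-data application of structured weather data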

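Weather and Environmental Data (Weather‑Based Big Data): Hirshleifer & Shumway (2003, Journal of Finance)

Good Day Sunshine: Stock Returns and the Weather.

  • Data
    • Daily cloud-cover/sunshine observations for 26 cities over multiple years
    • Stock market indices from around the world
  • Method
    • Daily-frequency panel models
    • Cross-checks between weather and market returns
  • Significance
    • An early cross-country use of a high-frequency weather panel
    • Established international evidence for the weather–mood hypothesis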

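Weather and Environmental Data (Weather‑Based Big Data): Novy‑Marx (2014, JFE)

Predicting Anomaly Performance with Politics, the Weather, Sunspots and the Stars.

  • Data
    • Multi-source time series (weather, climate, astronomy, politics, etc.)
    • Combined with stock factor returns
  • Method
    • Multivariate predictive regressions
    • Robustness analysis with high-dimensional exogenous predictors
  • Significance
    • Shows how multi-source big data can be used for validation and robustness testing
    • A cautionary “big-data skepticism” example: implausible predictors can appear to forecast anomaly returns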

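Machine Learning and High-Dimensional Financial Data: Gu, Kelly & Xiu (2020, RFS)

Empirical Asset Pricing via Machine Learning.

  • Data
    • A high-dimensional panel: hundreds of stock characteristics × thousands of stocks × several decades
  • Method
    • Systematic comparison of Lasso, Random Forest, Boosting, and neural networks (NN)
    • Feature-importance rankings
    • Construction of ML-based factor models
  • Significance
    • Nonlinear feature selection and improved prediction
    • Laid the foundation for machine-learning asset pricing
    • Moved empirical finance into the “algorithm + big data” stage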

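Machine Learning and High-Dimensional Financial Data: Bryzgalova, Pelger & Zhu (2025, Journal of Finance)

Forest through the Trees: Building Cross‑Sections of Stock Returns.

  • Data
    • Characteristics for tens of thousands of stocks (high-dimensional cross-sectional data)
  • Method
    • Random forests + factor dimension reduction to build a “Tree‑based Factor Model”
    • Maps nonlinear characteristics into interpretable pricing factors
  • Significance
    • Makes factor modeling nonlinear and machine-learning based
    • Advances the frontier of big-data asset pricing frameworks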

Finance and Big Data


Source: Nagel, S. (2021). Machine Learning in Asset Pricing. Princeton University Press.

  • Big data (the 4 Vs)
    • Volume: The amount of data collected in files, records, and tables is very large, representing many millions, or even billions, of data points (GB → TB → PB → EB → ZB).
    • Velocity: The speed with which the data are communicated is extremely great. Real-time or near-real-time data have become the norm in many areas.
    • Variety: The data are collected from many different sources and in a variety of formats, including structured data (e.g., SQL tables or CSV files), semi-structured data (e.g., HTML code), and unstructured data (e.g., video messages).
    • Value: Value density is low; useful signal is sparse relative to the volume of data.

Prices are Predictions

“Asset prices reflect collective market expectations of future cash flows.”
— SSRN (2023)

  • The core of financial research = predicting the distribution of future returns
  • ML = a tool for high-dimensional, nonlinear prediction

Summary — Why ML?


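  • Large-scale, complex data → traditional methods reach their limits
    • High-dimensional, nonlinear
    • Diverse, complex
  • ML provides flexible tools: nonlinear modeling, variable selection, robust prediction
  • Cross-disciplinary interaction: finance × computing × statistics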

Part 2 · The Two Cultures: Econometrics vs. ML

“There are two cultures in the use of statistical modeling:
one assumes the data are generated by a given stochastic model;
the other uses algorithmic models to fit the data.”
— Leo Breiman (2001)


Core Idea

| Dimension | Data‑Modeling Culture | Algorithmic‑Modeling Culture |
|---|---|---|
| Goal | Explain relationships, estimate parameters | Predict outcomes, improve accuracy |
| Approach | Assume a stochastic model and estimate parameters | Learn the mapping directly from data |
| Typical Methods | Linear/Logistic Regression, ARIMA, Probit | Decision Trees, Random Forest, Neural Networks |
| Evaluation | Significance tests, Confidence intervals | Out‑of‑sample error, Cross‑validation performance |
| Assumptions | Small samples, low dimension, strong structure | Large samples, high dimension, weak assumptions |
| Focus | Parameter interpretability (“Why”) | Predictive power (“What”) |

Breiman’s Arguments


  • Traditional statistics relies too heavily on model assumptions, often poorly representing real‑world complexity.

  • The algorithmic culture gives priority to empirical prediction, not theoretical form.

  • For many real applications (speech, vision, finance), algorithmic models often predict better than classical ones.

  • Statistical education and research should emphasize prediction, validation, and model performance rather than significance alone.


Implications for Finance

  • Econometric culture: builds theory‑driven structural models
    (e.g., CAPM, Fama–French factors, VAR).
  • Machine‑learning culture: learns complex, nonlinear relations from large data
    (e.g., Random Forest for factor selection, LSTM for forecasting, BERT for text sentiment).
  • Synthesis:
    • ML → improves predictive accuracy
    • Econometrics → provides interpretation and causal mechanisms

Prediction and causality are complements, not rivals.

Further Readings

  • Breiman (2001) “Statistical Modeling: The Two Cultures.” Statistical Science, 16(3), 199–231.
  • Varian (2014) “Big Data: New Tricks for Econometrics.” Journal of Economic Perspectives.
  • Athey & Imbens (2017) “The State of Applied Econometrics: Causality and Policy Evaluation.” Journal of Economic Perspectives, 31(2), 3–32.

Part 3 · The Landscape of ML Methods

Three Paradigms

  • Supervised Learning: ML algorithms that infer patterns between a set of inputs (the $X$’s) and the desired output ($Y$) with a labeled data set.
  • Unsupervised Learning: machine learning that does not make use of labeled data. In unsupervised learning, inputs ($X$’s) are used for analysis without any target ($Y$) being supplied. The algorithm seeks to discover structure within the data themselves. Two important types of problems in unsupervised learning are dimension reduction and clustering.
  • Reinforcement Learning: a computer learns from interacting with itself (or data generated by the same algorithm).

Pattern discovery at different levels of supervision.


Supervised Learning

  • Definition
    • The task is to learn a mapping from inputs to outputs
    • The inputs are also called the features, covariates, or predictors
    • The outputs are also called the label, target, or response
    • The experience is the training set

Supervised Learning: Classification

classification problem

  • the output space is a set of unordered and mutually exclusive labels known as classes, $\mathcal{Y} = \{1, 2, \ldots, C\}$.
  • The problem is also called pattern recognition.
  • binary classification: just two classes, often denoted by $y \in \{0, 1\}$ or $y \in \{-1, +1\}$

Example: classifying Iris flowers

Example: Image Classification

Exploratory data analysis

  • exploratory data analysis: see if there are any obvious patterns
  • tabular data with a small number of features: pair plot
  • higher-dimensional data: apply dimension reduction first, then visualize the data in 2d or 3d

Learning a classifier

  • decision rule: a single threshold on one feature gives a 1-dimensional (1d) decision boundary

  • decision tree: a more sophisticated decision rule, built from nested thresholds, involves a 2d decision surface

Empirical risk minimization

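  • misclassification rate on the training set:
    $$\mathcal{L}(\theta) = \frac{1}{N} \sum_{n=1}^{N} \mathbb{I}\big(y_n \neq f(x_n; \theta)\big)$$

  • loss function: $\ell(y, \hat{y})$
  • empirical risk: the average loss of the predictor on the training set
    $$\mathcal{L}(\theta) = \frac{1}{N} \sum_{n=1}^{N} \ell\big(y_n, f(x_n; \theta)\big)$$

  • model fitting / training via empirical risk minimization:
    $$\hat{\theta} = \arg\min_{\theta} \mathcal{L}(\theta)$$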
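
Empirical risk minimization with zero-one loss can be illustrated on a one-feature threshold rule. The training pairs and the candidate thresholds below are hypothetical, and the grid search over observed feature values is just one simple way to minimize the risk.

```python
def zero_one_loss(y_true, y_pred):
    """Loss of 1 for a misclassification, 0 otherwise."""
    return 0.0 if y_true == y_pred else 1.0

def predict(x, threshold):
    """1d decision rule: predict class 1 when the feature exceeds the threshold."""
    return 1 if x > threshold else 0

def empirical_risk(threshold, data):
    """Average zero-one loss on the training set (the misclassification rate)."""
    return sum(zero_one_loss(y, predict(x, threshold)) for x, y in data) / len(data)

# Hypothetical training set of (feature, label) pairs
train = [(0.5, 0), (1.0, 0), (1.4, 0), (1.6, 1), (2.1, 1), (2.5, 1), (1.2, 1)]

# Search over observed feature values as candidate thresholds
candidates = [x for x, _ in train]
best_threshold = min(candidates, key=lambda t: empirical_risk(t, train))
best_risk = empirical_risk(best_threshold, train)
print(best_threshold, best_risk)
```

The minimum risk is not zero here because one label overlaps the other class; no single threshold separates the data perfectly.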

Uncertainty

We must avoid false confidence bred from an ignorance of the probabilistic nature of the world, from a desire to see black and white where we should rightly see gray.

--- Immanuel Kant, as paraphrased by Maria Konnikova

  • Two types of uncertainties
    • epistemic uncertainty or model uncertainty: due to lack of knowledge of the input-output mapping
    • aleatoric uncertainty or data uncertainty: due to intrinsic (irreducible) stochasticity in the mapping
  • We can capture our uncertainty using the following conditional probability distribution:
    $$p(y = c \mid x; \theta) = f_c(x; \theta)$$

Maximum likelihood estimation

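  • likelihood function:
    $$p(\mathcal{D} \mid \theta) = \prod_{n=1}^{N} p(y_n \mid x_n; \theta)$$

  • log likelihood function:
    $$\log p(\mathcal{D} \mid \theta) = \sum_{n=1}^{N} \log p(y_n \mid x_n; \theta)$$

  • Negative Log Likelihood (NLL): the average negative log probability of the training set.
    $$\mathrm{NLL}(\theta) = -\frac{1}{N} \sum_{n=1}^{N} \log p(y_n \mid x_n; \theta)$$

  • the maximum likelihood estimate (MLE):
    $$\hat{\theta}_{\mathrm{mle}} = \arg\min_{\theta} \mathrm{NLL}(\theta)$$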

Regression

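  • the output space: $\mathcal{Y} = \mathbb{R}$.
  • loss function: quadratic loss, or $\ell_2$ loss:
    $$\ell_2(y, \hat{y}) = (y - \hat{y})^2$$

  • mean squared error or MSE:
    $$\mathrm{MSE}(\theta) = \frac{1}{N} \sum_{n=1}^{N} \big(y_n - f(x_n; \theta)\big)^2$$

  • An Example
    • Uncertainty: Gaussian / Normal
      $$\mathcal{N}(y \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\Big(-\frac{(y - \mu)^2}{2\sigma^2}\Big)$$

  • the conditional distribution:
    $$p(y \mid x; \theta) = \mathcal{N}\big(y \mid f(x; \theta), \sigma^2\big)$$

  • NLL (for fixed $\sigma^2$):
    $$\mathrm{NLL}(\theta) = \frac{1}{2\sigma^2} \mathrm{MSE}(\theta) + \frac{1}{2}\log(2\pi\sigma^2)$$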
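
A small numerical check of the link between the Gaussian negative log likelihood and the MSE, assuming a fixed noise level. The data, the two sets of predictions, and the function names are hypothetical; the point is that differences in NLL equal differences in MSE scaled by $1/(2\sigma^2)$, so both criteria rank predictors identically.

```python
import math

def gaussian_nll(y, mu, sigma):
    """Average negative log likelihood of y under N(mu_n, sigma^2)."""
    const = 0.5 * math.log(2 * math.pi * sigma ** 2)
    return sum(const + (yi - mi) ** 2 / (2 * sigma ** 2) for yi, mi in zip(y, mu)) / len(y)

def mse(y, mu):
    """Mean squared error between targets and predictions."""
    return sum((yi - mi) ** 2 for yi, mi in zip(y, mu)) / len(y)

y = [1.0, 2.0, 3.5]        # hypothetical targets
pred_a = [1.1, 1.9, 3.0]   # a reasonable predictor
pred_b = [0.0, 0.0, 0.0]   # a poor predictor
sigma = 1.0                # assumed fixed noise standard deviation

nll_gap = gaussian_nll(y, pred_a, sigma) - gaussian_nll(y, pred_b, sigma)
mse_gap = mse(y, pred_a) - mse(y, pred_b)
print(nll_gap, mse_gap / (2 * sigma ** 2))
```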

(Simple) Linear regression

  • functional form of model:
    $$f(x; \theta) = b + w x$$

  • parameters: $\theta = (w, b)$, with slope $w$ and intercept $b$
  • least squares estimator:
    $$\hat{\theta} = \arg\min_{\theta} \mathrm{MSE}(\theta)$$
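
For simple linear regression the least squares estimator has a closed form: the slope is the sample covariance of $x$ and $y$ divided by the sample variance of $x$. A minimal sketch, with hypothetical data that lie exactly on a line and a hypothetical function name `fit_simple_ols`:

```python
def fit_simple_ols(x, y):
    """Closed-form least squares fit of y = intercept + slope * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

x = [0.0, 1.0, 2.0, 3.0]
y = [1.0, 3.0, 5.0, 7.0]  # generated as y = 1 + 2x, no noise
intercept, slope = fit_simple_ols(x, y)
print(intercept, slope)
```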

Polynomial regression

  • functional form of model:
    $$f(x; w) = w^\top \phi(x), \quad \phi(x) = [1, x, x^2, \ldots, x^D]$$

  • feature preprocessing, or feature engineering: $\phi$ expands the raw input into polynomial features, so the model stays linear in the parameters $w$

Deep neural networks

  • deep neural networks (DNN): a stack of $L$ nested functions:
    $$f(x; \theta) = f_L\big(f_{L-1}(\cdots f_1(x) \cdots)\big)$$

  • the function at layer $\ell$: $f_\ell(x) = f(x; \theta_\ell)$
    • the final layer: $f_L(x) = w_L^\top f_{1:L-1}(x)$

    • the learned feature extractor: $\phi(x) = f_{1:L-1}(x)$

Overfitting and generalization

  • Underfitting means the model does not capture the relationships in the data.
  • Overfitting means the model begins to incorporate noise coming from quirks or spurious correlations
    • it mistakes randomness for patterns and relationships
    • memorized the data, rather than learned from it
    • high noise levels in the data and too much complexity in the model
    • complexity refers to the number of features, terms, or branches in the model and to whether the model is linear or non-linear (non-linear is more complex).
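  • empirical risk (on the training set):
    $$\mathcal{L}(\theta; \mathcal{D}_{\text{train}}) = \frac{1}{|\mathcal{D}_{\text{train}}|} \sum_{(x, y) \in \mathcal{D}_{\text{train}}} \ell\big(y, f(x; \theta)\big)$$

  • population risk (under the true distribution $p^*$):
    $$\mathcal{L}(\theta; p^*) = \mathbb{E}_{p^*(x, y)}\big[\ell\big(y, f(x; \theta)\big)\big]$$

  • generalization gap: $\mathcal{L}(\theta; p^*) - \mathcal{L}(\theta; \mathcal{D}_{\text{train}})$

  • test risk (an estimate of the population risk from held-out data):
    $$\mathcal{L}(\theta; \mathcal{D}_{\text{test}}) = \frac{1}{|\mathcal{D}_{\text{test}}|} \sum_{(x, y) \in \mathcal{D}_{\text{test}}} \ell\big(y, f(x; \theta)\big)$$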

Evaluating ML Algorithm Performance: Errors & Overfitting

  • Data scientists decompose the total out-of-sample error into three sources:
    • Bias error, or the degree to which a model fits the training data. Algorithms with erroneous assumptions produce high bias with poor approximation, causing underfitting and high in-sample error.
    • Variance error, or how much the model's results change in response to new data from validation and test samples. Unstable models pick up noise and produce high variance, causing overfitting.
    • Base error due to randomness in the data.

learning curve

  • A learning curve plots the accuracy rate (= 1 – error rate) in the validation or test samples (i.e., out-of-sample) against the amount of data in the training sample

fitting curve

  • A fitting curve shows in- and out-of-sample error rates ($E_{in}$ and $E_{out}$) on the $y$-axis plotted against model complexity on the $x$-axis
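
The points of a fitting curve can be computed by fitting polynomials of increasing degree and recording training and test error. The sketch below is a toy setup under stated assumptions: the data-generating process (a quadratic plus Gaussian noise), the sample sizes, the degrees, and all helper names are made up, and the least squares problem is solved with the normal equations in plain Python.

```python
import random

def solve(A, b):
    """Solve A w = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    w = [0.0] * n
    for r in range(n - 1, -1, -1):
        w[r] = (M[r][n] - sum(M[r][k] * w[k] for k in range(r + 1, n))) / M[r][r]
    return w

def fit_poly(x, y, degree):
    """Least squares polynomial fit via the normal equations X'X w = X'y."""
    X = [[xi ** d for d in range(degree + 1)] for xi in x]
    k = degree + 1
    XtX = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
    Xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(k)]
    return solve(XtX, Xty)

def mse(w, x, y):
    """Mean squared error of the polynomial with coefficients w."""
    return sum((sum(wd * xi ** d for d, wd in enumerate(w)) - yi) ** 2
               for xi, yi in zip(x, y)) / len(x)

random.seed(0)

def sample(n):
    """Hypothetical data-generating process: y = x^2 + noise."""
    x = [random.uniform(-1, 1) for _ in range(n)]
    y = [xi ** 2 + random.gauss(0, 0.2) for xi in x]
    return x, y

x_train, y_train = sample(15)
x_test, y_test = sample(200)

risks = {}
for degree in (1, 2, 6):
    w = fit_poly(x_train, y_train, degree)
    risks[degree] = (mse(w, x_train, y_train), mse(w, x_test, y_test))
    print(degree, risks[degree])
```

Training error can only fall as the degree rises, because each model nests the previous one. Test error typically falls and then rises again, which is the U-shape a fitting curve is meant to reveal.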

Unsupervised learning

  • unsupervised learning: “inputs” $\mathcal{D} = \{x_n : n = 1, \ldots, N\}$ without any corresponding “outputs” $y_n$.
  • the task: fitting an unconditional model of the form $p(x)$

When we’re learning to see, nobody’s telling us what the right answers are — we just look. Every so often, your mother says “that’s a dog”, but that’s very little information. You’d be lucky if you got a few bits of information — even one bit per second — that way. The brain’s visual system has $10^{14}$ neural connections. And you only live for $10^9$ seconds. So it’s no use learning one bit per second. You need more like $10^5$ bits per second. And there’s only one place you can get that much information: from the input itself.

--- Geoffrey Hinton, 1996

Unsupervised learning: Clustering

  • finding clusters in data: partition the input into regions that contain “similar” points.
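
A minimal k-means sketch shows what partitioning the input into regions of similar points means in practice. The six 2d points, the naive initialization from the first `k` points, and the helper names are assumptions for illustration; production code would use a library implementation with better initialization.

```python
def nearest(p, centers):
    """Index of the center closest to point p (squared Euclidean distance)."""
    return min(range(len(centers)),
               key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))

def kmeans(points, k, iters=10):
    """Plain k-means: alternate between assigning points and updating centers."""
    centers = list(points[:k])  # naive deterministic initialization
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p, centers)].append(p)
        centers = [
            tuple(sum(col) / len(cl) for col in zip(*cl)) if cl else centers[j]
            for j, cl in enumerate(clusters)
        ]
    labels = [nearest(p, centers) for p in points]
    return centers, labels

# Hypothetical data: two well-separated groups
points = [(0.0, 0.1), (5.0, 5.1), (0.2, 0.0), (5.2, 4.9), (0.1, 0.2), (4.9, 5.0)]
centers, labels = kmeans(points, k=2)
print(centers, labels)
```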

Discovering latent “factors of variation”

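  • Assume that each observed high-dimensional output $x_n \in \mathbb{R}^D$ was generated by a set of hidden or unobserved low-dimensional latent factors $z_n \in \mathbb{R}^K$.

  • factor analysis (FA): a linear-Gaussian model
    $$p(z_n) = \mathcal{N}(z_n \mid \mu_0, \Sigma_0), \quad p(x_n \mid z_n; \theta) = \mathcal{N}(x_n \mid W z_n + \mu, \Sigma)$$

  • principal components analysis (PCA): the special case of FA with $\Sigma = \sigma^2 I$, in the limit $\sigma^2 \to 0$

  • nonlinear models: neural networks, e.g. $p(x_n \mid z_n; \theta) = \mathcal{N}\big(x_n \mid f(z_n; \theta), \sigma^2 I\big)$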
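
As a sketch of extracting one latent factor, the first principal component can be found by power iteration on the sample covariance matrix. The data below are hypothetical and lie exactly on the line $y = 2x$, so a single factor explains all the variation; the function name and iteration count are illustrative choices.

```python
def first_principal_component(X, iters=200):
    """Leading eigenvector of the sample covariance matrix via power iteration."""
    n, d = len(X), len(X[0])
    mean = [sum(row[j] for row in X) / n for j in range(d)]
    Z = [[row[j] - mean[j] for j in range(d)] for row in X]  # centered data
    C = [[sum(z[i] * z[j] for z in Z) / (n - 1) for j in range(d)] for i in range(d)]
    v = [1.0] * d
    for _ in range(iters):
        v = [sum(C[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = sum(x * x for x in v) ** 0.5
        v = [x / norm for x in v]
    scores = [sum(zi * vi for zi, vi in zip(z, v)) for z in Z]  # 1d latent factor
    return v, scores

# Hypothetical 2d observations driven by a single factor
X = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
direction, scores = first_principal_component(X)
print(direction, scores)
```

The returned direction is proportional to $(1, 2)$, and the scores are the one-dimensional coordinates of each observation along it.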

Reinforcement learning

  • online / dynamic version of machine learning
    • the system or agent has to learn how to interact with its environment
    • RL is closely related to the Markov Decision Process (MDP)

Markov Decision Process

The MDP is the sequence of random variables $(X_n)$ which describes the stochastic evolution of the system states. Of course the distribution of $(X_n)$ depends on the chosen actions.

  • $E$ denotes the state space of the system. A state $x \in E$ is the information which is available for the controller at time $n$. Given this information an action has to be selected.

  • $A$ denotes the action space. Given a specific state $x \in E$ at time $n$, only a certain subclass $D_n(x) \subset A$ of actions may be admissible.

  • $Q_n(B \mid x, a)$ is a stochastic transition kernel which gives the probability that the next state at time $n+1$ is in the set $B$ if the current state is $x$ and action $a$ is taken at time $n$.

  • $r_n(x, a)$ gives the (discounted) one-stage reward of the system at time $n$ if the current state is $x$ and action $a$ is taken.

  • $g_N(x)$ gives the (discounted) terminal reward of the system at the end of the planning horizon.

A control $\pi$ is a sequence of decision rules $(f_n)$ with $f_n : E \to A$, where $f_n(x) \in D_n(x)$ determines for each possible state $x \in E$ the next action at time $n$. Such a sequence $\pi = (f_n)$ is called a policy or strategy. Formally the Markov Decision Problem is given by

$$V_0(x) := \sup_{\pi} \mathbb{E}_x^{\pi}\left[\sum_{k=0}^{N-1} r_k\big(X_k, f_k(X_k)\big) + g_N(X_N)\right], \quad x \in E.$$
  • Types of MDP problems:

    • finite horizon ($N < \infty$) vs. infinite horizon ($N = \infty$)
    • complete state observation vs. partial state observation
    • problems with constraints vs. without constraints
    • total (discounted) cost criterion vs. average cost criterion
  • Research questions:

    • Does an optimal policy exist?
    • Does it have a particular form?
    • Can an optimal policy be computed efficiently?
    • Is it possible to derive properties of the optimal value function analytically?
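
For a finite horizon and finite state and action sets, an optimal policy can be computed by backward induction: start from the terminal reward and step back one period at a time. The sketch below assumes time-homogeneous rewards and transitions for brevity; the two-state example and all function names are hypothetical.

```python
def backward_induction(states, actions, transition, reward, terminal_reward, horizon):
    """Finite-horizon dynamic programming.

    transition(s, a) returns a list of (probability, next_state) pairs.
    Returns the time-0 value function and one decision rule per period.
    """
    V = {s: terminal_reward(s) for s in states}
    policy = []
    for _ in range(horizon):
        V_new, rule = {}, {}
        for s in states:
            best_a, best_q = None, float("-inf")
            for a in actions(s):
                q = reward(s, a) + sum(p * V[s2] for p, s2 in transition(s, a))
                if q > best_q:
                    best_a, best_q = a, q
            V_new[s], rule[s] = best_q, best_a
        V = V_new
        policy.insert(0, rule)  # rules are computed from the last period backwards
    return V, policy

# Hypothetical toy problem: staying in the "high" state pays 1 per period
states = ["low", "high"]
actions = lambda s: ["stay", "switch"]
other = {"low": "high", "high": "low"}
transition = lambda s, a: [(1.0, s if a == "stay" else other[s])]
reward = lambda s, a: 1.0 if (s == "high" and a == "stay") else 0.0
terminal_reward = lambda s: 0.0

V0, policy = backward_induction(states, actions, transition, reward, terminal_reward, horizon=3)
print(V0, policy[0])
```

Starting in the low state, the best plan is to switch once and then stay, which forgoes one period of reward.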

Applications of MDP: Consumption Problem

Suppose there is an investor with given initial capital. At the beginning of each of $N$ periods she can decide how much of the capital she consumes and how much she invests into a risky asset. The amount she consumes is evaluated by a utility function as well as the terminal wealth. The remaining capital is invested into a risky asset where we assume that the investor is small and thus not able to influence the asset price and the asset is liquid. How should she consume/invest in order to maximize the sum of her expected discounted utility?

Applications of MDP: Cash Balance or Inventory Problem

Imagine a company which tries to find the optimal level of cash over a finite number of periods. We assume that there is a random stochastic change in the cash reserve each period (due to withdrawal or earnings). Since the firm does not earn interest from the cash position, there are holding cost for the cash reserve if it is positive, but also interest (cost) in case it is negative. The cash reserve can be increased or decreased by the management at each decision epoch which implies transfer costs. What is the optimal cash balance policy?


Applications of MDP: Mean-Variance Problem

Consider a small investor who acts on a given financial market. Her aim is to choose among all portfolios which yield at least a certain expected return (benchmark) after $N$ periods, the one with smallest portfolio variance. What is the optimal investment strategy?

Applications of MDP: Dividend Problem in Risk Theory

Imagine we consider the risk reserve of an insurance company which earns some premia on the one hand but has to pay out possible claims on the other hand. At the beginning of each period the insurer can decide upon paying a dividend. A dividend can only be paid when the risk reserve at that time point is positive. Once the risk reserve got negative we say that the company is ruined and has to stop its business. Which dividend pay-out policy maximizes the expected discounted dividends until ruin?


Applications of MDP: Bandit Problem

Suppose we have two slot machines with unknown success probabilities $\theta_1$ and $\theta_2$. At each stage we have to choose one of the arms. We receive one Euro if the arm wins, else no cash flow appears. How should we play in order to maximize our expected total reward over $N$ trials?
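
One simple heuristic for the bandit problem is epsilon-greedy play: usually pull the arm with the best observed win rate, but explore a random arm with small probability. This is a simulation sketch, not the optimal policy of the MDP formulation; the success probabilities, the exploration rate, and the function name are assumptions.

```python
import random

def epsilon_greedy(success_probs, n_trials, epsilon=0.1, seed=0):
    """Simulate epsilon-greedy play on Bernoulli arms; return total reward and pull counts."""
    rng = random.Random(seed)
    k = len(success_probs)
    pulls = [0] * k
    wins = [0] * k
    total = 0
    for _ in range(n_trials):
        if rng.random() < epsilon:
            arm = rng.randrange(k)  # explore
        else:
            # Untried arms get an infinite estimate so each arm is tried at least once.
            means = [wins[a] / pulls[a] if pulls[a] else float("inf") for a in range(k)]
            arm = max(range(k), key=lambda a: means[a])  # exploit
        payoff = 1 if rng.random() < success_probs[arm] else 0
        pulls[arm] += 1
        wins[arm] += payoff
        total += payoff
    return total, pulls

# Hypothetical success probabilities, unknown to the player
total, pulls = epsilon_greedy([0.3, 0.7], n_trials=2000)
print(total, pulls)
```

The trade-off is visible in `epsilon`: more exploration learns the arms faster but wastes pulls on the inferior machine.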

Applications of MDP: Pricing of American Options

In order to find the fair price of an American option and its optimal exercise time, one has to solve an optimal stopping problem. In contrast to a European option the buyer of an American option can choose to exercise any time up to and including the expiration time. Such an optimal stopping problem can be solved in the framework of Markov Decision Processes.

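
The optimal stopping view of American options can be made concrete with a binomial tree: at every node the holder compares the value of exercising now with the discounted expected value of continuing. The sketch below uses a Cox–Ross–Rubinstein tree for a put; the parameter values and the function name are illustrative assumptions.

```python
import math

def binomial_put(s0, strike, rate, sigma, maturity, steps, american=True):
    """Price a put on a CRR binomial tree by backward induction."""
    dt = maturity / steps
    u = math.exp(sigma * math.sqrt(dt))
    d = 1.0 / u
    q = (math.exp(rate * dt) - d) / (u - d)  # risk-neutral up probability
    disc = math.exp(-rate * dt)
    # Payoffs at maturity; j counts the number of up moves
    values = [max(strike - s0 * u ** j * d ** (steps - j), 0.0) for j in range(steps + 1)]
    for n in range(steps - 1, -1, -1):
        for j in range(n + 1):
            continuation = disc * (q * values[j + 1] + (1 - q) * values[j])
            if american:
                exercise = max(strike - s0 * u ** j * d ** (n - j), 0.0)
                values[j] = max(continuation, exercise)  # stop or continue
            else:
                values[j] = continuation
    return values[0]

# Hypothetical contract: at-the-money one-year put
american_price = binomial_put(100.0, 100.0, 0.05, 0.2, 1.0, 200, american=True)
european_price = binomial_put(100.0, 100.0, 0.05, 0.2, 1.0, 200, american=False)
print(american_price, european_price)
```

The gap between the two prices is the value of the early exercise right, which is positive for a put when the interest rate is positive.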

Algorithm Overview (1/2)

| Category | Example Algorithms | Finance Applications |
|---|---|---|
| Regression | OLS, LASSO, Partial Least Squares (PLS), Principal Component Regression (PCR), Random Forest | Asset pricing, risk models, factor selection |
| Classification | Logistic Regression, SVM, XGBoost, MLP | Credit scoring, fraud detection |
| Clustering | k-means, hierarchical clustering | Market segmentation, regime detection |
| Dimensionality Reduction | PCA, Autoencoder, PLS, PCR | Factor extraction, latent risk modeling |

Algorithm Overview (2/2)

| Category | Example Algorithms | Finance Applications |
|---|---|---|
| Advanced Causal / Structural ML | Double Machine Learning (DML) | Treatment effects, policy evaluation, causal inference in finance |
| Representation Learning (Deep) | CNN, RNN, Transformer, GNN | Text & news analytics, time-series forecasting, ESG image analysis |
| Generative Models | GAN, Variational Autoencoder | Scenario simulation, synthetic financial data |
| Reinforcement Learning (RL) | Q-learning, PPO | Portfolio optimization, trading strategies |

Selecting ML Algorithms

| Variables | Supervised (Target Variable) | Unsupervised (No Target Variable) |
|---|---|---|
| Continuous | Regression: Linear; Penalized Regression/LASSO; Logistic; Classification and Regression Tree (CART); Random Forest | Dimensionality Reduction: Principal Components Analysis (PCA). Clustering: K-Means; Hierarchical |
| Categorical | Classification: Logit; Support Vector Machine (SVM); K-Nearest Neighbor (KNN); Classification and Regression Tree (CART) | Dimensionality Reduction: Principal Components Analysis (PCA). Clustering: K-Means; Hierarchical |
| Continuous or Categorical | Neural Networks; Deep Learning; Reinforcement Learning | Neural Networks; Deep Learning; Reinforcement Learning |

source: CFA Curriculum

Course Roadmap (8 Lectures)

  1. Introduction to Financial ML (today)
  2. Shallow Learning Algorithms
  3. Deep Learning Algorithms
  4. Reinforcement Learning
  5. Big Data & ML in Finance
  6. Empirical Asset Pricing & ML
  7. Causal Inference & ML
  8. LLMs & Agents for Finance

Part 4 · ML in Financial Practice

Case 1 · Empirical Asset Pricing via Machine Learning

Gu, Kelly & Xiu (2020, Review of Financial Studies)

"Machine learning extracts return‑predictive structure from the
high‑dimensional space of firm characteristics."


Motivation & Research Question

  • Background:
    Empirical asset pricing traditionally relies on a few linear factors (e.g., Fama–French 3F/5F).
    Yet literature has discovered hundreds of potential predictors → the “factor zoo.”

  • Challenges:

    • High‑dimensional problem (many candidate predictors)
    • Multicollinearity and unstable coefficients
    • Very low out‑of‑sample R² (≈ 0%)
  • Core question:

    Can machine learning methods find stable and economically meaningful prediction patterns
    in the cross‑section of expected stock returns?


Model and Data

  • Data

    • U.S. equities (1957 – 2016) monthly.
    • 94 firm characteristics + 8 macroeconomic predictors (plus industry dummies) → input features $z_{i,t}$.
    • Predict one‑month‑ahead excess returns $r_{i,t+1}$.
  • Machine Learning Models
    1. Regularized linear models (LASSO, Ridge, Elastic Net)
    2. Tree‑based models (Random Forest, Gradient Boosting)
    3. Neural networks (Feedforward MLP)

  • Estimation Design

    • ML fit on the stock‑month panel, refit annually: $r_{i,t+1} = g(z_{i,t}) + \epsilon_{i,t+1}$
    • Rolling OOS validation & economic performance checking.
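
Predictive performance in this literature is usually summarized by an out-of-sample $R^2$. A sketch of the statistic, assuming the convention of Gu, Kelly & Xiu (2020) in which the denominator is the sum of squared realized returns without demeaning, so the benchmark is a forecast of zero. The return numbers and the function name are hypothetical.

```python
def r2_oos(realized, predicted):
    """Out-of-sample R^2 against a zero forecast: 1 - SSE / sum(realized^2)."""
    sse = sum((r - p) ** 2 for r, p in zip(realized, predicted))
    sst = sum(r ** 2 for r in realized)
    return 1.0 - sse / sst

# Hypothetical pooled stock-month excess returns in the test sample
realized = [0.02, -0.01, 0.03, -0.04, 0.01]

zero_forecast = [0.0] * len(realized)
half_signal = [0.5 * r for r in realized]  # forecast capturing half of each return

print(r2_oos(realized, zero_forecast))
print(r2_oos(realized, half_signal))
```

Comparing against zero rather than the historical mean is a deliberate choice: historical mean returns of individual stocks are so noisy that they make an artificially easy benchmark.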

Predictive Performance

| Model | Out‑of‑Sample R² (monthly) | Description |
|---|---|---|
| OLS (all predictors) | −3.46 % | Traditional linear benchmark |
| Elastic Net | 0.11 % | Regularized linear structure |
| RF / GBRT | 0.33 – 0.34 % | Nonlinear tree ensembles |
| NN (3 hidden layers) | 0.40 % | Best predictive model |

Portfolio Implications

  • Portfolios sorted on ML‑predicted returns earn significant alphas.
  • Sharpe ratios up to roughly 2× those of leading regression‑based strategies.
  • Predictive patterns robust over sub‑periods and market conditions.

Economic Interpretation

  • ML models automatically rediscover and combine known signals

    • value, momentum, profitability, investment, liquidity, sentiment.
  • Capture nonlinear interactions and time‑varying relations often ignored by linear models.

  • The learned prediction function implies an estimated stochastic discount factor (SDF)
    consistent with no‑arbitrage pricing.

Regularization = statistical discipline; ML = economic flexibility.

Takeaways

  • Machine learning reframes asset pricing as a forecasting problem.
  • Moves the field from small, parametric models to data‑driven, nonlinear mappings.
  • Provides a bridge between Breiman’s “algorithmic modeling culture”
      and financial econometrics.

Bottom Line:

ML techniques deliver higher predictive accuracy, robust SDF estimates, and new ways to understand risk premia.


Reference

Gu, S., Kelly, B., and Xiu, D. (2020).
Empirical Asset Pricing via Machine Learning.
Review of Financial Studies, 33(5): 2223 – 2273.

  • Data and code available at https://dachxiu.github.io/
  • Highly influential study in financial machine learning literature

Case 2 · Portfolio Optimization

Dixon et al. (2020) · Deep Reinforcement Learning in Finance

“The goal is to let a learning agent autonomously adjust portfolio weights to
maximize risk‑adjusted returns through interaction with the market environment.”


Motivation · Traditional vs Reinforcement Approaches

  • Classical portfolio optimization (Markowitz 1952):
    One‑shot mean–variance trade‑off given expected returns and covariances.
    → Static optimization; requires ex ante distributional assumptions.

  • Real investment environments:

    • Dynamic and uncertain markets
    • Transaction costs and constraints
    • Need for adaptive, sequential decisions
  • Reinforcement Learning (RL):
    Learns an optimal policy through trial and error to maximize cumulative reward.


Reinforcement Learning Framework in Portfolio Choice

| Element | Financial Interpretation |
|---|---|
| Environment (state $s_t$) | Market features at time $t$: prices, returns, volatility, macro signals |
| Agent’s action $a_t$ | Portfolio rebalancing vector (weights or trades) |
| Reward $r_t$ | Risk‑adjusted return (e.g., Sharpe ratio or log‑utility gain) |
| Policy $\pi(a \mid s)$ | Mapping from observed state to allocation decision |
| Goal | Maximize expected discounted cumulative reward $\mathbb{E}\big[\sum_t \gamma^t r_t\big]$ |

Training loop (Agent ↔ Market):
1. Observe market state → 2. Select action (rebalance) → 3. Receive reward → 4. Update policy.

Methodology Highlights (from Dixon et al., 2020)

Model Type: Deep Reinforcement Learning (DRL)

  • State vector $s_t$: log returns, volatility, moving averages, momentum signals, past allocations.
  • Action vector $a_t$: continuous portfolio weights $w_t$ subject to $\sum_i w_{i,t} = 1$, $w_{i,t} \ge 0$.
  • Reward function: balances return against a risk penalty.
  • Policy learning: Deep Deterministic Policy Gradient (DDPG) or Proximal Policy Optimization (PPO).
  • Exploration vs exploitation: ε‑greedy or stochastic policy sampling to improve learning efficiency.

Empirical Setup & Results

  • Data: Daily prices of S&P 500 constituents (or ETF basket).
  • Training: 1990 – 2010 rolling window; testing 2011 – 2019.
  • Benchmarks: Equal Weight (EW), Mean–Variance, Risk‑Parity strategies.

Key Findings

  • DRL models achieve higher annualized Sharpe ratios (≈ 1.5–2×) over benchmarks.
  • Policies adaptively shift into safe assets during drawdowns.
  • Transaction cost robustness built via penalty in reward function.

→ Evidence that a learning agent can continuously optimize asset allocation under dynamic uncertainty.


Interpretation & Insights

  • RL formulation unifies forecasting and control:
    estimates both expected returns and optimal actions within an adaptive feedback loop.
  • Learned policy ≈ data‑driven dynamic extension of Markowitz:

  • The agent’s behavior is interpretable: pattern‑based rebalancing (more risk aversion in high volatility).
  • Challenges: sample‑efficiency and stability of deep RL; interpretability vs. black‑box learning.

Visualization: Agent ↔ Market ↔ Reward Loop


flowchart LR
    A[Market State $s_t$] -->|observe| B["Agent Policy π(a|s)"]
    B -->|action $a_t$| C[Portfolio Rebalancing]
    C -->|new prices| D[Market Environment]
    D -->|reward $r_t$| A

The loop illustrates continuous interaction:
agent learns optimal allocations as markets evolve.

Takeaways

  • Reinforcement learning extends portfolio theory to a multi‑period, feedback‑driven world.
  • DRL agents outperform static benchmarks by incorporating nonlinear dynamics and adaptive risk control.
  • Combining ML forecast power (GKX 2020) with RL decision optimization (Dixon 2020) forms the core of modern Financial AI Infrastructure.

References

  • Dixon, M., Halperin, I., and Bilokon, P. (2020). Machine Learning in Finance: From Theory to Practice. Springer.
  • Moody, J. & Saffell, M. (2001). Learning to Trade via Direct Reinforcement. IEEE Transactions on Neural Networks, 12(4), 875–889.
  • Li, Y., and Dixon, M. (2020). Deep Reinforcement Learning for Dynamic Portfolio Optimization. Applied Stochastic Models in Business and Industry.

Case 3 · Credit Risk Modeling

Lessmann et al. (2015) · EJOR

“Benchmarking modern classification algorithms in credit scoring reveals that
 machine learning models consistently outperform traditional statistical approaches.”


Motivation & Problem

  • Credit risk modeling is a core task in banking: estimate each applicant’s probability of default (PD).
  • Classical scorecards use logistic regression for simplicity and explainability.
  • However, in large‑scale datasets, relationships between borrower features and default risk are:
    • Nonlinear
    • High‑dimensional
    • Containing strong interactions

Research question:

Which machine learning methods perform best for credit scoring tasks?


Data & Experimental Design

  • Data: 8 public and proprietary credit scoring datasets, ranging from under 1,000 to 150,000 observations.
    • Personal loan and credit‑card applications.
    • Features: demographics, financial ratios, repayment history.
    • Target: default (1) vs non‑default (0).
  • Algorithms benchmarked (41 classifiers in total), including:
    1. Logistic Regression (baseline)
    2. Decision Trees
    3. Random Forest
    4. Gradient Boosting (MART / GBM)
    5. Support Vector Machines (SVM)
    6. Neural Networks (MLP)
    7. k‑Nearest Neighbors (kNN)
    8. Naïve Bayes
  • Evaluation metrics: AUC, Brier Score, percentage correctly classified (PCC), H‑measure, Kolmogorov–Smirnov statistic, partial Gini index.
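
Two of the evaluation metrics are easy to compute by hand. AUC is the probability that a randomly chosen defaulter receives a higher score than a randomly chosen non-defaulter (ties count one half), and the Brier score is the mean squared error of the predicted default probabilities. The labels and scores below are hypothetical.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((1.0 if p > n else (0.5 if p == n else 0.0)) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def brier_score(labels, probs):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for y, p in zip(labels, probs)) / len(labels)

# Hypothetical default indicators (1 = default) and predicted PDs
labels = [0, 0, 1, 1]
pd_scores = [0.1, 0.4, 0.35, 0.8]

print(auc(labels, pd_scores))          # ranking quality
print(brier_score(labels, pd_scores))  # calibration and sharpness
```

AUC depends only on the ordering of scores, while the Brier score also penalizes poorly calibrated probabilities, which is why benchmark studies report both.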

Key Empirical Findings

| Model | Mean AUC (average across datasets) | Comments |
|------------|---------------|------------------|
| Logistic Regression | ≈ 0.75 | Industry-standard benchmark |
| Decision Tree | 0.77 – 0.80 | Simple nonlinear model |
| Random Forest | 0.82 – 0.85 | Robust ensemble performance |
| Gradient Boosting (GBM) | 0.84 – 0.88 | Top-ranked accuracy and consistency |
| Neural Network | 0.80 – 0.86 | Competitive but data‑sensitive |
| SVM | ≈ 0.82 | Needs careful parameter tuning |

Result Summary

  • Ensemble methods (GBM, RF) systematically outperform traditional statistical models such as logistic regression.
  • Models with regularization and probability calibration yield most stable PD estimates.
  • No single method wins universally → model selection should consider data size and imbalance.

Interpretation & Insights

  • Gradient Boosting captures complex nonlinear feature interactions and produces a robust ranking of PD.
  • Neural Networks can approximate nonlinear decision boundaries but require more data and tuning.
  • Explainability tools (e.g., feature importance, partial dependence plots) help meet regulatory transparency requirements.
  • The study provides a widely used benchmark against which banks and fintechs validate new ML credit models.

Visualization — Credit Scoring Pipeline


flowchart LR
    A["Loan Application Data"] --> B["Preprocessing & Feature Engineering"]
    B --> C["Classification Algorithms (Logit, GBM, NN)"]
    C --> D["Probability of Default (PD Score 0-1)"]
    D --> E["Performance Metrics (AUC, PRC, KS)"]
    E --> F["Credit Decision / Capital Allocation"]

The pipeline illustrates how ML models estimate PD and support risk‑based lending decisions.
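
The "Performance Metrics" step of the pipeline can be made concrete with a small sketch. The labels and PD scores below are hypothetical, invented only for illustration. The KS statistic is computed as the largest gap between the true positive rate and the false positive rate along the ROC curve, which is one common way to obtain it.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical PD scores for eight applicants (1 = default, 0 = non-default).
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 1])
pd_score = np.array([0.05, 0.10, 0.20, 0.30, 0.35, 0.60, 0.70, 0.90])

# AUC: probability that a random defaulter gets a higher PD than a random non-defaulter.
auc = roc_auc_score(y_true, pd_score)

# KS: maximum distance between the cumulative score distributions of
# defaulters and non-defaulters, read off the ROC curve as max(TPR - FPR).
fpr, tpr, _ = roc_curve(y_true, pd_score, drop_intermediate=False)
ks = float(np.max(tpr - fpr))

print(f"AUC = {auc:.3f}, KS = {ks:.3f}")
```

Both metrics depend only on how the scores rank applicants, not on whether the PD values are well calibrated, which is why the benchmark also reports the Brier score.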


Takeaways

  • Machine learning enhances credit risk prediction, especially ensemble methods like GBM and RF.
  • Benchmark study (265 experiments) shows consistent gains over the logistic baseline.
    • Supports the industry trend of adopting modern ML within the Basel IRB framework, together with interpretability tools.
  • Foundation for later “Explainable AI in Credit Risk” literature.

Reference

  • Lessmann, S., Baesens, B., Seow, H.‑V., & Thomas, L. C. (2015). Benchmarking State‑of‑the‑Art Classification Algorithms for Credit Scoring. European Journal of Operational Research, 247(1), 124–136.
  • Benchmark datasets and code: http://www.creditriskanalytics.net
  • Widely cited as a reference comparison study for modern credit scoring models.

Case 4 · Business News and Business Cycles

Bybee · Kelly · Manela · Xiu (2024 · Journal of Finance)

“The economic content of business news articles provides a real‑time,
data‑driven measure of macroeconomic conditions and expectations.”


Motivation & Research Question

  • Background
    Macroeconomic indicators (GDP growth, industrial production) are published infrequently and with lags.
    → Hard to observe the business cycle in real time.

  • Idea
    Financial and business news articles contain a wealth of information
    about current economic activity and sentiment.

  • Research Questions
    1. Can we extract quantitative signals from high‑dimensional news text data?
    2. Do these “news factors” predict macro and asset‑market fluctuations?


Data and Methodology

Data Construction

  • 1.3 million articles from the Wall Street Journal, Bloomberg, and Reuters (1980 – 2021).
  • Articles tagged as Business or Economy topics.
  • Monthly aggregation; bag‑of‑words representation with TF–IDF weighting.

Dimensionality Reduction

  • Apply Principal Component Analysis (PCA) to the document‑term matrix.
  • Build three orthogonal news factors:
    • F₁: General Economic Activity
    • F₂: Financial Conditions / Credit Stress
    • F₃: Uncertainty / Policy News

Interpretation

Each factor corresponds to an underlying latent economic narrative that summarizes contemporaneous news coverage.


Empirical Findings

| Category | Finding |
|------------|---------------|
| Macro tracking | The first news factor correlates ≈ 0.8 with industrial production growth and NBER recession dates. |
| Forecasting power | News factors predict future GDP growth, unemployment changes, and credit spreads 1–12 months ahead. |
| Asset pricing link | Expected returns and valuation ratios co‑move with news‑inferred cycle states. |
| Real‑time advantage | News data update daily → timely signals vs lagged macro statistics. |

Example:
Sharp declines in the “economic activity” factor precede official recessions by ≈ 3 months.


Economic Interpretation

  • Text as Macro Sensor
    News coverage aggregates dispersed information from firms, markets, and policymakers.
    → Functions as a real‑time proxy for latent business conditions.

  • Complement to Traditional Signals
    Combines qualitative insight with quantitative estimation.
    Yields improved forecasting and nowcasting.

  • Implications

    • Central banks and investors can use news‑based indices for timelier policy and risk decisions.
    • Bridges NLP methods with macroeconomic modeling.

Visualization — Text → Factors → Economy Loop


flowchart LR
    A["Business News Articles (WSJ/Reuters)"] --> B["Text Processing · TF–IDF Matrix"]
    B --> C["PCA → Latent News Factors F₁–F₃"]
    C --> D["Economic Cycle Tracking & Forecasting"]
    D --> E["Asset Markets · Policy Signals"]
    E --> A

The loop shows continuous feedback between media information, real economy, and financial markets.


Takeaways

  • High‑frequency news text contains rich economic signals.
  • Machine learning (PCA on the TF‑IDF matrix) constructs interpretable news factors.
  • News‑based indices track and lead the business cycle in real time.
  • Demonstrates how NLP bridges information flows between the press and macro‑finance.

Reference

Bybee, L., Kelly, B., Manela, A., & Xiu, D. (2024).
Business News and Business Cycles. Journal of Finance, 79(4), 1919 – 1978.

  • Data and code available at https://dachxiu.github.io/
  • Extends ML approaches to macro forecasting using large‑scale textual data.

Part 5 · Fundamentals of Machine Learning

What is Machine Learning?

“A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.”
Tom Mitchell (1997)


Learning Setup

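  • Dataset: $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{n}$
  • Model: $f_\theta : \mathcal{X} \to \mathcal{Y}$
  • Objective: minimize the empirical risk $\hat{R}(\theta) = \frac{1}{n} \sum_{i=1}^{n} L\big(y_i, f_\theta(x_i)\big)$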
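
As a minimal sketch of empirical risk minimization, the example below assumes a linear model and squared loss, neither of which is prescribed by the slide. The data are simulated, and `empirical_risk` is a helper defined here for illustration. Under these assumptions, ordinary least squares is exactly the parameter choice that minimizes the empirical risk.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated dataset {(x_i, y_i)}: y depends linearly on x plus noise.
n = 200
X = rng.normal(size=(n, 3))
true_theta = np.array([0.5, -1.0, 2.0])
y = X @ true_theta + rng.normal(scale=0.1, size=n)


def empirical_risk(theta, X, y):
    """Average squared loss of the linear model f_theta(x) = x @ theta."""
    residuals = y - X @ theta
    return float(np.mean(residuals ** 2))


# Least squares returns the theta that minimizes the empirical risk above.
theta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

print("risk at theta = 0 :", empirical_risk(np.zeros(3), X, y))
print("risk at theta_hat :", empirical_risk(theta_hat, X, y))
```

The fitted parameters achieve a lower empirical risk than even the true parameters, because they are tuned to this particular sample. That gap is the seed of overfitting.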


Bias–Variance Tradeoff

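  • Bias: the model is too simple and underfits
  • Variance: the model is too complex and is sensitive to noise
  • Balance: regularization + a validation set
  • Illustration: U-shaped error curve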
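
The U-shaped error curve can be reproduced with a small simulation. The setup is an assumption chosen for illustration: a sine-shaped true function, Gaussian noise, and polynomial models of increasing degree as the complexity knob.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)


def true_f(x):
    return np.sin(2 * np.pi * x)


# Small noisy training sample and a larger validation sample.
x_train = rng.uniform(0, 1, 30)
y_train = true_f(x_train) + rng.normal(scale=0.3, size=30)
x_val = rng.uniform(0, 1, 200)
y_val = true_f(x_val) + rng.normal(scale=0.3, size=200)

degrees = list(range(1, 10))  # model complexity
train_mse, val_mse = [], []
for d in degrees:
    p = Polynomial.fit(x_train, y_train, deg=d)  # least-squares polynomial fit
    train_mse.append(float(np.mean((p(x_train) - y_train) ** 2)))
    val_mse.append(float(np.mean((p(x_val) - y_val) ** 2)))

best_degree = degrees[int(np.argmin(val_mse))]
for d, tr, va in zip(degrees, train_mse, val_mse):
    print(f"degree {d}: train MSE = {tr:.3f}, validation MSE = {va:.3f}")
print("degree with lowest validation MSE:", best_degree)
```

Training error keeps falling as the degree grows, while validation error first falls (less bias) and eventually rises again (more variance). Model complexity is therefore chosen on the validation set rather than on the training set.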

Overfitting & Regularization

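| Method | Concept | Financial relevance |
|------------|---------------|------------------|
| LASSO | L1 penalty | Feature selection: factor sparsity |
| Ridge | L2 penalty | Robust estimation: guards against overfitting |
| Dropout | Randomly drops hidden units | Model generalization |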
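
The difference between the L1 and L2 penalties shows up clearly in a simulated "factor zoo". In this sketch only 3 of 20 candidate factors truly matter; the data-generating process and the penalty strengths (`alpha`) are arbitrary assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)

# Simulated returns driven by only 3 of 20 candidate factors.
n_obs, n_factors = 300, 20
X = rng.normal(size=(n_obs, n_factors))
beta = np.zeros(n_factors)
beta[:3] = [1.0, -0.5, 0.8]
y = X @ beta + rng.normal(scale=0.5, size=n_obs)

lasso = Lasso(alpha=0.1).fit(X, y)   # L1 penalty: can set coefficients exactly to zero
ridge = Ridge(alpha=10.0).fit(X, y)  # L2 penalty: shrinks coefficients toward zero

n_zero_lasso = int(np.sum(lasso.coef_ == 0))
n_zero_ridge = int(np.sum(ridge.coef_ == 0))

print("coefficients set exactly to zero by LASSO:", n_zero_lasso)
print("coefficients set exactly to zero by Ridge:", n_zero_ridge)
```

LASSO performs selection by zeroing out most irrelevant factors, whereas Ridge keeps every factor with a shrunken coefficient. The penalty strength would normally be chosen by cross-validation, for example with `LassoCV` or `RidgeCV`.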

Part 6 · Model Evaluation & Validation

Train / Validation / Test Split

  • The ultimate goal of learning is not to fit the training set but to predict new data
  • Special case in finance: time-series splitting

Evaluating ML Algorithm Performance: Preventing Overfitting in Supervised ML

  • Two common methods are used to reduce overfitting:

    • preventing the algorithm from getting too complex during selection and training (regularization)
    • proper data sampling, achieved by using cross-validation
  • K-fold cross-validation

    • data (excluding the test sample and fresh data) are shuffled randomly and then divided into $k$ equal sub-samples
    • $k-1$ sub-samples are used as training samples and one sub-sample is used as the validation sample
    • $k$ is typically set at 5 or 10
    • This process is repeated $k$ times, so that each sub-sample serves as the validation sample exactly once. The average of the $k$ validation errors is taken as a reasonable estimate of the model's out-of-sample error
  • Leave-one-out cross-validation: the special case $k = n$, where $n$ is the number of observations
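
The procedure above can be written as an explicit loop. This sketch uses the iris data and a linear SVM purely as placeholders, and `cv_error` is a helper defined here for illustration. It averages the validation error over the folds produced by scikit-learn's `KFold` and `LeaveOneOut` splitters.

```python
import numpy as np
from sklearn import datasets, svm
from sklearn.model_selection import KFold, LeaveOneOut

X, y = datasets.load_iris(return_X_y=True)


def cv_error(splitter, X, y):
    """Return (mean validation error, number of folds) for a given splitter."""
    errors = []
    for train_idx, val_idx in splitter.split(X):
        clf = svm.SVC(kernel="linear", C=1).fit(X[train_idx], y[train_idx])
        errors.append(1.0 - clf.score(X[val_idx], y[val_idx]))  # error = 1 - accuracy
    return float(np.mean(errors)), len(errors)


# k = 5 with random shuffling, as described for K-fold cross-validation.
kfold_err, n_kfold = cv_error(KFold(n_splits=5, shuffle=True, random_state=0), X, y)

# Leave-one-out: as many folds as observations.
loo_err, n_loo = cv_error(LeaveOneOut(), X, y)

print(f"5-fold CV error: {kfold_err:.3f} over {n_kfold} folds")
print(f"LOO CV error:    {loo_err:.3f} over {n_loo} folds")
```

Random shuffling is appropriate for cross-sectional data such as iris; for time-ordered financial data it would leak future information into the training folds.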


Cross-validation using scikit-learn: train_test_split

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import datasets
from sklearn import svm
X, y = datasets.load_iris(return_X_y=True)
X.shape, y.shape

((150, 4), (150,))

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0)
X_train.shape, y_train.shape

((90, 4), (90,))

X_test.shape, y_test.shape

((60, 4), (60,))

clf = svm.SVC(kernel='linear', C=1).fit(X_train, y_train)
clf.score(X_test, y_test)

Cross-validation using scikit-learn: computing cross-validated metrics

from sklearn.model_selection import cross_val_score
clf = svm.SVC(kernel='linear', C=1, random_state=42)
scores = cross_val_score(clf, X, y, cv=5)
scores

array([0.96..., 1. , 0.96..., 0.96..., 1. ])

print("%0.2f accuracy with a standard deviation of %0.2f" % (scores.mean(), scores.std()))

0.98 accuracy with a standard deviation of 0.02

from sklearn import metrics
scores = cross_val_score(
    clf, X, y, cv=5, scoring='f1_macro')
scores

array([0.96..., 1. ..., 0.96..., 0.96..., 1. ])


Time Series Split

from sklearn.model_selection import TimeSeriesSplit
X = np.array([[1, 2], [3, 4], [1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([1, 2, 3, 4, 5, 6])
tscv = TimeSeriesSplit(n_splits=3)
print(tscv)

TimeSeriesSplit(gap=0, max_train_size=None, n_splits=3, test_size=None)

for train, test in tscv.split(X):
    print("%s %s" % (train, test))

[0 1 2] [3]
[0 1 2 3] [4]
[0 1 2 3 4] [5]
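
In financial applications it is often useful to leave a buffer between the training and test windows, for example when labels are built from overlapping return horizons. The sketch below uses the `gap` and `test_size` parameters of `TimeSeriesSplit`; the twelve observations and the specific parameter values are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Twelve time-ordered observations (for example, monthly data), two features each.
X = np.arange(24).reshape(12, 2)

# gap=1 drops one observation between each training and test window;
# test_size=2 fixes the length of every test window.
tscv = TimeSeriesSplit(n_splits=3, test_size=2, gap=1)
splits = list(tscv.split(X))

for train_idx, test_idx in splits:
    print(f"train on t=0..{train_idx[-1]}, test on t={test_idx.tolist()}")
```

Each training window ends two steps before its test window starts, so the skipped observation never enters either set. Training windows still expand over time, which mimics how a model would be re-estimated as new data arrive.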


Evaluation Metrics

| Type | Metrics | Meaning |
|------------|---------------|------------------|
| Regression | MSE, MAE, R² | Fit quality |
| Classification | Accuracy, Precision, Recall, F1 | Discrimination |
| Ranking | AUC, ROC | Signal quality |
| Portfolio | Sharpe ratio | Risk-adjusted performance |
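
Each row of the table maps to a short computation. All numbers below are hypothetical toy inputs. The Sharpe ratio is annualized from monthly excess returns with a factor of √12, which is a common convention rather than the only one.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score, f1_score, mean_absolute_error, mean_squared_error,
    precision_score, r2_score, recall_score, roc_auc_score,
)

# Regression: realized vs. predicted values.
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.5, 2.0, 2.5, 4.0])
mse = mean_squared_error(y_true, y_pred)
mae = mean_absolute_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)

# Classification: true labels vs. hard predictions.
labels = np.array([1, 0, 1, 1, 0, 0])
preds = np.array([1, 0, 0, 1, 1, 0])
accuracy = accuracy_score(labels, preds)
precision = precision_score(labels, preds)
recall = recall_score(labels, preds)
f1 = f1_score(labels, preds)

# Ranking: AUC uses continuous scores instead of hard predictions.
scores = np.array([0.9, 0.2, 0.4, 0.8, 0.6, 0.1])
auc = roc_auc_score(labels, scores)

# Portfolio: annualized Sharpe ratio from monthly excess returns.
excess_returns = np.array([0.02, -0.01, 0.03, 0.01, 0.00, 0.01])
sharpe = excess_returns.mean() / excess_returns.std(ddof=1) * np.sqrt(12)

print(f"MSE={mse:.3f}  MAE={mae:.3f}  R2={r2:.3f}")
print(f"accuracy={accuracy:.3f}  precision={precision:.3f}  recall={recall:.3f}  F1={f1:.3f}")
print(f"AUC={auc:.3f}  Sharpe={sharpe:.3f}")
```

A model can score well on one row and poorly on another: a classifier with high accuracy may still rank poorly, and a forecast with a respectable R² may not translate into an attractive Sharpe ratio after risk is taken into account.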

"No Free Lunch" Theorem

All models are wrong, but some models are useful.

--- George Box

  • No free lunch theorem: no single model works best across all kinds of problems
  • Pick a suitable model
    • based on domain knowledge
    • by trial and error
      • cross-validation
      • Bayesian model selection techniques
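
Model selection by cross-validation can be sketched as a loop over candidate models. The three candidates, their settings, and the iris data are placeholders chosen for illustration; in practice the candidate set would come from domain knowledge about the problem.

```python
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = datasets.load_iris(return_X_y=True)

# Candidate models, each wrapped with feature scaling.
candidates = {
    "logit": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "knn": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
    "svm_rbf": make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1)),
}

# Mean 5-fold cross-validated accuracy for every candidate.
cv_scores = {
    name: float(cross_val_score(model, X, y, cv=5).mean())
    for name, model in candidates.items()
}
best_name = max(cv_scores, key=cv_scores.get)

for name, score in cv_scores.items():
    print(f"{name:8s} mean CV accuracy = {score:.3f}")
print("selected model:", best_name)
```

The winner on one dataset need not win on another, which is the practical content of the no free lunch theorem. The selected model should still be checked once on a held-out test set, since the cross-validated score of the best candidate is optimistically biased.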

Part 7 · From ML to Course Framework

Course Learning Map

Shallow → Deep → Reinforcement → Big Data → Finance → Causal → LLMs & Agents
  • Difficulty and level of abstraction increase step by step
  • The application layer moves from prediction → explanation → agents

Learning Objectives

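| Dimension | Learning objective |
|------------|---------------|
| Technical | Master common ML algorithms and their evaluation |
| Finance | Build models grounded in financial data settings |
| Research | Understand the economic meaning behind the methods |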

8-Lecture Overview

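| Lecture | Theme | Focus |
|------------|---------------|------------------|
| 1 | Introduction | Background and framework |
| 2 | Shallow Learning | Classical statistical learning |
| 3 | Deep Learning | Representation learning |
| 4 | Reinforcement Learning | Decision modeling |
| 5 | Big Data | Text and image analysis |
| 6 | Asset Pricing & ML | Empirical and quantitative applications |
| 7 | Causal Inference | Causality and interpretation |
| 8 | LLMs & Agents | Agents and the AI frontier |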

Part 8 · Outlook & Readings

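  • Large language models (LLMs) and financial agents
  • Generative AI: from models to intelligent decision systems
  • Emphasis on interpretability and ethical transparency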
Financial Machine Learning · Lecture 1
  • Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani, Jonathan Taylor. An Introduction to Statistical Learning with Applications in Python[M]. Springer Cham, 2023.
  • Murphy K P. Probabilistic machine learning: an introduction[M]. MIT press, 2022.
  • Gaillac C, L'Hour J. Machine Learning for Econometrics[M]. Oxford University Press, 2025.
  • Dixon M F, Halperin I, Bilokon P. Machine learning in Finance[M]. Springer International Publishing, 2020.
  • Nagel S. Machine learning in asset pricing[M]. Princeton University Press, 2021.
  • de Prado M M L. Machine learning for asset managers[M]. Cambridge University Press, 2020.
  • Ashwin Rao, Tikhon Jelvis. Foundations of Reinforcement Learning with Applications in Finance[M]. Stanford University, 2022.
  • Denev A, Amen S. The Book of Alternative Data: A Guide for Investors, Traders and Risk Managers[M]. John Wiley & Sons, 2020.

Final Takeaways

  • ML is a paradigm shift in financial research
  • “Prediction” and “explanation” can be compatible and advance together
  • Next lecture: Shallow Learning Algorithms
    → Regression, Classification, Clustering, and Dimension Reduction in Finance

### [Breiman’s Two Cultures](https://projecteuclid.org/journals/statistical-science/volume-16/issue-3/Statistical-Modeling--The-Two-Cultures-with-comments-and-a/10.1214/ss/1009213726.full): Data Modeling vs. Algorithmic Modeling

| Dimension | Econometrics | Machine Learning |
|------------|---------------|------------------|
| Focus | Parameter inference | Prediction accuracy |
| Model | Structural assumptions | Algorithmic optimization |
| Evaluation | Significance tests | Out-of-sample performance |
| Data | Small, structured | Large, high-dimensional |