Lecture 02

Shallow Learning Algorithms

“Explore how shallow learning algorithms provide the foundation for predictive and data‑driven finance.”

Financial Machine Learning · Lecture 2

Outline

Financial Machine Learning · Lecture 2

Part 1 · Regression Models

Motivation

  • Most prediction problems in finance (e.g., asset returns, volatility, risk factors) involve continuous targets.
  • Moving from linear regression to regularized models (Ridge / LASSO) improves robustness and interpretability.
  • With high‑dimensional factor sets, machine learning delivers more stable out‑of‑sample forecasts.

Regression as a tool for continuous prediction and signal extraction in finance.

Financial Machine Learning · Lecture 2

Linear Regression Review

  • Objective: predict a continuous output $y$ from features $x = (x_1, \dots, x_p)$

  • Ordinary Least Squares (OLS): $\hat\beta = \arg\min_{\beta} \sum_{i=1}^{n} \big(y_i - \beta_0 - x_i^\top \beta\big)^2$

  • Assumptions: linearity, i.i.d. errors, no multicollinearity

  • OLS is simple but overfits when $p$ is large relative to $n$ or when features are correlated

  • Property: BLUE (best linear unbiased estimator, by the Gauss–Markov theorem)

  • Extension: weighted least squares

Financial application examples:

  • Cross‑sectional stock return prediction (Gu et al., 2020)
  • Explaining risk premia with macroeconomic variables
Financial Machine Learning · Lecture 2

Regularization: Ridge & LASSO

Method | Penalty | Typical Use
Ridge Regression | $\lambda \sum_j \beta_j^2$ | Shrinks coefficients, handles multicollinearity
LASSO | $\lambda \sum_j |\beta_j|$ | Variable selection → sparse model
Elastic Net | $\lambda_1 \sum_j |\beta_j| + \lambda_2 \sum_j \beta_j^2$ | Combines both advantages
Financial Machine Learning · Lecture 2

Post‑LASSO & Two‑Step Estimation

  • Core Idea: Use LASSO for variable selection → refit OLS only on the selected features.
  • Procedure:
    1. Step 1: Run LASSO and define the support set $\hat S = \{\, j : \hat\beta_j^{\text{LASSO}} \neq 0 \,\}$

    2. Step 2: Estimate OLS restricted to the selected variables $x_{\hat S}$

  • Intuition:
    • LASSO = shrink + select
    • Post‑LASSO = remove shrinkage bias by OLS refit
    • If the support $\hat S$ recovers the true set $S_0$ → asymptotically equivalent to oracle OLS.
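A minimal two‑step sketch with scikit‑learn, assuming standardized features in X and a target y (both placeholders):

import numpy as np
from sklearn.linear_model import LassoCV, LinearRegression

# Step 1: LASSO with a cross-validated penalty defines the support set S_hat
lasso = LassoCV(cv=5).fit(X, y)
support = np.flatnonzero(lasso.coef_)          # indices of selected variables

# Step 2: refit OLS on the selected columns only (removes shrinkage bias)
post_lasso = LinearRegression().fit(X[:, support], y)
print(support, post_lasso.coef_)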
Financial Machine Learning · Lecture 2

LASSO vs. Post‑LASSO (OLS after Selection)

Aspect | LASSO | Post‑LASSO OLS
Penalty | $\ell_1$ shrinkage | None
Bias | Shrinks toward 0 → systematic bias | Essentially unbiased (if the selected set is correct)
Variance | Lower | Higher
Prediction error | Bias–variance trade‑off | Closer to oracle performance (depends on $\hat S$)
Interpretation | Coefficients shrunk, harder to interpret | OLS coefficients admit direct economic interpretation
Use case | Very many features, prediction‑oriented | Focus on interpretation or statistical inference

Empirical Finance Applications

  • Asset Pricing: LASSO screens predictive factors → post‑selection OLS estimates the SDF
    Gu, Kelly & Xiu (2020, RFS)
  • Treatment Effect / Causal Inference:
    High‑dim controls + OLS after LASSO selection
    Belloni, Chernozhukov & Hansen (2014, Restud)
  • Cross‑Sectional Predictability:
    Panel‑wise Post‑LASSO
    Kelly, Pruitt & Su (2019, RFS)

Key Takeaway:

“Use Machine Learning for selection, Econometrics for estimation.”

Financial Machine Learning · Lecture 2

Structured Regularization · Group LASSO

Motivation:

  • In many financial datasets, variables are naturally grouped (e.g., industry sectors, factor families, macro categories).
  • Traditional LASSO treats each feature individually, which can break this structure.
  • ➜ Group LASSO encourages sparsity at the group level.

Formulation

$\min_{\beta}\ \tfrac{1}{2}\,\lVert y - X\beta \rVert_2^2 + \lambda \sum_{g=1}^{G} w_g \,\lVert \beta_g \rVert_2$

where $\beta_g$ collects the coefficients within group $g$, and $w_g$ is a group weight (often $\sqrt{|g|}$).

Key Idea:

  • If the group norm $\lVert \hat\beta_g \rVert_2$ is shrunk to zero → the entire group is excluded.
  • Produces block‑sparse solutions → selects whole economic themes or industry factors.
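Scikit‑learn has no built‑in Group LASSO, so the sketch below implements the formulation above with a simple proximal‑gradient (block soft‑thresholding) loop in NumPy; the function name, step‑size rule, and $\sqrt{|g|}$ weights are illustrative assumptions.

import numpy as np

def group_lasso(X, y, groups, lam=0.1, n_iter=500):
    """Proximal gradient (ISTA) for (1/2n)||y - Xb||^2 + lam * sum_g w_g ||b_g||_2."""
    n, p = X.shape
    beta = np.zeros(p)
    lr = n / (np.linalg.norm(X, 2) ** 2)             # step size from the Lipschitz constant
    for _ in range(n_iter):
        z = beta - lr * (X.T @ (X @ beta - y) / n)   # gradient step on the squared loss
        for g in groups:                             # block soft-thresholding per group
            w_g = np.sqrt(len(g))
            norm_g = np.linalg.norm(z[g])
            z[g] = 0.0 if norm_g == 0 else max(0.0, 1 - lr * lam * w_g / norm_g) * z[g]
        beta = z
    return beta

# usage: groups = [np.arange(0, 3), np.arange(3, 7)]; beta_hat = group_lasso(X, y, groups)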
Financial Machine Learning · Lecture 2

Comparison with Standard LASSO

Aspect | LASSO | Group LASSO
Sparsity level | Variable‑wise | Group‑wise
Penalty | $\lambda \sum_j |\beta_j|$ | $\lambda \sum_g w_g \lVert \beta_g \rVert_2$
Structure information | Ignored | Incorporated
Interpretability | Individual predictors | Economic factor blocks
Typical use | Generic feature selection | Multi‑factor or hierarchical models

Empirical Finance Applications

  • Grouped Factor Selection: value, momentum, profitability themes.
  • Industry or sector relevance: select entire sectors with common drivers.
  • Panel data hierarchies: allow shared loadings within firm or time groups.
  • Interaction terms / polynomial basis: retain entire interaction bundles.

Benefits: Higher stability, better economic interpretability, improved prediction under grouped designs.

Financial Machine Learning · Lecture 2

Ridge · LASSO · Elastic Net Feature and Use Case Comparison


Aspect | Ridge | LASSO | Elastic Net
Variable Selection | No | Yes | Group‑wise
Handles Collinearity | Strong | May drop one of correlated vars | Good compromise
Bias vs Variance | Low variance, high bias | Higher variance, low bias for selected vars | Balanced
Parameter Count | All kept | Few non‑zero | Moderate
Interpretation | Shrunk coefficients | Sparse interpretation | Group selection
Finance Use Case | Yield curve, risk premia with collinearity | High‑dim. feature screening | Thematic factor groups
Financial Machine Learning · Lecture 2

Model Training Pipeline

  1. Standardize features to unit variance
      (Regularization is scale‑sensitive)
  2. Cross‑validation for the penalty strength $\lambda$ and the mixing weight (l1_ratio)
    • LassoCV · RidgeCV · ElasticNetCV
    • Use K‑fold CV (or TimeSeriesSplit for panel data)
  3. Model evaluation via OOS MSE / R² / feature support
  4. (Optional) Post‑LASSO OLS for unbiased estimation
from sklearn.linear_model import LassoCV, RidgeCV, ElasticNetCV

# Example
model = ElasticNetCV(l1_ratio=[.1, .5, .9, 1],
                     alphas=None, cv=5)
model.fit(X_train, y_train)
best_alpha = model.alpha_
best_ratio = model.l1_ratio_
Financial Machine Learning · Lecture 2

Typical Financial Applications and Key Takeaways

  • Typical Financial Applications

    • Cross‑Sectional Return Forecasting: LASSO / Elastic Net for factor screening
    • Yield Curve / Term Structure: Ridge for strong collinearity in maturities
    • Grouped Factors (Value/Momentum/Profitability): Elastic Net or Group LASSO
    • Credit Risk Modeling: Elastic Net for stable prediction and interpretability
  • Key Takeaways

    • Ridge: keep all → stability
    • LASSO: sparsity → interpretability
    • Elastic Net: best of both worlds
    • Model selection by CV → balance bias‑variance
    • For inference → use Post‑LASSO OLS
Financial Machine Learning · Lecture 2

Nonlinear Regression: Polynomial Regression



  • the model: $y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \cdots + \beta_d x_i^d + \epsilon_i$
  • nonlinear in the predictor $x$
  • still linear in the parameters (the $\beta$s)
  • similar to multi-linear regression
  • the polynomial function imposes global structure
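A short sketch with scikit‑learn: expand the predictor into a degree‑3 polynomial basis and fit by OLS (the data here are simulated placeholders).

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

x = np.linspace(0, 10, 200).reshape(-1, 1)            # single predictor
y = np.sin(x).ravel() + 0.3 * np.random.randn(200)    # nonlinear target

# basis (x, x^2, x^3), then OLS: nonlinear in x, linear in the coefficients
poly_reg = make_pipeline(PolynomialFeatures(degree=3), LinearRegression())
poly_reg.fit(x, y)
y_hat = poly_reg.predict(x)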
Financial Machine Learning · Lecture 2

Nonlinear Regression: Step functions

  • no global structure
  • break the range of $X$ into bins

  • fit a different constant in each bin

  • unless there are natural breakpoints in the predictors, piecewise-constant functions can miss the action
Financial Machine Learning · Lecture 2

Nonlinear Regression: Basis functions

  • Polynomial and piecewise-constant regression models are special cases of a basis function approach.
  • Basis function: a family of functions or transformations that can be applied to a variable $X$: $b_1(X), b_2(X), \dots, b_K(X)$.
  • The model: $y_i = \beta_0 + \beta_1 b_1(x_i) + \beta_2 b_2(x_i) + \cdots + \beta_K b_K(x_i) + \epsilon_i$

  • some examples of basis functions
    • polynomial function
    • piecewise constant function
    • wavelet
    • Fourier series
    • splines
Financial Machine Learning · Lecture 2

Nonlinear Regression: Regression splines

Piecewise Polynomials

  • fitting separate low-degree polynomials over different regions of $X$.
  • example: piecewise cubic polynomial with a single knot at a point $c$:
    $y_i = \begin{cases} \beta_{01} + \beta_{11} x_i + \beta_{21} x_i^2 + \beta_{31} x_i^3 + \epsilon_i & \text{if } x_i < c \\ \beta_{02} + \beta_{12} x_i + \beta_{22} x_i^2 + \beta_{32} x_i^3 + \epsilon_i & \text{if } x_i \ge c \end{cases}$

    • degrees of freedom: 8 (two cubic polynomials with four coefficients each)
    • Using more knots leads to a more flexible piecewise polynomial

Constraints and Splines

  • piecewise cubic: no constraint
  • continuous piecewise cubic: continuity of $f$
  • cubic spline: continuity of $f$, $f'$, and $f''$
  • degree-$d$ spline: a piecewise degree-$d$ polynomial, with continuity in derivatives up to degree $d-1$ at each knot
Financial Machine Learning · Lecture 2

The Spline Basis Representation

  • regression spline: $y_i = \beta_0 + \beta_1 b_1(x_i) + \beta_2 b_2(x_i) + \cdots + \beta_{K+3} b_{K+3}(x_i) + \epsilon_i$ (a cubic spline with $K$ knots)

  • the spline basis
    • polynomial basis: $x$, $x^2$, and $x^3$
    • truncated power basis for each knot $\xi$: $h(x, \xi) = (x - \xi)_+^3$

  • splines can have high variance at the outer range of the predictors
  • natural spline
    • a regression spline with additional boundary constraints
    • the function is required to be linear at the boundary
    • natural splines generally produce more stable estimates at the boundaries
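A sketch of a spline-basis regression with scikit‑learn's SplineTransformer (a B‑spline basis; with extrapolation="linear" the fit is linear beyond the boundary knots, in the spirit of a natural spline). The x, y data are the placeholders from the polynomial example above.

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import SplineTransformer
from sklearn.linear_model import LinearRegression

# cubic B-spline basis with 6 uniformly placed knots, linear beyond the boundaries
basis = SplineTransformer(degree=3, n_knots=6, extrapolation="linear")
spline_reg = make_pipeline(basis, LinearRegression())
spline_reg.fit(x, y)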
Financial Machine Learning · Lecture 2

Choosing the Number and Locations of the Knots

  • locations of knots (given the number fixed)

    • place more knots where the function might vary most rapidly,
    • and fewer knots where it seems more stable
    • in practice: place knots in a uniform fashion
      • specify the desired degrees of freedom
      • software automatically place knots
  • number of knots

    • try out different numbers of knots
    • cross-validation
Financial Machine Learning · Lecture 2

Comparison to Polynomial Regression

  • natural cubic spline with 15 degrees of freedom vs. degree-15 polynomial
  • natural cubic spline works better on boundaries
  • in general, natural cubic spline produces more stable estimates
Financial Machine Learning · Lecture 2

Smoothing splines

  • fitting a curve $g(x)$: we want the RSS $\sum_{i=1}^n \big(y_i - g(x_i)\big)^2$ to be small.
  • $g$ should be a smooth function (WHY? & HOW?)
  • a smoothing spline minimizes the following objective: $\sum_{i=1}^n \big(y_i - g(x_i)\big)^2 + \lambda \int g''(t)^2\, dt$

  • the smoothing spline is a natural cubic spline with knots at $x_1, \dots, x_n$
    • piecewise cubic polynomial with knots at the unique values of $x_1, \dots, x_n$
    • continuous first and second derivatives at each knot
    • linear in the region outside of the extreme knots
    • it is a shrunken version of such a natural cubic spline
Financial Machine Learning · Lecture 2

Nonlinear Regression: Local regression

computing the fit at a target point using only the nearby training observations

Financial Machine Learning · Lecture 2

Local linear regression

Algorithm: Local Regression at $X = x_0$
1. Gather the fraction $s = k/n$ of training points whose $x_i$ are closest to $x_0$.
2. Assign a weight $K_{i0} = K(x_i, x_0)$ to each point in this neighborhood, so that the point furthest from $x_0$ has weight zero and the closest has the highest weight. All but these $k$ nearest neighbors get weight zero.
3. Fit a weighted least squares regression of the $y_i$ on the $x_i$ using the aforementioned weights, by finding $\hat\beta_0$ and $\hat\beta_1$ that minimize $\sum_{i=1}^{n} K_{i0}\,(y_i - \beta_0 - \beta_1 x_i)^2$.
4. The fitted value at $x_0$ is given by $\hat f(x_0) = \hat\beta_0 + \hat\beta_1 x_0$.
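A minimal local‑regression sketch using the LOWESS smoother from statsmodels, where frac plays the role of the span $s$ (the fraction of nearest neighbors used at each target point); x and y are placeholder arrays.

from statsmodels.nonparametric.smoothers_lowess import lowess

# locally weighted linear regression: returns (x, fitted value) pairs sorted by x
fitted = lowess(y, x.ravel(), frac=0.3)
x_grid, y_smooth = fitted[:, 0], fitted[:, 1]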
Financial Machine Learning · Lecture 2

Nonlinear Regression: Generalized additive models

  • the multiple linear regression model: $y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip} + \epsilon_i$

  • GAM: $y_i = \beta_0 + f_1(x_{i1}) + f_2(x_{i2}) + \cdots + f_p(x_{ip}) + \epsilon_i$

  • Example: $\text{wage} = \beta_0 + f_1(\text{year}) + f_2(\text{age}) + f_3(\text{education}) + \epsilon$

    • year and age are quantitative variables
    • education is qualitative with five levels: <HS, HS, <Coll, Coll, >Coll
Financial Machine Learning · Lecture 2

[Figure: GAM component functions fitted with natural splines]

[Figure: GAM component functions fitted with smoothing splines]

Financial Machine Learning · Lecture 2

Pros and Cons of GAMs

  • Pros
    • GAMs automatically model non-linear relationships that standard linear regression will miss.

    • The non-linear fits can potentially make more accurate predictions for the response $Y$.

    • We can examine the effect of each $X_j$ on $Y$ individually while holding all of the other variables fixed.

    • The smoothness of the function $f_j$ for the variable $X_j$ can be summarized via degrees of freedom.

  • Cons: the model is restricted to be additive.
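A sketch of the wage GAM using the pygam package (assumed installed); the column ordering — 0 = year, 1 = age, 2 = education coded 0–4 — is an illustrative assumption.

from pygam import LinearGAM, s, f

# smooth terms for the quantitative variables, a factor term for education
gam = LinearGAM(s(0) + s(1) + f(2)).fit(X, y)
gam.summary()          # effective degrees of freedom per term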
Financial Machine Learning · Lecture 2

Evaluation of Regression Models

Metric | Formula | Interpretation
MSE | $\frac{1}{n}\sum_{i}(y_i - \hat y_i)^2$ | Overall fit
MAE | $\frac{1}{n}\sum_{i}|y_i - \hat y_i|$ | Robust to outliers
R² | $1 - \text{SSE}/\text{TSS}$ | Fraction of variance explained
CV | K‑fold | Assess OOS performance

Key Idea:
Always evaluate on out‑of‑sample or hold‑out set to avoid spurious fit.

Financial Machine Learning · Lecture 2

Part 2 · Classification Algorithms

Motivation

  • Many financial decisions are discrete: default / no default, up / down, risky / safe.
  • We need to estimate class probabilities and assess performance on imbalanced samples (e.g., credit scoring, fraud detection).
  • From logit to SVM to ensemble methods, these tools sharpen the accuracy and interpretability of financial classification.

Classification as a foundation for binary decision‑making under risk and uncertainty.

Financial Machine Learning · Lecture 2

Classification Problem

  • Predict a discrete label $y \in \{0, 1\}$ (or multi‑class).

    • The classifier estimates $\Pr(Y = k \mid X = x)$.
  • Examples in finance:

    • Credit default (default / no default)
    • Fraud detection (fraud yes / no)
    • Market return direction (up / down)
  • Regression is not appropriate for classification tasks

    • regression methods cannot accommodate a qualitative response with more than two classes
    • regression methods cannot provide estimates of the conditional probability of the response
Financial Machine Learning · Lecture 2

Example: The default dataset



default student balance income
1 No No 729.5264952 44361.62507
2 No Yes 817.1804066 12106.1347
3 No No 1073.549164 31767.13895
4 No No 529.2506047 35704.49394
5 No No 785.6558829 38463.49588

source: ISLP

Financial Machine Learning · Lecture 2

The logistic model

  • The probability of default: $p(X) = \Pr(\text{default} = \text{Yes} \mid \text{balance})$

  • Linear regression: $p(X) = \beta_0 + \beta_1 X$ (can produce fitted probabilities below 0 or above 1)

  • logistic function: $p(X) = \dfrac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}}$
source: ISLP

Financial Machine Learning · Lecture 2

Multiple logistic regression



  • the model of the odds: $\dfrac{p(X)}{1 - p(X)} = e^{\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p}$

  • the model of $p(X)$: $p(X) = \dfrac{e^{\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p}}{1 + e^{\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p}}$

 | Coefficient | Std. error | z-statistic | p-value
Intercept | −10.8690 | 0.4923 | −22.08 | <0.0001
balance | 0.0057 | 0.0002 | 24.74 | <0.0001
income | 0.0030 | 0.0082 | 0.37 | 0.7115
student[Yes] | −0.6468 | 0.2362 | −2.74 | 0.0062
Financial Machine Learning · Lecture 2

prediction

  • A student with a credit card balance of $1,500 and an income of $40,000 (income measured in thousands) has an estimated probability of default of
    $\hat p(X) = \dfrac{e^{\hat\beta_0 + \hat\beta_1 \cdot 1500 + \hat\beta_2 \cdot 40 + \hat\beta_3 \cdot 1}}{1 + e^{\hat\beta_0 + \hat\beta_1 \cdot 1500 + \hat\beta_2 \cdot 40 + \hat\beta_3 \cdot 1}} \approx 0.058$

  • A non-student with the same balance and income has an estimated probability of default of
    $\hat p(X) = \dfrac{e^{\hat\beta_0 + \hat\beta_1 \cdot 1500 + \hat\beta_2 \cdot 40}}{1 + e^{\hat\beta_0 + \hat\beta_1 \cdot 1500 + \hat\beta_2 \cdot 40}} \approx 0.105$
Financial Machine Learning · Lecture 2

Multinomial logistic regression

  • classify a response variable that has more than two classes

  • the model

    • for $k = K$ (the baseline class): $\Pr(Y = K \mid X = x) = \dfrac{1}{1 + \sum_{l=1}^{K-1} e^{\beta_{l0} + \beta_{l1}x_1 + \cdots + \beta_{lp}x_p}}$

    • for $k = 1, \dots, K-1$: $\Pr(Y = k \mid X = x) = \dfrac{e^{\beta_{k0} + \beta_{k1}x_1 + \cdots + \beta_{kp}x_p}}{1 + \sum_{l=1}^{K-1} e^{\beta_{l0} + \beta_{l1}x_1 + \cdots + \beta_{lp}x_p}}$

  • the log odds (for $k = 1, \dots, K-1$): $\log \dfrac{\Pr(Y = k \mid X = x)}{\Pr(Y = K \mid X = x)} = \beta_{k0} + \beta_{k1}x_1 + \cdots + \beta_{kp}x_p$

Financial Machine Learning · Lecture 2

Generative models for classification

  • The big idea of generative models for classification

    • model the distribution of the predictors $X$ separately in each of the response classes
    • use Bayes' theorem to flip these around into estimates for $\Pr(Y = k \mid X = x)$
  • Why do we need the generative models for classification

    • When there is substantial separation between the two classes, the parameter estimates for the logistic regression model are surprisingly unstable
    • If the distribution of the predictors is approximately normal in each of the classes and the sample size is small, then the approaches in this section may be more accurate than logistic regression
    • The methods in this section can be naturally extended to the case of more than two response classes
Financial Machine Learning · Lecture 2
  • Suppose the qualitative response variable $Y$ can take on $K$ possible distinct and unordered values.
  • Let $\pi_k$ represent the overall or prior probability that a randomly chosen observation comes from the $k$th class.
  • Let $f_k(x)$ denote the density function of $X$ for an observation that comes from the $k$th class.
  • the posterior probability (Bayes' theorem):

    $\Pr(Y = k \mid X = x) = \dfrac{\pi_k f_k(x)}{\sum_{l=1}^{K} \pi_l f_l(x)}$

  • estimation
    • instead of directly computing the posterior probability $p_k(x)$, we can simply plug in estimates of $\pi_k$ and $f_k(x)$
    • $\hat\pi_k$: we simply compute the fraction of the training observations that belong to the $k$th class
    • $\hat f_k(x)$: much more challenging to estimate
Financial Machine Learning · Lecture 2

Linear discriminant analysis (LDA) for $p = 1$

  • assumptions
    • only one predictor: $p = 1$
    • $f_k(x)$ is normal / Gaussian: $X \mid Y = k \sim N(\mu_k, \sigma^2)$, with a variance $\sigma^2$ shared across classes
  • the posterior

    $p_k(x) = \dfrac{\pi_k \frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\big(-\frac{1}{2\sigma^2}(x - \mu_k)^2\big)}{\sum_{l=1}^{K} \pi_l \frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\big(-\frac{1}{2\sigma^2}(x - \mu_l)^2\big)}$

  • prediction
    • classify an observation to the class for which $p_k(x)$ is greatest
    • an equivalent rule: assign the observation to the class for which the discriminant score $\delta_k(x) = x\cdot\dfrac{\mu_k}{\sigma^2} - \dfrac{\mu_k^2}{2\sigma^2} + \log\pi_k$ is maximized
Financial Machine Learning · Lecture 2

An Example

  • $K = 2$ and $\pi_1 = \pi_2$
  • assign an observation to class 1 if $2x(\mu_1 - \mu_2) > \mu_1^2 - \mu_2^2$, and to class 2 otherwise
  • The Bayes decision boundary: $x = \dfrac{\mu_1^2 - \mu_2^2}{2(\mu_1 - \mu_2)} = \dfrac{\mu_1 + \mu_2}{2}$

Financial Machine Learning · Lecture 2

LDA for $p > 1$

The multivariate Gaussian distribution

  • joint density: $f(x) = \dfrac{1}{(2\pi)^{p/2} |\boldsymbol{\Sigma}|^{1/2}} \exp\!\Big(-\tfrac{1}{2}(x - \mu)^\top \boldsymbol{\Sigma}^{-1} (x - \mu)\Big)$

  [Figure: two bivariate Gaussian densities — left: uncorrelated predictors with equal variances; right: correlated predictors / unequal variances]
Financial Machine Learning · Lecture 2
  • the observations in the $k$th class are drawn from a multivariate Gaussian distribution $N(\mu_k, \boldsymbol{\Sigma})$
    • $\mu_k$ is a class-specific mean vector
    • $\boldsymbol{\Sigma}$ is a covariance matrix that is common to all classes
  • the Bayes classifier assigns an observation $X = x$ to the class for which the following is maximized:

    $\delta_k(x) = x^\top \boldsymbol{\Sigma}^{-1} \mu_k - \tfrac{1}{2}\mu_k^\top \boldsymbol{\Sigma}^{-1} \mu_k + \log\pi_k$

  • The Bayes decision boundary between classes $k$ and $l$ solves $\delta_k(x) = \delta_l(x)$
Financial Machine Learning · Lecture 2

Quadratic discriminant analysis (QDA)

  • each class has its own covariance matrix: $X \mid Y = k \sim N(\mu_k, \boldsymbol{\Sigma}_k)$

  • the Bayes classifier assigns an observation to the class for which the following is maximized:

    $\delta_k(x) = -\tfrac{1}{2}(x - \mu_k)^\top \boldsymbol{\Sigma}_k^{-1}(x - \mu_k) - \tfrac{1}{2}\log|\boldsymbol{\Sigma}_k| + \log\pi_k$
Financial Machine Learning · Lecture 2

Naive Bayes

  • Assumption: Within the $k$th class, the $p$ predictors are independent: $f_k(x) = f_{k1}(x_1) \times f_{k2}(x_2) \times \cdots \times f_{kp}(x_p)$

  • the posterior probability: $\Pr(Y = k \mid X = x) = \dfrac{\pi_k \prod_{j=1}^{p} f_{kj}(x_j)}{\sum_{l=1}^{K} \pi_l \prod_{j=1}^{p} f_{lj}(x_j)}$

  • Estimating the one-dimensional density function $f_{kj}$ using training data

    • If $X_j$ is quantitative, then we can assume that $X_j \mid Y = k \sim N(\mu_{jk}, \sigma_{jk}^2)$
    • If $X_j$ is quantitative, we can instead use a non-parametric estimate for $f_{kj}$
      • making a histogram for the observations of the $j$th predictor within each class
      • kernel density estimator
    • If $X_j$ is qualitative, count the proportion of training observations for the $j$th predictor corresponding to each class
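A minimal sketch comparing the three generative classifiers in scikit‑learn; X_train, y_train, X_test, y_test are placeholder arrays with a categorical label.

from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                           QuadraticDiscriminantAnalysis)
from sklearn.naive_bayes import GaussianNB

models = {
    "LDA": LinearDiscriminantAnalysis(),     # common covariance -> linear boundary
    "QDA": QuadraticDiscriminantAnalysis(),  # class-specific covariances -> quadratic boundary
    "NB": GaussianNB(),                      # independent Gaussian features within each class
}
for name, clf in models.items():
    clf.fit(X_train, y_train)
    print(name, clf.score(X_test, y_test))   # accuracy; clf.predict_proba gives posteriors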
Financial Machine Learning · Lecture 2

Generalized additive models


Model the log odds as a generalized additive model:

$\log\!\left(\dfrac{p(X)}{1 - p(X)}\right) = \beta_0 + f_1(X_1) + f_2(X_2) + \cdots + f_p(X_p)$
Financial Machine Learning · Lecture 2

Support Vector Machine


  • developed in the 1990s
  • perform well in a variety of settings
  • often considered one of the best "out of the box" classifiers.
Financial Machine Learning · Lecture 2

Support Vector Machine: Maximal Margin Classifier

Hyperplane

  • In a $p$-dimensional space: a flat affine subspace of dimension $p - 1$

    • In 2-d: a line
    • In 3-d: a plane
    • In $p$-d ($p > 3$): hard to visualize
  • The mathematical definition (for the $p$-d setting): $\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p = 0$

  • the 2-d example: $\beta_0 + \beta_1 X_1 + \beta_2 X_2 = 0$
Financial Machine Learning · Lecture 2

Classification Using a Separating Hyperplane

  • Separating hyperplane: $\beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip} > 0$ if $y_i = 1$, and $< 0$ if $y_i = -1$

  • a property: $y_i\,(\beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip}) > 0$ for all $i = 1, \dots, n$

  • we classify the test observation $x^*$ based on the sign of $f(x^*) = \beta_0 + \beta_1 x_1^* + \cdots + \beta_p x_p^*$

    • $f(x^*) > 0$: class 1
    • $f(x^*) < 0$: class −1
  • the magnitude of $f(x^*)$ measures how far $x^*$ lies from the hyperplane (confidence in the assignment)
Financial Machine Learning · Lecture 2

The Maximal Margin Classifier

  • margin: the smallest (perpendicular) distance from the training observations to the separating hyperplane
  • maximal margin hyperplane (a.k.a. the optimal separating hyperplane): the separating hyperplane with the largest margin
  • Construction of the Maximal Margin Classifier:

    $\max_{\beta_0, \dots, \beta_p,\, M}\ M \quad \text{subject to } \sum_{j=1}^{p}\beta_j^2 = 1,\ \ y_i\,(\beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip}) \ge M \ \ \forall i$
Financial Machine Learning · Lecture 2

The Non-separable Case & Noisy Data

  • sometimes the data are non-separable
  • sometimes the maximal margin classifier is very sensitive to noisy data
Financial Machine Learning · Lecture 2

Support Vector Classifiers

  • $\epsilon_i = 0$: the $i$-th obs is on the correct side of the margin
  • $\epsilon_i > 0$: the $i$-th obs is on the wrong side of the margin
  • $\epsilon_i > 1$: the $i$-th obs is on the wrong side of the hyperplane
Financial Machine Learning · Lecture 2

Parameter $C$

  • $C$ is the budget for the amount that the margin can be violated by the observations
    • $C = 0$: no budget for violations of the margin
    • $C > 0$: no more than $C$ observations can be on the wrong side of the hyperplane
    • as $C$ increases: the margin will widen
  • $C$ controls the bias–variance trade-off
    • small $C$: low bias, high variance
    • big $C$: high bias, low variance
    • selected via CV
  • An observation: only observations that either lie on the margin or that violate the margin (the support vectors) will affect the hyperplane
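A sketch of a support vector classifier in scikit‑learn; note that sklearn's C parameter penalizes margin violations, so it acts roughly as the inverse of the budget $C$ above (large sklearn C → few violations → low bias, high variance).

from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

# linear support vector classifier; tune the regularization parameter by cross-validation
grid = GridSearchCV(SVC(kernel="linear"),
                    param_grid={"C": [0.01, 0.1, 1, 10, 100]}, cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.score(X_test, y_test))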
Financial Machine Learning · Lecture 2

Support Vector Machines

The (linear) support vector classifier cannot handle nonlinear class boundaries.

What can we do?

Financial Machine Learning · Lecture 2

Nonlinear Classifiers Utilizing Polynomial Features

  • The original features: $X_1, X_2, \dots, X_p$

  • The polynomial features: $X_1, X_1^2, X_2, X_2^2, \dots, X_p, X_p^2$

  • The SVM via optimization: run the support vector classifier optimization in the enlarged feature space, i.e. with $f(x) = \beta_0 + \sum_{j=1}^{p}\beta_{j1}X_j + \sum_{j=1}^{p}\beta_{j2}X_j^2$

Financial Machine Learning · Lecture 2

Kernel Functions

  • Definition: $\kappa(x, x') = \langle \phi(x), \phi(x') \rangle$ for some feature map $\phi$

  • $\kappa$ is a kernel function if and only if the kernel matrix $\mathbf{K} = [\kappa(x_i, x_j)]_{ij}$ is positive semi-definite for any data $x_1, \dots, x_n$.
name | function
Linear kernel | $\kappa(x, x') = x^\top x'$
Polynomial kernel | $\kappa(x, x') = (x^\top x' + c)^d$
Radial kernel | $\kappa(x, x') = \exp\!\big(-\gamma \lVert x - x' \rVert^2\big)$
Gaussian kernel | $\kappa(x, x') = \exp\!\big(-\lVert x - x' \rVert^2 / (2\sigma^2)\big)$
Laplacian kernel | $\kappa(x, x') = \exp\!\big(-\lVert x - x' \rVert / \sigma\big)$
Sigmoid kernel | $\kappa(x, x') = \tanh\!\big(\beta\, x^\top x' + \theta\big)$
Financial Machine Learning · Lecture 2

Suppose $\kappa_1$ and $\kappa_2$ are kernel functions:

  • Any linear combination $a_1 \kappa_1 + a_2 \kappa_2$ with $a_1, a_2 \ge 0$ is a kernel function

  • The direct product $\kappa_1(x, x')\,\kappa_2(x, x')$ is a kernel function

  • For any function $g(\cdot)$, $\kappa(x, x') = g(x)\,\kappa_1(x, x')\,g(x')$ is a kernel function

Financial Machine Learning · Lecture 2

SVC vs. SVM

 | SVC | SVM
inner products / kernels | $\langle x, x_i \rangle$ | $K(x, x_i)$
functional form | $f(x) = \beta_0 + \sum_{i \in \mathcal{S}} \alpha_i \langle x, x_i \rangle$ | $f(x) = \beta_0 + \sum_{i \in \mathcal{S}} \alpha_i K(x, x_i)$

($\mathcal{S}$: the set of support-vector indices)

Financial Machine Learning · Lecture 2

SVMs with More than Two Classes

  • One-Versus-One (OVO) Classification

    • a.k.a. all-pairs
    • constructs $\binom{K}{2}$ SVMs, one for each pair of classes
    • classify a test obs using each of the $\binom{K}{2}$ SVMs
    • assign the obs to the class to which it was most frequently assigned in these pairwise classifications
  • One-Versus-All (OVA) Classification

    • a.k.a. one-versus-rest
    • fit $K$ SVMs, one per class (the class coded as "+1" and the rest coded as "−1")
    • let $\beta_{0k}, \beta_{1k}, \dots, \beta_{pk}$ denote the parameters of the $k$th SVM
    • assign the observation $x^*$ to the class for which $\beta_{0k} + \beta_{1k}x_1^* + \cdots + \beta_{pk}x_p^*$ is largest
Financial Machine Learning · Lecture 2

Relationship to Logistic Regression

  • The hinge loss + penalty form of support-vector classifier optimization:
    • let $f(X) = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p$
    • the optimization model

      $\min_{\beta_0, \dots, \beta_p} \left\{ \sum_{i=1}^{n} \max\big[0,\ 1 - y_i f(x_i)\big] + \lambda \sum_{j=1}^{p} \beta_j^2 \right\}$

    • it is very similar to the "loss" in logistic regression (the negative log-likelihood).
  • SVM vs. Logistic Regression
    • When classes are (nearly) separable, SVM does better than LR. So does LDA.
    • When not, LR (with ridge penalty) and SVM very similar.
    • If you wish to estimate probabilities, LR is the choice.
    • For nonlinear boundaries, kernel SVMs are popular. Can use kernels with LR and LDA as well, but computations are more expensive.
Financial Machine Learning · Lecture 2

Model Evaluation for Classification

Metric Aim
Confusion Matrix True/False pos & neg rates
Accuracy Overall classification rate
Precision & Recall (PR) Trade‑off critical for imbalanced data
ROC / AUC Ranking quality of probabilities
KS Statistic Discrimination for credit risk

Financial practice → PD model evaluation, fraud detection rates, and risk-control sensitivity analysis.

Financial Machine Learning · Lecture 2

Classification Algorithms at a Glance

Method | Mathematical idea / assumption | Nonlinear capacity | Output | Main advantage
Logistic Regression | Linear decision boundary; estimates probabilities | Low | Probability | Highly interpretable; standard errors available
LDA | Class-conditional normality with a common covariance | Linear | Probability | Stable; minimum-error benchmark under its assumptions
QDA | Class-conditional normality with class-specific covariances | Medium–high | Probability | Can fit differently shaped boundaries
Naïve Bayes | Conditional independence of features | Medium | Probability | Simple; well suited to high-dimensional text data
GAM | Additive nonlinear effects of multiple variables | High | Probability or expectation | Flexible yet interpretable
SVM | Margin maximization with kernel mapping | High | Discrete decision | Robust to noise; clear boundaries
Financial Machine Learning · Lecture 2

Typical Applications in Economics and Finance

Scenario | Suitable methods | Notes
Credit scoring / default prediction | Logistic, GAM | Accepted by regulators; interpretable probability output
Corporate bankruptcy / risk rating | LDA, QDA | Classical statistical discriminant approach
Fraud detection | SVM, Naïve Bayes | High-dimensional features, complex class boundaries
Market regime identification (bull/bear) | SVM, GAM | Can capture nonlinear or time-varying boundaries
Text sentiment classification (positive/negative) | Naïve Bayes, SVM | High-dimensional sparse word vectors
Macro policy stance classification | Logistic, GAM | Probability output is convenient for economic interpretation
Financial Machine Learning · Lecture 2

Model Comparison

Dimension | Logistic | LDA | QDA | Naïve Bayes | GAM | SVM
Interpretability | — | — | — | — | — | —
Nonlinear capacity | — | — | — | — | — | —
Small-sample performance | — | — | may overfit | — | — | needs regularization
Tolerance of high-dimensional features | needs regularization | poor | poor | — | — | depends on the kernel
Output form | probability | probability | probability | probability | probability / expectation | class label or margin score
Computational efficiency | — | — | — | — | — | relatively slow
Regulatory acceptance | — | — | — | — | — | —
Typical data structure | tabular | continuous features | continuous features with unequal variances | discrete text / categorical | multivariate nonlinear time series | high-dimensional nonlinear
Financial Machine Learning · Lecture 2

Method–Scenario Matching in Economic and Financial Research

Task | Data characteristics | Recommended algorithms | Rationale
Credit scoring / default probability | Small-to-medium samples, interpretability required | Logistic or GAM | Probability output, visual interpretation, regulatory compliance
Firm classification / financial risk tiers | Multivariate, approximately normal | LDA/QDA | Classical empirical tradition
Text or document classification | High-dimensional, sparse word counts | Naïve Bayes or SVM | Strong performance on high-dimensional text
Macroeconomic regime identification | Nonlinear, many factors | GAM or SVM | Captures nonlinearities or shifting boundaries
Market manipulation / fraud detection | Noisy, complex patterns | SVM or GAM | Strong nonlinear pattern recognition
Financial Machine Learning · Lecture 2

Summary

  • Logistic / LDA = statistically interpretable
  • GAM = interpretation plus flexibility
  • SVM / Naïve Bayes = prediction-oriented
  • QDA = a compromise (nonlinear yet analytically tractable)

Recommendations:

  • If a paper or report requires rigorous econometric interpretation → Logistic / GAM
  • If the application is trading strategies or anomaly detection → SVM / Naïve Bayes
  • For teaching or traditional credit research → LDA / QDA
  1. Interpretability first → the econometric school
     Logistic ≫ GAM
     → for policy research, regulation, and structural analysis.

  2. Predictive performance first → the machine-learning school
     SVM ≫ QDA ≫ Naïve Bayes
     → for market regime classification and risk monitoring.

  3. Mixed settings → additive or semi-parametric models
     GAM balances prediction and interpretation, and is especially common in economic research.

Financial Machine Learning · Lecture 2

Part 3 · Tree Based Models

Motivation

  • Capture nonlinear relationships and variable interactions in the data.
  • Ensemble methods such as bagging and boosting improve predictive accuracy and robustness.
  • Meet regulatory demands on model transparency and interpretability in financial institutions.

Tree‑based methods combine accuracy with interpretability, bridging prediction and explanation in finance.

Financial Machine Learning · Lecture 2

Classification and regression trees (CART): Regression Trees

  • regression trees:
    • Structure: The tree is composed of nested decision rules; each node compares a feature $x_j$ with a threshold $t$ to route the input left or right.
    • Leaves: Each leaf defines the predicted output for inputs reaching that region.
  • An example:
    • the regions of space: $R_1, R_2, \dots$, defined by the successive splits (e.g., $R_1 = \{x : x_j \le t\}$)

    • the output for region $R_1$ (the mean response) can be estimated using $\hat w_1 = \operatorname{mean}\{\, y_i : x_i \in R_1 \,\}$
Financial Machine Learning · Lecture 2

Regression Trees: formal representation and training

  • Formally, a regression tree can be defined by $f(x) = \sum_{j=1}^{J} w_j \,\mathbb{1}(x \in R_j)$

    • $R_j$ is the region specified by the $j$'th leaf node, $w_j$ is the predicted output for that node, and $J$ is the number of leaves.
    • The regions are nested rectangles of feature space, e.g. $R_1 = \{x : x_1 \le t_1\}$, $R_2 = \{x : x_1 > t_1,\ x_2 \le t_2\}$, etc.

Training objective: Find splits that maximize the reduction in MSE.

Categorical inputs: Splits compare feature to possible category values, rather than numeric thresholds.

Example recap

  • Feature splits — size → weight → color — partition the feature space into regions
  • The predicted value in each leaf is the average of the target values of samples within that region: $\hat w_j = \operatorname{mean}\{\, y_i : x_i \in R_j \,\}$
Financial Machine Learning · Lecture 2

CART: Classification Trees


  • classification trees:
    • Structure: The tree is composed of nested decision rules; each node splits on a feature by a numeric threshold or by category values.
    • Leaves: Each leaf stores class distribution and predicts the majority class.
  • An example:
    • Regions of space: $R_1, R_2, \dots, R_5$, defined by the successive splits
    • Leaf outputs (class counts): $R_1$: (4, 0) → class yes; $R_2$: (1, 1) → tie; $R_3$: (0, 2) → class no; $R_4$: (4, 0) → class yes; $R_5$: (0, 5) → class no
    • Prediction rule: predict the majority class of the leaf that $x$ falls into

      • $\Pr(y = c \mid x \in R_j)$ is estimated by the class proportion in $R_j$.

Financial Machine Learning · Lecture 2

Classification Trees: formal representation and training

  • Formally, a classification tree can be written as $f(x) = \sum_{j=1}^{J} w_j \,\mathbb{1}(x \in R_j)$

    • $R_j$: region corresponding to the $j$-th leaf,
    • $w_j$: majority class (or class-probability vector) in that region,
    • $J$: the number of leaves.

Training objective: Instead of minimizing squared error (as in regression trees), classification trees minimize impurity at each split. Common impurity measures:

  • Gini impurity: $G = \sum_{k=1}^{K} \hat p_{jk}\,(1 - \hat p_{jk})$
  • Entropy: $D = -\sum_{k=1}^{K} \hat p_{jk} \log \hat p_{jk}$
  • Split is chosen to maximize impurity reduction.

Categorical inputs: Splits compare to category values rather than numeric thresholds.

Example recap:

  • Feature splits — color → shape → size — partition the data into regions $R_1, \dots, R_5$.
  • Each region stores class counts (e.g., 4 yes, 0 no), and prediction is by majority vote.
Financial Machine Learning · Lecture 2

Regularization

  • the danger of overfitting: If we let the tree become deep enough, it can achieve 0 error on the training set (assuming no label noise), by partitioning the input space into sufficiently small regions where the output is constant.
  • two main approaches against overfitting:
    • The first is to stop the tree growing process according to some heuristic, such as having too few examples at a node, or reaching a maximum depth.
    • The second approach is to grow the tree to its maximum depth, where no more splits are possible, and then to prune it back, by merging split subtrees back into their parent.
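A sketch of both approaches with scikit‑learn: early stopping via depth / leaf-size limits, and cost-complexity pruning via ccp_alpha (the particular parameter values are illustrative; choose them by cross-validation).

from sklearn.tree import DecisionTreeClassifier

# (1) stop growing early: limit depth and require a minimum number of samples per leaf
shallow = DecisionTreeClassifier(max_depth=4, min_samples_leaf=20).fit(X_train, y_train)

# (2) grow fully, then prune back: the cost-complexity path gives candidate alphas
path = DecisionTreeClassifier().cost_complexity_pruning_path(X_train, y_train)
alpha = path.ccp_alphas[len(path.ccp_alphas) // 2]   # illustrative choice; select by CV
pruned = DecisionTreeClassifier(ccp_alpha=alpha).fit(X_train, y_train)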
Financial Machine Learning · Lecture 2

Pros and cons

  • advantages:
    • They are easy to interpret.
    • They can easily handle mixed discrete and continuous inputs.
    • They are insensitive to monotone transformations of the inputs (because the split points are based on ranking the data points), so there is no need to standardize the data.
    • They perform automatic variable selection.
    • They are relatively robust to outliers.
    • They are fast to fit, and scale well to large data sets.
    • They can handle missing input features.
  • disadvantages:
    • they do not predict very accurately compared to other kinds of models (due to the greedy nature of the tree-construction algorithm).
    • trees are unstable: small changes to the input data can have large effects on the structure of the tree, due to the hierarchical nature of the tree-growing process, causing errors at the top to affect the rest of the tree.
Financial Machine Learning · Lecture 2

Ensemble Learning: overview

  • Core Idea: Combine multiple base learners to produce a stronger, more stable model.

    • Individual models are imperfect — they may have high variance or bias.
    • By aggregating diverse models, random errors tend to cancel out → better generalization.
  • Why It Works

    • Error decomposition: similar bias + reduced variance (independence in base models)
    • Effect: improved prediction stability, robustness, and generalization.
  • Common Ensemble Methods

    Category How models are combined Goal Examples
    Averaging Train models independently and average or vote their predictions Reduce variance Bagging, Random Forests
    Boosting Train models sequentially, each focusing on previous errors Reduce bias AdaBoost, Gradient Boosting
    Stacking Learn a meta-model to optimally combine base model outputs Leverage diverse learners Stacking, Blending
Financial Machine Learning · Lecture 2

Regression Ensembles vs. Classification Ensembles

Regression Ensembles: averaging predictions

  • Idea: Combine outputs of multiple regression models by averaging: $\hat f_{\text{ens}}(x) = \dfrac{1}{M} \sum_{m=1}^{M} \hat f_m(x)$

    • Each $\hat f_m(x)$ is the prediction of the $m$-th base model.
    • Averaging smooths out noise and reduces variance.
    • Works especially well when base models are high‑variance (e.g., decision trees).

Classification Ensembles: voting or prob. averaging

  • Idea: Combine multiple classifiers by majority vote or class-probability averaging.

  • For probabilistic classifiers, use: $\hat p_{\text{ens}}(y = c \mid x) = \dfrac{1}{M} \sum_{m=1}^{M} \hat p_m(y = c \mid x)$, then predict the class with the largest averaged probability.
  • Voting reduces random classification errors and stabilizes predictions.
Financial Machine Learning · Lecture 2

Averaging-Based Ensembles: Bagging

  • Core Idea
    • Bagging (bootstrap aggregating) is an ensemble method that trains multiple base models on different bootstrap samples of the training data and then averages (or votes) their predictions.
    • Purpose: reduce the variance of high‑variance learners (e.g., trees) without increasing bias.
  • Algorithm (regression)
    1. Generate $B$ bootstrap samples $D_1, \dots, D_B$:
      • Each sample is drawn with replacement from the original training set of size $n$.
    2. Train a base model $\hat f_b$ on each $D_b$.
    3. For prediction: $\hat f_{\text{bag}}(x) = \dfrac{1}{B} \sum_{b=1}^{B} \hat f_b(x)$ (average for regression, majority vote for classification)
  • Each bootstrap sample contains about 63% unique examples; the remaining 37% are out‑of‑bag (OOB) instances, useful for performance estimation.
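A sketch with scikit‑learn's BaggingRegressor (the default base learner is a decision tree), using the out-of-bag observations for a free estimate of test performance; X_train, y_train are placeholders.

from sklearn.ensemble import BaggingRegressor

bag = BaggingRegressor(n_estimators=200,   # B bootstrap samples / base trees
                       oob_score=True,     # score each tree on its out-of-bag observations
                       random_state=0)
bag.fit(X_train, y_train)
print(bag.oob_score_)                      # OOB R^2, an estimate of out-of-sample performance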
Financial Machine Learning · Lecture 2
  • Why Bagging Works

    • Averaging independent (or weakly correlated) models cancels random fluctuations in predictions.
    • The ensemble becomes less sensitive to any single training example → variance reduction.
    • Bias remains roughly unchanged.
  • Notes

    • OOB predictions can estimate test‑set performance without cross‑validation.
    • Bagging may not help with stable learners (e.g., linear models), but greatly benefits unstable ones, such as decision trees.
    • Next: Random Forests extend bagging by introducing feature‑level randomness.
Financial Machine Learning · Lecture 2

Random Forests: Bagging with Feature Randomness

  • Core Idea
    • Random Forests extend Bagging by adding randomness in feature selection.
    • Each tree is trained on a bootstrap sample of the data,
      and at each split only a random subset of input features is considered.
    • The goal: further reduce correlation among trees → stronger variance reduction.
  • Algorithm
    1. For each tree $b = 1, \dots, B$:
      • Sample a bootstrap dataset $D_b$ (as in bagging).
      • Grow a decision tree $\hat f_b$:
        • At each split, randomly select $m$ features (typically $m \approx \sqrt{p}$, where $p$ = total number of features).
        • Choose the best split only within those $m$ features.
    2. Predict by averaging (regression) or majority voting (classification): $\hat f_{\text{RF}}(x) = \dfrac{1}{B} \sum_{b=1}^{B} \hat f_b(x)$

Financial Machine Learning · Lecture 2
  • Why RF Helps
    • Decorrelation effect: Limiting feature choices makes trees more diverse, reducing ensemble variance beyond standard bagging.
    • Bias–variance trade‑off: Slight bias increase, but larger variance reduction → better test performance.
    • Interpretability bonus: Allows estimation of feature importance via split statistics or OOB error impact.
  • Notes
    • Random Forests are robust to noise and overfitting, even with many trees.
    • Empirically outperform plain bagging when input features are correlated.
    • Typical hyperparameters:
      • Number of trees
      • Number of features per split
      • Tree depth and minimum leaf size
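A sketch with scikit‑learn, where max_features controls the number of features tried at each split and feature_importances_ gives split-based importance scores (all parameter values are illustrative).

from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(n_estimators=500,      # number of trees B
                           max_features="sqrt",   # m ≈ sqrt(p) features per split
                           min_samples_leaf=5,    # controls tree depth / leaf size
                           oob_score=True,
                           random_state=0)
rf.fit(X_train, y_train)
print(rf.oob_score_, rf.feature_importances_)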
Financial Machine Learning · Lecture 2

Boosting: focus on bias reduction

  • Core Idea
    • Boosting builds an ensemble sequentially,
      where each new model focuses on the errors made by previous ones.
    • Unlike Bagging or Random Forests (which are parallel and variance‑oriented),
      Boosting is adaptive — later models are trained to correct prior mistakes.
  • Mechanism
    1. Start with a simple base learner $f_1(x)$.
    2. Evaluate its prediction errors on the training data.
    3. Fit the next learner to emphasize misclassified or poorly predicted samples.
    4. Repeat this process for $M$ rounds — gradually improving overall accuracy.
    5. Combine all learners into a weighted sum (or vote): $F(x) = \sum_{m=1}^{M} \alpha_m f_m(x)$
Financial Machine Learning · Lecture 2

  • Why Boosting Works

    • Each new learner reduces the residual bias of the current ensemble.
    • Focused sequential fitting allows the model to capture complex patterns missed earlier.
    • As long as each base learner performs slightly better than random (weak learner),
      the final ensemble can become a strong learner.
  • Key Characteristics

    Aspect Boosting Bagging / RF
    Training style Sequential, adaptive Parallel, independent
    Main goal Reduce bias Reduce variance
    Model dependency Later models depend on earlier ones Models trained independently
    Typical base learners Weak (e.g., shallow trees) Unstable (e.g., deep trees)
    Example algorithms AdaBoost, Gradient Boosting Bagging, Random Forests
  • Notes

    • Boosting can sometimes overfit if trained too long — regularization and early stopping help.
    • Works best with simple base learners that underfit alone but combine effectively.

Financial Machine Learning · Lecture 2

Boosting: sequential learning algorithm

  • Core Algorithmic Idea
    • Boosting trains a sequence of base learners $f_1, f_2, \dots, f_M$, each one focusing on examples where previous models performed poorly.
    • The process maintains a weight $w_i$ for each training example, emphasizing misclassified (or high-error) cases as training progresses.
  • Generic Boosting Process
    1. Initialize: Assign equal weights $w_i = 1/n$ for all training samples.
    2. Iterate for $m = 1, \dots, M$:
      • Train base learner $f_m$ on the weighted dataset.
      • Compute the weighted error $\varepsilon_m$.
      • Determine the learner weight $\alpha_m$ (model importance).
      • Update sample weights: increase weights for misclassified samples, decrease weights for correct ones.
      • Normalize the $w_i$ so they sum to 1.
    3. Combine learners: $F(x) = \sum_{m=1}^{M} \alpha_m f_m(x)$
Financial Machine Learning · Lecture 2
  • Intuition
    • Correct predictions get less emphasis in the next step; difficult examples get more focus.
    • Each iteration shifts model capacity toward areas of previous error.
    • Effectively, the ensemble adapts to its weaknesses at each stage.
  • Why It Works
    • Aggregates many weak learners (slightly better than random) into a strong composite model.
    • The weighting mechanism acts like a residual correction process, steadily reducing bias.
    • Works particularly well with flexible but simple learners (e.g., shallow trees).
  • Notes
    • This generic framework underlies AdaBoost (for classification) and Gradient Boosting (for regression or differentiable losses).
Financial Machine Learning · Lecture 2

Forward Stagewise Additive Modeling (FSAM)

Model Definition: We seek an additive model combining multiple base learners: $F(x) = \sum_{m=1}^{M} \beta_m\, b(x; \gamma_m)$

  • $b(x; \gamma_m)$: base learner (weak model)
  • $\beta_m$: weight (step size or learner coefficient)
  • $M$: total number of boosting iterations
  • The final model is built sequentially, adding one component at a time.

Optimization Framework: We minimize an empirical loss function in a stagewise manner: $\min \sum_{i=1}^{n} L\big(y_i, F(x_i)\big)$

At each stage $m$:

  1. Given the current model $F_{m-1}$, find the next weak learner and its weight by

    $(\beta_m, \gamma_m) = \arg\min_{\beta, \gamma} \sum_{i=1}^{n} L\big(y_i,\ F_{m-1}(x_i) + \beta\, b(x_i; \gamma)\big)$

  2. Update the model: $F_m(x) = F_{m-1}(x) + \beta_m\, b(x; \gamma_m)$
This is a forward stagewise additive approach — each step performs a local optimization to reduce the total loss.

Financial Machine Learning · Lecture 2
  • Special Cases

    Algorithm Loss Function Interpretation
    AdaBoost | Exponential loss: $L(y, F) = \exp(-yF)$ | Weight update minimizing exponential risk
    Gradient Boosting | Arbitrary differentiable loss $L(y, F)$ | Learners fit the negative gradient of $L$ w.r.t. $F$
  • Summary

    • FSAM provides a unified mathematical view of boosting algorithms.
    • Different boosting methods differ only by their choice of loss function and how the next learner and step size are determined.
    • This framework bridges "statistical learning" and "optimization": boosting is gradient descent in function space.
Financial Machine Learning · Lecture 2

Gradient Boosting: algorithmic implementation

Core Idea

  • Gradient Boosting interprets boosting as performing gradient descent in function space.
  • Each iteration adds a new base learner that fits the negative gradient of the loss function with respect to the current model predictions:

    $F_m(x) = F_{m-1}(x) + \gamma_m\, h_m(x)$

  • Here, $h_m$ approximates the negative gradient of the loss function $L(y, F)$.
  • Gradient Boosting Algorithm: Given loss function $L(y, F)$:
    1. Initialize model $F_0(x) = \arg\min_{c} \sum_{i=1}^{n} L(y_i, c)$

    2. For $m = 1, \dots, M$:
      a. Compute pseudo-residuals (negative gradients): $r_{im} = -\left[\dfrac{\partial L\big(y_i, F(x_i)\big)}{\partial F(x_i)}\right]_{F = F_{m-1}}$

      b. Fit base learner $h_m$ to the training pairs $\{(x_i, r_{im})\}_{i=1}^{n}$.
      c. Find the optimal step size $\gamma_m = \arg\min_{\gamma} \sum_{i=1}^{n} L\big(y_i,\ F_{m-1}(x_i) + \gamma\, h_m(x_i)\big)$

      d. Update the model $F_m(x) = F_{m-1}(x) + \gamma_m\, h_m(x)$

    3. Final model: $\hat f(x) = F_M(x)$
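A sketch with scikit‑learn's GradientBoostingRegressor; the hyperparameters map onto the algorithm above (M rounds, learning rate, tree depth) and onto the regularization devices discussed below (shrinkage, subsampling, tree constraints). Values are illustrative.

from sklearn.ensemble import GradientBoostingRegressor

gbm = GradientBoostingRegressor(loss="squared_error",  # L(y, F): least-squares boosting
                                n_estimators=500,       # M boosting rounds
                                learning_rate=0.05,     # shrinkage of each update
                                max_depth=3,            # shallow (weak) base trees
                                subsample=0.8,          # stochastic gradient boosting
                                random_state=0)
gbm.fit(X_train, y_train)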
Financial Machine Learning · Lecture 2

Example: Quadratic loss and least squares boosting

  • squared error loss: $L(y, F) = \tfrac{1}{2}(y - F)^2$
  • the $i$'th term in the objective at step $m$ becomes

    $L\big(y_i,\ F_{m-1}(x_i) + \beta\, b(x_i; \gamma)\big) = \tfrac{1}{2}\big(r_{im} - \beta\, b(x_i; \gamma)\big)^2$

    • $r_{im} = y_i - F_{m-1}(x_i)$ is the residual of the current model on the $i$'th observation.
  • We can minimize the above objective by simply setting $\beta = 1$ and fitting $b(x; \gamma)$ to the residual errors. This is called least squares boosting.
Financial Machine Learning · Lecture 2

  • Insight

    • Each weak learner corrects the direction of error from previous models.
    • The step size controls the “learning rate.”
    • With small learning rates, many iterations → smoother optimization trajectory.
    • Provides a flexible, loss‑agnostic boosting framework.
  • Typical Hyperparameters

    Parameter | Role | Common Range
    $M$ | Number of boosting rounds | 100–1000
    $\nu$ (learning rate) | Shrinks the step size | 0.01–0.1
    Tree depth | Base learner capacity | 3–8
    Subsampling rate | Regularization by randomness | 0.5–1.0

  • Notes

    • Gradient Boosting includes popular algorithms such as GBDT, XGBoost, LightGBM, and CatBoost.
    • Strong performance with tabular data, but sensitive to hyperparameter tuning.

Financial Machine Learning · Lecture 2

Regularization and Enhancements in Gradient Boosting

  • Why Regularization Matters

    • Economic and financial data are often noisy, non‑stationary, and structurally unstable.
    • Classical Gradient Boosting can easily overfit training data, losing predictive power in new time periods or market regimes.
    • Regularization aims to: Control model complexity; Improve generalization; Enhance robustness.
  • Three Core Regularization Techniques

    Technique Key Idea Practical Effect Example in Financial Research
    (1) Shrinkage (learning rate) Reduce the update step after each iteration: $F_m(x) = F_{m-1}(x) + \nu\,\gamma_m h_m(x)$, where $0 < \nu \le 1$. Each model contributes less individually. More iterations → smoother convergence. In credit risk modeling, a small learning rate (e.g., 0.05) prevents the model from fitting extreme or rare default cases too aggressively.
    (2) Subsampling (stochastic sampling) Use a random subset (e.g., 50–80%) of observations at each iteration. Adds randomness, reducing variance. Works like stochastic gradient descent. In high‑frequency trading forecasts, random subsampling mitigates market micro‑noise and avoids overfitting transient patterns.
    (3) Tree Constraints (structural control) Limit tree complexity—depth, number of leaves, or minimum leaf size. Reduces model flexibility, controlling overfitting. In macroeconomic forecasting, using shallow trees (depth ≤ 4) prevents the model from reacting to short‑term, non‑structural fluctuations.


Financial Machine Learning · Lecture 2
  • Regularization in Practice

    • These three methods are typically used together.
    • Common configurations:
      • Learning rate $\nu \approx$ 0.01–0.1
      • Subsample rate 0.5–0.8
      • Tree depth 3–6
    • Combined effect: a smoother optimization path, lower variance, and better out-of-sample stability.
  • Practical Implications: Robustness over Perfection

    Research Context Overfitting Risk Recommended Strategy
    Credit scoring (small sample, many predictors) High Small learning rate + shallow trees
    Macroeconomic forecasting Medium Subsampling + depth constraint
    High‑frequency trading prediction Very high Strong regularization + time‑segmented training
    Portfolio risk modeling Medium Conservative parameters + repeated cross‑validation


Financial Machine Learning · Lecture 2
  • Summary
    • Regularization is the key to robust financial machine learning.
      • Shrinkage controls the step size
      • Subsampling introduces randomness
      • Tree constraints cap model complexity.
    • Together, they stabilize model performance in environments with structural changes and time variation, helping ensure that results remain economically interpretable and policy‑relevant.
Financial Machine Learning · Lecture 2

Stacking: focus on model diversity and cross‑model synergy

  • Core Idea
    • Stacking builds an ensemble hierarchically, combining predictions from multiple different algorithms through a higher‑level meta‑model.
    • Unlike Bagging or Boosting (which aggregate similar base models), stacking integrates heterogeneous learners — each capturing a distinct structure or assumption.
    • The meta‑model learns how much to trust each base learner according to its performance.
  • Mechanism
    1. Split the dataset into several folds (e.g., via K‑fold cross‑validation).
    2. Train diverse base learners (Level‑1 models): linear regression, random forest, gradient boosting, SVM, neural net.
    3. Generate out‑of‑sample predictions from these base models on validation folds.
    4. Use these predictions as meta‑features to train a meta‑model (Level‑2), typically a linear regression or regularized learner.
    5. Final prediction: $\hat y = g\big(\hat f_1(x), \dots, \hat f_M(x)\big)$, where the $\hat f_m$ are the base models and $g$ is the meta-model.

Financial Machine Learning · Lecture 2

  • Why Stacking Works

    • Different algorithms learn complementary aspects of financial data — e.g., linear structures, threshold effects, or interaction terms.
    • The meta‑model balances their strengths, correcting systematic errors of any single model.
    • By integrating multiple perspectives, Stacking achieves:
      • Higher robustness to model misspecification
      • Better generalization out‑of‑sample
      • Lower dependence on one model’s bias
  • Key Characteristics

    Aspect Stacking Bagging Boosting
    Architecture Hierarchical (multi‑level) Parallel Sequential
    Base models Heterogeneous (different types) Homogeneous Homogeneous
    Dependency Meta‑model depends on base outputs Independent Step‑wise dependent
    Main goal Combine diverse modeling strengths Reduce variance Reduce bias
    Example meta‑model Linear / Ridge regression


Financial Machine Learning · Lecture 2
  • Notes
    • Stacking requires careful data partitioning to avoid information leakage.
    • Computationally heavier, but conceptually flexible — applicable to regression, classification, or time‑series tasks.
    • Particularly useful in finance, where combining linear (economic theory‑based) and nonlinear (data‑driven) models often improves stability.
Financial Machine Learning · Lecture 2

Stacking Implementation Details

Data Splitting Strategy — Avoiding Information Leakage

  • Purpose: ensure the meta‑model sees only out‑of‑sample base predictions.
  • Use K‑fold cross‑validation or time‑based folds for temporal data:
    1. Split the data into $K$ folds.
    2. Train each base model on $K-1$ folds.
    3. Predict on the held-out fold → collect out-of-fold predictions.
    4. Combine all folds to form the meta‑training set.
  • For financial time series:
    • Maintain chronological order (train on past → predict future).
    • Prevent “look‑ahead bias” and regime contamination.

Base Models — How to Choose

Objective Recommended Base Models Rationale
Linear behavior OLS, LASSO, Ridge Stable, interpretable baseline
Nonlinear patterns Random Forest, Gradient Boosting Capture interactions and thresholds
Dynamic effects Recurrent NN, Temporal Trees Handle temporal structure
Mixed data sources Logistic + Tree Ensemble Combine economic and market features

Design principle: choose diverse but complementary learners that reflect different economic structures.

Financial Machine Learning · Lecture 2
  • Meta‑Model — How to Train

    • Goal: learn the optimal combination of base predictions.
    • Common choices:
      • Linear / Ridge regression — interpretable, low variance.
      • Elastic Net / LASSO — enforces sparsity among base models.
      • Simple tree or GBM — flexible when relationships are nonlinear.
    • The meta‑model should be simpler than base models, focusing on aggregation rather than rediscovering patterns.
  • Workflow Summary

    Step 1  Split data into training folds
    Step 2  Train diverse base models → get out-of-fold predictions
    Step 3  Construct meta-features from base predictions
    Step 4  Train meta-model on these meta-features
    Step 5  Apply trained pipeline to test data or new time periods
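A sketch of this workflow with scikit‑learn's StackingRegressor, which generates the out-of-fold base predictions internally via cross-validation; the base models and ridge meta-model are illustrative choices.

from sklearn.ensemble import StackingRegressor, RandomForestRegressor, GradientBoostingRegressor
from sklearn.linear_model import LassoCV, RidgeCV

base_models = [                                  # Level-1: diverse, complementary learners
    ("lasso", LassoCV(cv=5)),
    ("rf", RandomForestRegressor(n_estimators=300, random_state=0)),
    ("gbm", GradientBoostingRegressor(random_state=0)),
]
stack = StackingRegressor(estimators=base_models,
                          final_estimator=RidgeCV(),  # Level-2 meta-model on out-of-fold predictions
                          cv=5)                       # use TimeSeriesSplit for temporal data
stack.fit(X_train, y_train)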
    
Financial Machine Learning · Lecture 2
  • Example: Portfolio Risk Forecasting
    • Goal: predict 1‑month ahead portfolio volatility.
    • Base models:
      • Linear GARCH(1,1): captures conditional variance
      • Random Forest: leverages nonlinear return–factor relations
      • XGBoost: emphasizes tail risk and rare events
    • Meta‑model:
      • Ridge regression, trained on five‑fold out‑of‑sample predictions
    • Result: improved stability and fewer false volatility spikes versus any single model.
Financial Machine Learning · Lecture 2
  • Common Pitfalls

    Issue Description Mitigation
    Data leakage Meta‑model uses in‑sample predictions Strict fold separation or rolling‑window setup
    Over‑complex meta‑model Learns base model noise rather than signal Use regularized regressors
    Limited sample Too few observations to estimate second layer Reduce base model number or fold count
    Inconsistent scaling Base model outputs on different scales Standardize before meta‑training

  • Summary

    • Proper data partitioning is crucial — without it, stacking fails.
    • Base models provide structural diversity; meta‑model provides adaptive synthesis.
    • In finance, this layered design supports robust cross‑regime predictions and better economic interpretability.
Financial Machine Learning · Lecture 2

Ensemble Learning Overview — Bagging vs. Boosting vs. Stacking

  • Three Main Approaches at a Glance

    Aspect Bagging Boosting Stacking
    Core Strategy Parallel resampling and voting Sequential error correction Hierarchical model integration
    Model Dependency Independent learners Each learner depends on prior errors Meta‑level depends on base outputs
    Main Objective Reduce variance Reduce bias Combine diverse model strengths
    Typical Base Learners Unstable models (e.g., deep trees) Weak models (e.g., shallow trees) Mixed models (linear + nonlinear)
    Combining Rule Averaging / voting Weighted additive updates Meta‑model learns optimal weights
    Representative Algorithms Random Forest AdaBoost, GBM, XGBoost Stacked Generalization
    Bias–Variance–Diversity View ↓ Variance ↓ Bias ↑ Model Diversity


Financial Machine Learning · Lecture 2

  • How They Complement Each Other

    • Bagging / Random Forest: stabilize high‑variance estimators through resampling. → Reliable for noisy, high‑dimensional data (e.g., sentiment scores, market features).
    • Boosting: refine weak learners iteratively, focusing on difficult samples. → Effective where the signal is subtle or nonlinear, like credit scoring or rating predictions.
    • Stacking: integrate models of different natures (econometric and machine‑learning) → Most suitable for complex, multi‑structural problems such as GDP forecasts or systemic risk indices.
  • Guidance for Financial & Economic Applications

    Scenario Preferred Method Reason / Goal
    Credit risk modeling
    (imbalanced labels, tabular data)
    Boosting (e.g., XGBoost) Focuses on hard‑to‑classify defaults; handles feature interactions.
    Macroeconomic forecasting
    (few features, temporal structure)
    Bagging / Random Forest Reduces variance from small samples; robust to outliers.
    Market microdata or multi‑source models
    (prices, text, fundamentals)
    Stacking Integrates heterogeneous models; combines interpretability and flexibility.
    Portfolio optimization or volatility forecasting Stacking / Bagging mix Balances predictive stability with adaptability across regimes.


Financial Machine Learning · Lecture 2
  • Key Takeaways
    • All ensemble methods share the same philosophy: multiple weak models → collective strength.
    • Their distinction lies in how ensemble diversity is constructed:
      • Bagging: through data resampling (variance reduction)
      • Boosting: through iterative focusing (bias reduction)
      • Stacking: through model heterogeneity (diversity fusion)
    • In financial research, they collectively support robust, regime‑resilient predictions and encourage model pluralism — a critical principle in empirical economics.
Financial Machine Learning · Lecture 2

Part 4 · Unsupervised Learning

Motivation

  • Explore latent structure and hidden factors in financial data.
  • Dimensionality reduction simplifies complex systems; clustering reveals market patterns and investor groups.
  • Supports preliminary analysis for risk stratification and factor extraction.

Unsupervised learning as data‑driven exploration for structure discovery in financial systems.

Financial Machine Learning · Lecture 2

Unsupervised Learning — Introduction and Core Ideas

  • From Supervised to Unsupervised

    • Supervised learning: learns a mapping from labeled data (focused on prediction and inference).
    • Unsupervised learning: explores the structure within data itself (aimed at discovery, compression, and representation).
    • No predefined outcome variable; models uncover hidden patterns, groups, or relationships.
  • Core Philosophy

    Supervised Unsupervised
    Learns from known targets Learns from input structure only
    Goal: minimize error or loss Goal: maximize pattern clarity or compactness
    Typical tasks: regression, classification Typical tasks: clustering, dimensionality reduction, anomaly detection, association rules
    Oriented toward prediction Oriented toward understanding / exploration
Financial Machine Learning · Lecture 2
  • Example Intuitions

    • Clustering: grouping firms or customers with similar patterns. → “Which entities behave alike?”
    • Dimensionality Reduction: extracting key factors driving variation. → “What underlying forces move these variables together?”
    • Association Rules: finding co‑occurring events. → “If X happens, is Y likely to follow?”
    • Anomaly Detection: identifying rare or abnormal behavior. → “Which period or firm looks unusual?”
  • Why It Matters in Finance & Economics

    Application Area Example Use Case Benefit
    Market Structure Analysis Identify groups of stocks with similar return behavior reveal sector co‑movements
    Consumer Finance / Credit Cluster borrowers by spending and repayment patterns better segmentation, risk profiling
    Macroeconomics Extract hidden economic factors from multiple indicators simplify large datasets for policy analysis
    Fraud / Crisis Detection Spot anomalies in transaction or macro trends early warning and control
Financial Machine Learning · Lecture 2
  • Conceptual Analogy

    Supervised → teacher provides correct answers
    Unsupervised → students self-organize into study groups
    

    The “teacher” (labels) is absent — yet insights emerge from how data points relate to each other.
    This makes unsupervised methods ideal for exploratory analysis and hypothesis generation.

  • Transition

    In the following pages, we will explore major unsupervised methods:

    • Clustering → discovering similarity structures,
    • Dimensionality Reduction → summarizing complex variables,
    • Association Rules & Anomaly Detection → uncovering hidden relationships and outliers.

    These techniques turn raw, unlabeled data into interpretable economic knowledge.

Financial Machine Learning · Lecture 2

Clustering Methods — Overview (K‑Means, Hierarchical, DBSCAN)

  • Core Idea of Clustering
    • Goal: group observations into clusters such that objects in the same cluster are similar, while objects in different clusters are dissimilar.
    • The definition of “similarity” depends on the distance metric (e.g., Euclidean, cosine, Mahalanobis).
    • Clustering provides a structural view of the data, helping identify latent groups or regimes.
  • Main Approaches

    Method Key Principle Strengths Limitations
    K‑Means Minimize within‑cluster variance (inertia) Fast, simple, widely used Must pre‑specify K; sensitive to scale and outliers
    Hierarchical Clustering Merge or split clusters based on distance linkage (single, complete, average) Visual dendrogram; no preset K Computationally heavy for large N
    DBSCAN Density‑based: clusters are dense regions separated by sparse ones Detects irregular shapes and noise Parameter tuning (ε, MinPts) required
Financial Machine Learning · Lecture 2

K-Means Clustering

  • Partitioning a data set into $K$ distinct, non-overlapping clusters.
Financial Machine Learning · Lecture 2
  • Let $C_1, \dots, C_K$ denote sets containing the indices of the observations in each cluster. These sets satisfy two properties:
      1. $C_1 \cup C_2 \cup \cdots \cup C_K = \{1, \dots, n\}$. In other words, each observation belongs to at least one of the $K$ clusters.
      2. $C_k \cap C_{k'} = \emptyset$ for all $k \neq k'$. In other words, the clusters are non-overlapping: no observation belongs to more than one cluster.
  • the big idea
    • the within-cluster variation is as small as possible: $\min_{C_1, \dots, C_K} \sum_{k=1}^{K} W(C_k)$

    • within-cluster variation: $W(C_k) = \dfrac{1}{|C_k|} \sum_{i, i' \in C_k} \sum_{j=1}^{p} (x_{ij} - x_{i'j})^2$

Financial Machine Learning · Lecture 2
  • How to Choose a Method

    Data Characteristic Recommended Method Reason
    Clear cluster boundaries, roughly spherical K‑Means Efficient, stable centroids
    Unknown number of groups, need hierarchy Hierarchical Reveals nested structure
    Irregular shapes or noise present DBSCAN Density‑based robustness
    Very large N, high-dimensional Start with MiniBatch K-Means Scalable approximation
  • Evaluating Cluster Quality

    • Inertia / Within‑Cluster SSE → cohesion measure
    • Silhouette Score ( −1 → 1 ) → how well points fit assigned cluster
    • Elbow Method → visual way to choose K
    • Validation via domain insight → check if clusters make economic sense

    In financial applications, interpretability of clusters is as important as numerical compactness.
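A sketch of the evaluation loop with scikit‑learn, scanning K and recording inertia (within-cluster SSE, for the elbow method) and the silhouette score on standardized features; X is a placeholder feature matrix.

from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score

X_std = StandardScaler().fit_transform(X)          # clustering is scale-sensitive
for k in range(2, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_std)
    print(k, km.inertia_,                          # within-cluster SSE (elbow method)
          silhouette_score(X_std, km.labels_))     # cohesion vs. separation, in (-1, 1)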

Financial Machine Learning · Lecture 2
  • Financial & Economic Relevance

    Domain Example Use Outcome
    Financial Markets Group stocks or investors by return correlation patterns Identify market regimes or style clusters
    Consumer Behavior Segment clients by transaction history Target marketing and credit strategies
    Macro Policy Cluster countries by macro indicators Reveal structural similarity or divergence
Financial Machine Learning · Lecture 2

Clustering in Economics — Market Segmentation Example

  • Objective

    • Demonstrate how clustering helps identify hidden structures in economic or market data.
    • Example: Market Segmentation — grouping companies (or consumers) by behavioral or financial similarity.
    • Goal: obtain distinct, data‑driven “segments” rather than arbitrary categories.

    The cluster structure often reveals latent regimes or business strategies invisible to simple averages.

  • Example Dataset

    Entity Features Used for Clustering Economic Meaning
    Firms (Marketing / Retail) Sales growth, advertising ratio, product diversity, digital channel usage Reflect market behavior and innovation intensity
    Consumers (Finance / Banking) Spending frequency, average transaction size, credit utilization rate Reveal different consumption / risk profiles
    Stocks (Market Data) Average return, volatility, turnover, correlation to index Identify style clusters or behavioral regimes

    Standardization is essential — scale all features to comparable units before clustering.

Financial Machine Learning · Lecture 2
  • Applying K‑Means
    • Workflow

      1. Preprocess data (remove outliers, standardize features).
      2. Choose via Elbow Method or Silhouette Score.
      3. Fit K‑Means to dataset ⇒ obtain cluster assignments.
      4. Profile each cluster → interpret economic meaning.
    • Illustrative Result

      Cluster ID Profile Summary Representative Behavior
      1 High sales & high digital usage Digital Leaders
      2 Moderate growth & traditional channels Conventional Players
      3 Small firms, low marketing spend Niche Survivors
Financial Machine Learning · Lecture 2
  • Economic Interpretation

    • Market segmentation provides actionable insights beyond descriptive averages:
      • Strategy targeting (different pricing or product lines per segment).
      • Investment classification (growth vs. value vs. digital momentum).
      • Policy evaluation (are certain market groups lagging or leading?).
    • Clusters form a data‑driven taxonomy, enabling more granular analysis.

    In research terms, clustering can act as an unsupervised labeling mechanism for subsequent models.

  • Extension Ideas

    Direction Purpose in Economic Research
    Cluster stability over time Study structural change or market regime shifts
    Cluster transition matrix Evaluate mobility among behavior types
    Combine with supervised models Use cluster label as explanatory or control variable
    Hybrid approach (K‑Means + DBSCAN) Capture both core groups and fringe anomalies
Financial Machine Learning · Lecture 2

Dimensionality Reduction — PCA and t‑SNE

  • Motivation
    High‑dimensional data are common in econ & finance:
    • Hundreds of macroeconomic indicators
    • Thousands of stock returns or text features
  • Problems:
    • Multicollinearity
    • Redundant information
    • Visualization difficulty
  • Dimensionality Reduction compresses data while preserving its most important variance or structure.

  • Two Main Philosophies: Principal Component Analysis and t‑Distributed Stochastic Neighbor Embedding

    | Method | Type | Key Idea | Output Space |
    |---|---|---|---|
    | PCA | Linear projection | Rotates axes to maximize variance explained | Components are linear combos of original variables |
    | t‑SNE | Nonlinear manifold learning | Preserves local neighbor relationships in low‑D space | 2‑D/3‑D embedding suitable for visualization |
Financial Machine Learning · Lecture 2
  • Example: handwritten digit recognition (28×28‑dimensional images projected to 2‑D)
Financial Machine Learning · Lecture 2
  • Example: human face recognition (64×64‑dimensional images projected to 3‑D)
Financial Machine Learning · Lecture 2
  • PCA — Core Mechanism

    1. Standardize the variables: $x_j \leftarrow (x_j - \bar{x}_j)/s_j$.
    2. Compute the covariance matrix $\Sigma = \frac{1}{n} X^\top X$ (for centered $X$).
    3. Solve the eigenvalue problem $\Sigma v = \lambda v$.
    4. Sort eigenvectors by eigenvalues → principal components.
  • Example Interpretation:

    • PC1 → “overall economic activity factor”
    • PC2 → “inflation vs. growth trade‑off”
    • Variance explained by first few PCs = data compression efficiency.

PCA uncovers latent orthogonal directions that best summarize the dataset.

Financial Machine Learning · Lecture 2
  • A set of $p$-dimensional features $X_1, X_2, \dots, X_p$
  • The first principal component $Z_1 = \phi_{11}X_1 + \phi_{21}X_2 + \cdots + \phi_{p1}X_p$

    • is the normalized linear combination of the features that has the largest variance (normalization: $\sum_{j=1}^{p} \phi_{j1}^2 = 1$).
    • the loadings of the first principal component: $\phi_{11}, \dots, \phi_{p1}$
    • the principal component loading vector: $\phi_1 = (\phi_{11}, \phi_{21}, \dots, \phi_{p1})^\top$
  • for a specific point $x_i$, the first principal component score is:

    $$z_{i1} = \phi_{11} x_{i1} + \phi_{21} x_{i2} + \cdots + \phi_{p1} x_{ip}$$

  • the most informative direction $\phi_1$ solves:

    $$\max_{\phi_{11},\dots,\phi_{p1}} \; \frac{1}{n}\sum_{i=1}^{n}\Big(\sum_{j=1}^{p}\phi_{j1} x_{ij}\Big)^2 \quad \text{subject to} \quad \sum_{j=1}^{p}\phi_{j1}^2 = 1$$

  • the second principal component $Z_2$

    • has maximal variance out of all linear combinations that are uncorrelated with $Z_1$
Financial Machine Learning · Lecture 2
  • Another Interpretation of Principal Components: Principal components provide low-dimensional linear surfaces that are closest to the observations.
Financial Machine Learning · Lecture 2
  • $\sum_{m=1}^{M} z_{im}\phi_{jm}$ is the best $M$-dimensional approximation (in terms of Euclidean distance) to the $i$th observation's value $x_{ij}$

  • the optimization problem

    $$\min_{A \in \mathbb{R}^{n \times M},\; B \in \mathbb{R}^{p \times M}} \; \sum_{j=1}^{p}\sum_{i=1}^{n}\Big(x_{ij} - \sum_{m=1}^{M} a_{im} b_{jm}\Big)^2$$

  • the smallest possible value of the objective is attained at $a_{im} = z_{im}$ and $b_{jm} = \phi_{jm}$, i.e., by the first $M$ principal component scores and loading vectors

  • Principal component score and loading vectors can give a good approximation to the data when $M$ is sufficiently large
Financial Machine Learning · Lecture 2
  • The Proportion of Variance Explained (PVE)
    • The total variance present in a (centered) data set is defined as

      $$\sum_{j=1}^{p} \operatorname{Var}(X_j) = \sum_{j=1}^{p} \frac{1}{n}\sum_{i=1}^{n} x_{ij}^2$$

    • the variance explained by the $m$th principal component is

      $$\frac{1}{n}\sum_{i=1}^{n} z_{im}^2 = \frac{1}{n}\sum_{i=1}^{n}\Big(\sum_{j=1}^{p}\phi_{jm} x_{ij}\Big)^2$$

    • the PVE of the $m$th principal component is the ratio of the two:

      $$\text{PVE}_m = \frac{\sum_{i=1}^{n}\big(\sum_{j=1}^{p}\phi_{jm} x_{ij}\big)^2}{\sum_{j=1}^{p}\sum_{i=1}^{n} x_{ij}^2}$$
Financial Machine Learning · Lecture 2
  • the variance of the data can be decomposed into the variance of the first $M$ principal components plus the mean squared error of this $M$-dimensional approximation:

    $$\sum_{j=1}^{p}\frac{1}{n}\sum_{i=1}^{n} x_{ij}^2 \;=\; \sum_{m=1}^{M}\frac{1}{n}\sum_{i=1}^{n} z_{im}^2 \;+\; \frac{1}{n}\sum_{j=1}^{p}\sum_{i=1}^{n}\Big(x_{ij} - \sum_{m=1}^{M} z_{im}\phi_{jm}\Big)^2$$

  • we can therefore interpret the PVE as the $R^2$ of the approximation for $X$ given by the first $M$ principal components (a numerical check appears in the sketch below).
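
A minimal sketch of these PVE computations on a simulated data matrix (the data and the choice M = 2 are illustrative assumptions): eigen-decompose the sample covariance matrix, report the PVE of each component, and verify the variance decomposition above.

```python
# PVE from scratch: eigen-decomposition of the sample covariance matrix.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))            # illustrative data matrix (n x p)
Xc = X - X.mean(axis=0)                  # center each column

cov = Xc.T @ Xc / Xc.shape[0]            # sample covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns ascending eigenvalues
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]   # sort descending

pve = eigvals / eigvals.sum()            # PVE of each principal component
print("PVE:", pve.round(3))
print("cumulative PVE:", pve.cumsum().round(3))

# Check the decomposition: total variance = explained variance + residual MSE
Z = Xc @ eigvecs[:, :2]                  # scores on the first M = 2 components
approx = Z @ eigvecs[:, :2].T            # rank-2 approximation of Xc
total_var = (Xc ** 2).sum() / Xc.shape[0]
resid = ((Xc - approx) ** 2).sum() / Xc.shape[0]
explained = eigvals[:2].sum()
print("decomposition holds:", np.isclose(total_var, explained + resid))
```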

Financial Machine Learning · Lecture 2
  • Coding: Visualization
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.datasets import load_digits

digits = load_digits()
print(digits.data.shape)        # (1797, 64): 8x8 pixel images flattened to 64 features

# project from 64 to 2 dimensions
pca = PCA(2)
projected = pca.fit_transform(digits.data)

# visualization: color each projected point by its digit label
plt.scatter(
  projected[:, 0], projected[:, 1],
  c=digits.target, edgecolor='none', alpha=0.5,
  cmap=plt.get_cmap('Spectral', 10))   # 10 discrete colors, one per digit class
plt.xlabel('component 1')
plt.ylabel('component 2')
plt.colorbar();

Financial Machine Learning · Lecture 2

t‑SNE — Intuitive Picture

  • Converts distances into probabilities of similarity.
  • Minimizes Kullback–Leibler divergence between high‑D and low‑D neighborhoods.
  • Excellent for visualizing nonlinear clusters (e.g., consumer patterns, regime shifts).
  • Not suitable for formal inference — mainly exploratory/visual.
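
A minimal sketch, projecting the same 64-dimensional digits data to 2-D with t-SNE for visual comparison with the PCA projection above; the perplexity and initialization settings are illustrative defaults, not tuned values.

```python
# t-SNE embedding of the digits data for visual exploration.
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

digits = load_digits()
emb = TSNE(n_components=2, perplexity=30, init="pca",
           random_state=0).fit_transform(digits.data)

plt.scatter(emb[:, 0], emb[:, 1], c=digits.target, cmap="Spectral",
            s=8, alpha=0.6)
plt.colorbar(label="digit class")
plt.title("t-SNE embedding of the digits data")
plt.show()
```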
Financial Machine Learning · Lecture 2
  • Comparison Summary

    | Aspect | PCA | t‑SNE |
    |---|---|---|
    | Linear / Nonlinear | Linear | Nonlinear |
    | Goal | Maximize global variance | Preserve local similarities |
    | Output interpretability | High | Low (no explicit factors) |
    | Use case | Factor extraction, noise reduction | Visual exploration, clustering aid |
    | Runtime scalability | Very fast | Slower for large N |
Financial Machine Learning · Lecture 2
  • Financial & Economic Applications

    | Context | How Used | Insight |
    |---|---|---|
    | Macroeconomics | Reduce 100+ indicators to a few principal components | Identify underlying economic cycles or shocks |
    | Portfolio Risk | Decompose covariance matrix via PCA | Reveal dominant risk factors (market, size, sector) |
    | ESG Analytics | Compress dozens of scores to a composite | Build interpretable sustainability indices |
    | Consumer Analysis / Text Data | Visualize similarity in spending or opinions | Discover behavioral clusters |
Financial Machine Learning · Lecture 2

PCA in Finance — Factor Analysis and Risk Decomposition

Motivation
Financial datasets often contain highly correlated variables, e.g. stock returns, factor exposures, or risk indicators.

PCA helps extract a smaller number of latent common factors driving co‑movements across assets.
Typical questions:

  • What are the dominant sources of market variation?
  • How much of portfolio risk comes from each factor?

From Returns to Factors
Given a $T \times N$ matrix of (demeaned) asset returns $R$:

  1. Compute the covariance matrix $\Sigma = \frac{1}{T} R^\top R$.
  2. Eigen‑decomposition: $\Sigma = V \Lambda V^\top$, with $\Lambda = \operatorname{diag}(\lambda_1 \ge \cdots \ge \lambda_N)$.
  3. Principal components: $F = R V$ (the time series of factor $k$ is $F_{\cdot k} = R v_k$).
  4. Variance explained: $\lambda_k / \sum_{j} \lambda_j$, the proportion of total risk attributable to factor $k$.

Eigenvalues → risk magnitude; eigenvectors → factor direction.
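
A minimal sketch of these steps on simulated returns. The one-factor data-generating process, the sample size, and the equal-weight portfolio are illustrative assumptions used only to show the mechanics.

```python
# Returns -> factors -> risk decomposition on simulated asset returns R (T x N).
import numpy as np

rng = np.random.default_rng(2)
T, N = 500, 20
market = rng.normal(scale=0.01, size=(T, 1))                      # common market shock
R = market @ rng.uniform(0.8, 1.2, size=(1, N)) \
    + rng.normal(scale=0.005, size=(T, N))                        # add idiosyncratic noise

Rc = R - R.mean(axis=0)                  # demean returns
Sigma = Rc.T @ Rc / T                    # covariance matrix of returns

eigvals, eigvecs = np.linalg.eigh(Sigma) # Sigma = V Lambda V'
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]                # sort descending

F = Rc @ eigvecs                         # principal component (factor) time series
share = eigvals / eigvals.sum()          # variance explained by each factor
print("variance share of PC1-PC3:", share[:3].round(3))

# Portfolio risk decomposition for an equal-weight portfolio w
w = np.ones(N) / N
contrib = eigvals * (w @ eigvecs) ** 2   # lambda_k * (w' v_k)^2 for each PC
print(f"portfolio variance       : {w @ Sigma @ w:.6e}")
print(f"sum of PC contributions  : {contrib.sum():.6e}")          # should match
```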

Financial Machine Learning · Lecture 2
  • Economic Interpretation of PCA Factors

    | Principal Component | Possible Economic Meaning | Typical Pattern |
    |---|---|---|
    | PC1 | Market‑wide factor | Explains the largest portion of price movement; highly correlated with the index. |
    | PC2 | Sector rotation factor | Distinguishes cyclical vs. defensive industries. |
    | PC3 | Size or liquidity factor | Captures small‑vs‑large or liquid‑vs‑illiquid contrast. |
  • Portfolio Risk Decomposition

    For portfolio weights $w$, the portfolio variance decomposes along the principal components:

    $$w^\top \Sigma w \;=\; \sum_{k} \lambda_k \,(w^\top v_k)^2$$

    Each term corresponds to the contribution of one principal component to portfolio risk.

    | Component | Variance Share | Interpretation |
    |---|---|---|
    | PC1 | 52 % | Systematic market risk |
    | PC2 | 18 % | Sector rotation risk |
    | PC3+ | 30 % | Idiosyncratic or noise |
Financial Machine Learning · Lecture 2

Association Rule Learning — Market Basket Analysis in Retail

Core Idea

Association Rule Learning discovers co‑occurrence patterns among items or events:

“Which products / actions tend to happen together?”

Originally used in retail (shopping baskets), it has broad economic and financial applications — from consumer analytics to transaction networks and risk event detection.

Example Scenario

In a supermarket dataset of transactions:

| Transaction ID | Items Purchased |
|---|---|
| 001 | Bread, Milk |
| 002 | Bread, Diapers, Beer |
| 003 | Milk, Diapers, Beer, Cola |
| 004 | Bread, Milk, Diapers, Beer |
| 005 | Bread, Milk, Cola |

Goal → Find rules such as: {Bread, Milk} → {Beer}
which means customers buying Bread and Milk often also buy Beer.

Financial Machine Learning · Lecture 2
  • Rule Metrics

| Measure | Formula | Meaning |
|---|---|---|
| Support | $\text{supp}(A \Rightarrow B) = P(A \cap B)$ | Frequency of transactions containing both A and B |
| Confidence | $\text{conf}(A \Rightarrow B) = \dfrac{P(A \cap B)}{P(A)}$ | Probability of B given A |
| Lift | $\text{lift}(A \Rightarrow B) = \dfrac{\text{conf}(A \Rightarrow B)}{P(B)}$ | Strength of association beyond chance ( > 1 = positive correlation ) |

Example: if Lift = 1.8, customers buying A are 80% more likely to also buy B than average.
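
A minimal sketch computing the three metrics directly on the five toy transactions above; the second rule, {Diapers} → {Beer}, is added purely for comparison and is not part of the original example.

```python
# Support, confidence, and lift computed by hand on the toy baskets.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diapers", "Beer"},
    {"Milk", "Diapers", "Beer", "Cola"},
    {"Bread", "Milk", "Diapers", "Beer"},
    {"Bread", "Milk", "Cola"},
]

def rule_metrics(A, B, baskets):
    n = len(baskets)
    supp_A  = sum(A <= t for t in baskets) / n            # P(A)
    supp_B  = sum(B <= t for t in baskets) / n            # P(B)
    supp_AB = sum((A | B) <= t for t in baskets) / n      # P(A and B)
    conf = supp_AB / supp_A                               # P(B | A)
    return supp_AB, conf, conf / supp_B                   # support, confidence, lift

for A, B in [({"Bread", "Milk"}, {"Beer"}), ({"Diapers"}, {"Beer"})]:
    s, c, l = rule_metrics(A, B, transactions)
    print(f"{sorted(A)} -> {sorted(B)}: support={s:.2f}, confidence={c:.2f}, lift={l:.2f}")
```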

  • The Apriori Algorithm

    1. Generate frequent itemsets above a minimum support.
    2. Expand combinations incrementally (breadth first).
    3. Derive high‑confidence association rules.

    Uses the “Apriori Property”: All subsets of a frequent itemset must also be frequent.

    Popular extensions: FP‑Growth, ECLAT for scalability.
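
    For larger datasets the same mining can be automated. A minimal sketch using the third-party mlxtend package (an assumption: it is not part of scikit-learn and must be installed separately, e.g. `pip install mlxtend`); the support and lift thresholds are illustrative.

```python
# Apriori mining of the toy baskets with mlxtend (optional dependency).
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

transactions = [
    ["Bread", "Milk"],
    ["Bread", "Diapers", "Beer"],
    ["Milk", "Diapers", "Beer", "Cola"],
    ["Bread", "Milk", "Diapers", "Beer"],
    ["Bread", "Milk", "Cola"],
]

# One-hot encode the baskets into a boolean item matrix
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)

# Steps 1-2: frequent itemsets above the minimum support
frequent = apriori(onehot, min_support=0.4, use_colnames=True)

# Step 3: derive rules and rank them by lift
rules = association_rules(frequent, metric="lift", min_threshold=1.0)
print(rules[["antecedents", "consequents", "support", "confidence", "lift"]])
```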

Financial Machine Learning · Lecture 2
  • Applications Beyond Retail

    | Domain | Data Source | Insight Gained |
    |---|---|---|
    | E‑Commerce / Banking | Purchase or transaction logs | Cross‑selling & recommendation |
    | Macroeconomics | Country macro indicators | Detect co‑movement patterns (e.g., inflation ↑ & energy price ↑ → policy tightening) |
    | Finance & Risk | Fraud or loss event logs | Co‑occurring trigger analysis |
    | Text Analytics | Keyword or topic co‑occurrence | Identify latent issue linkages |
  • Interpretation in Economic Context

    • In behavioral economics: helps understand consumption patterns.
    • In finance: supports product bundling or event correlation analysis
      (e.g., policy shocks co‑occurring with certain market responses).
    • In policymaking: identifies frequent indicator combinations preceding certain outcomes (e.g., inflation ↑ + imports ↓ → recession sign).
Financial Machine Learning · Lecture 2

Anomaly Detection — Fraud and Crisis Early Warning

  • Anomaly (Outlier) = an observation that deviates strongly from the expected pattern of data.
    • In finance and economics, anomalies often signal irregular behavior, fraud, or structural crisis.
    • Tasks of anomaly detection:
      1. Define what is normal (based on data distribution or proximity).
      2. Measure deviation from that normal region.
      3. Flag suspicious or rare events for further analysis.
  • Typical Economic & Financial Contexts

    | Domain | Anomaly Example | Value of Detection |
    |---|---|---|
    | Banking & Payments | Unusual transaction pattern or amount | Fraud prevention, AML systems |
    | Financial Markets | Abnormal return or volatility spike | Early signal of market stress |
    | Macroeconomics | Sudden divergence in indicators (e.g., credit vs growth) | Crisis early warning |
    | Corporate Finance | Unexpected accounting figures | Governance & audit inspections |

    Unsupervised anomaly detection is crucial when labeled fraud or crisis data are limited.

Financial Machine Learning · Lecture 2
  • Major Approaches

    | Approach | Mechanism | When to Use |
    |---|---|---|
    | Statistical thresholding | Identify points far from the mean (z‑score, IQR rule) | Small datasets, interpretable |
    | Distance‑based methods | Compute nearest neighbors → flag isolated points | Moderate datasets, clear metric space |
    | Density‑based (DBSCAN / LOF) | Low density = outlier | Non‑linear structure |
    | Model‑based (One‑Class SVM, Isolation Forest) | Learn boundary of the "normal" region | High‑dimensional / complex data |

    Isolation Forest Key Idea: randomly partition data; anomalies require fewer splits to isolate.
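
    A minimal sketch of this idea with scikit-learn's IsolationForest on simulated transaction features; the injected outliers and the 1% contamination rate are illustrative assumptions.

```python
# Model-based anomaly detection with an Isolation Forest.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(3)
normal = rng.normal(0, 1, size=(1000, 4))        # bulk of "normal" transactions
outliers = rng.normal(6, 1, size=(10, 4))        # a few extreme observations
X = np.vstack([normal, outliers])

iso = IsolationForest(n_estimators=200, contamination=0.01, random_state=0)
labels = iso.fit_predict(X)                      # -1 = anomaly, +1 = normal
scores = iso.decision_function(X)                # lower score = more anomalous

print("flagged anomalies:", (labels == -1).sum())
print("most anomalous rows:", np.argsort(scores)[:5])
```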

  • Quantitative Evaluation

    | Metric | Meaning |
    |---|---|
    | Precision / Recall | Trade‑off between missed and false detections |
    | ROC / PR Curve | Evaluates model discrimination if partial labels exist |
    | Economic Validation | Check whether flagged anomalies align with known events (e.g., the 2008 financial crisis, the COVID shock) |
Financial Machine Learning · Lecture 2
  • Economic Interpretation

    • Fraud Detection:
      • Analyze transaction graphs; find users or accounts with rare patterns.
      • Use anomaly score as part of risk indicator dashboard.
    • Crisis Early Warning:
      • Monitor macro indicators under PCA‑reduced space; outliers → potential systemic risk periods.
      • Combine with thresholds on volatility, spread indices, or credit leverage.

    In both cases, anomalies = “weak signals” preceding major events.

  • Hybrid & Practical Systems

    • Hybrid Pipelines: Combine unsupervised detection + rule‑based alerts + expert feedback loop.
    • Temporal Aspect: Rolling‑window anomaly scoring reveals how risk evolves over time.
    • Visualization: heatmaps or dynamic score charts help decision‑makers interpret sudden deviations.
Financial Machine Learning · Lecture 2

Summary — Unsupervised Learning in Economic Research

  • What We Learned

    • Unsupervised learning ≠ prediction — it is about discovery.
    • Throughout this lecture we explored how data—without explicit labels—can reveal structure, patterns, and signals.

    | Method Family | Main Goal | Economic Meaning |
    |---|---|---|
    | Clustering | Group similar observations | Market segmentation / structural regime identification |
    | Dimensionality Reduction (PCA / t‑SNE) | Compress information, extract latent components | Factor extraction / risk decomposition |
    | Association Rules | Discover co‑occurrence logic | Behavior linkages / policy indicator relations |
    | Anomaly Detection | Identify irregular samples | Fraud screening / crisis early signals |

    The common thread: finding order within apparent randomness.

Financial Machine Learning · Lecture 2

  • Conceptual Integration

    • Pattern Discovery
      • Unsupervised methods uncover the geometry of data — its clusters, directions, and density.
      • These become the foundation for subsequent supervised models or economic interpretation.
    • Data Representation: Techniques like PCA and embedding methods translate messy, overlapping variables into interpretable, compact dimensions.
    • Signal Detection: Anomaly and association approaches generate early signals that traditional econometrics might miss.
  • Advantages in Economic Analysis

    | Aspect | Value Added by Unsupervised Learning |
    |---|---|
    | Exploratory power | Reveals latent structures before setting hypotheses |
    | Scalability | Handles large, multi‑dimensional datasets |
    | Adaptivity | Works even with limited or no labeled data |
    | Complementarity | Enhances traditional econometric models (e.g., factor analysis, structural breaks) |

    Especially valuable in “data‑rich, theory‑light” contexts like finance and policy analytics.

Financial Machine Learning · Lecture 2
  • Methodological Reflection

    • Interpretability Matters: economic meaning must accompany statistical output.
    • No Free Lunch: different algorithms suit different data shapes; validation is essential.
    • Hybrid Modeling: combining unsupervised representation with supervised prediction enhances robustness.
    • Temporal Dimension: tracking clusters or anomaly scores over time reveals dynamics, not just static patterns.
  • Practical Implementation Checklist

    • Data preprocessing — standardize, remove outliers
    • Choose algorithm suited to goal (grouping / reduction / association / detection)
    • Evaluate both statistical fit and economic logic
    • Visualize → Interpret → Validate with domain knowledge

    Always treat the algorithm as a lens, not as the truth itself.

Financial Machine Learning · Lecture 2

PCR - Core Idea

  • Two‑step procedure: "First compress by variance, then regress"
    • Apply Principal Component Analysis (PCA) to $X$: find the axes that capture the largest variance of $X$ (the data ellipsoid)
    • Use the leading $K$ components as regressors in OLS: this restricts the regression to the top‑$K$ PCA subspace
  • Key: PCR is blind to $y$ when building the subspace
  • Intuition:
    • If the signal for $y$ lies in the high‑variance directions of $X$, PCR works well
    • If $y$ depends on low‑variance directions, PCR can fail dramatically
Financial Machine Learning · Lecture 2

Let $X$ be centered; SVD: $X = U D V^\top$.

  • OLS estimator: $\hat\beta_{\text{OLS}} = (X^\top X)^{-1} X^\top y = V D^{-1} U^\top y$

  • PCR with $K$ components imposes $\beta \in \operatorname{span}(v_1, \dots, v_K)$:

    $$\hat\beta_{\text{PCR}} = V_K \,(V_K^\top X^\top X V_K)^{-1} V_K^\top X^\top y \;=\; V_K D_K^{-1} U_K^\top y$$

Equivalently, regress $y$ on the scores $T_K = X V_K$, then map back: $\hat\beta_{\text{PCR}} = V_K \hat\gamma$.

View as projection:

  • PCR solves $\min_\beta \|y - X\beta\|^2$ s.t. $\beta \in \operatorname{span}(v_1, \dots, v_K)$
  • Same as applying the projector $V_K V_K^\top$ to β‑space

Bias–variance mechanism:

  • Truncation removes ill‑conditioned directions (small singular values $d_j$), reducing variance
  • But it introduces bias if the removed directions matter for predicting $y$
Financial Machine Learning · Lecture 2

PCR - Algorithm

| Step | Description |
|---|---|
| 1 | Standardize $X$ |
| 2 | Compute the eigenvectors $P$ of $\Sigma_X = X^\top X / n$ |
| 3 | Keep the first $K$ principal components: $T = X P_K$ |
| 4 | Regress $y$ on $T$: $\hat\beta_T = (T^\top T)^{-1} T^\top y$ |
| 5 | Map back to the original features: $\hat\beta = P_K \hat\beta_T$ |

Choose $K$ via:

  • cross‑validation,
  • cumulative explained variance threshold.
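
A minimal sketch of this algorithm as a scikit-learn pipeline on simulated data; the sample size, the sparse coefficient vector, and K = 5 are illustrative assumptions.

```python
# PCR as a pipeline: standardize X, keep K principal components, regress y on them.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(4)
n, p, K = 300, 30, 5
X = rng.normal(size=(n, p))
beta = np.zeros(p); beta[:3] = [1.0, -0.5, 0.8]          # illustrative sparse signal
y = X @ beta + rng.normal(scale=0.5, size=n)

pcr = make_pipeline(StandardScaler(), PCA(n_components=K), LinearRegression())
scores = cross_val_score(pcr, X, y, cv=5, scoring="r2")
print(f"PCR with K={K}: mean CV R^2 = {scores.mean():.3f}")

# Map the fitted coefficients back to the (standardized) feature space: beta = P_K beta_T
pcr.fit(X, y)
P_K = pcr.named_steps["pca"].components_.T               # p x K loading matrix
beta_hat = P_K @ pcr.named_steps["linearregression"].coef_
print("first few recovered coefficients:", beta_hat[:5].round(2))
```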
Financial Machine Learning · Lecture 2

PCR - Financial Applications

| Area | Use of PCR |
|---|---|
| Asset pricing (factor extraction) | Identify latent risk factors (e.g., Connor & Korajczyk 1988) |
| Term‑structure modelling | Extract yield‑curve level / slope / curvature |
| Macro forecasting | Build "big data" factors (Stock & Watson 2002) |
| Credit risk | Summarize correlated firm indicators |

Interpretation: PCR emphasizes variance structure, not predictive correlation; it is best suited to descriptive factor discovery.

Works when:

  • Common components drive both the variance of $X$ and $y$ (e.g., macro "level/slope")
  • The factor structure is strong and aligned with the predictive signal

Fails when:

  • The predictive signal lies in low‑variance directions (e.g., subtle risk‑premium predictors)
  • Measurement noise inflates some variances unrelated to $y$

Takeaway:

  • PCR = Descriptive factor extraction; predictive only if variance aligns with signal
Financial Machine Learning · Lecture 2

PLS - Core Idea

PLS builds components using information in $y$.

First component:

$$w_1 = \arg\max_{\|w\|=1} \operatorname{Cov}(Xw,\, y)^2, \qquad t_1 = X w_1$$

  • $t_1$ is the score that best "co‑moves" with $y$
  • Subsequent components are computed after deflating $X$ and $y$

Interpretation:

  • PLS tilts the axes of $X$ towards the direction that predicts $y$
  • Balances capturing $X$'s structure and maximizing predictive covariance with $y$
Financial Machine Learning · Lecture 2

PLS — Core Idea (Algebraic & Algorithmic)

Component-wise (NIPALS/SIMPLS idea):

  1. $w \propto X^\top y$ (normalized), $t = X w$
  2. Regress $y$ on $t$ → $y$‑loading $q = t^\top y / (t^\top t)$
  3. Deflate: $X \leftarrow X - t p^\top$ with $p = X^\top t / (t^\top t)$, and $y \leftarrow y - t q$
  4. Iterate to $K$ components

Closed‑form model after $K$ components:

$$\hat\beta_{\text{PLS}} = W \,(P^\top W)^{-1} q$$

with $W$ (weights), $P$ ($X$‑loadings), $q$ ($y$‑loadings)

Krylov subspace view:

$$\hat\beta_{\text{PLS}}^{(K)} \in \mathcal{K}_K(X^\top X,\, X^\top y) = \operatorname{span}\{X^\top y,\ (X^\top X)X^\top y,\ \dots,\ (X^\top X)^{K-1}X^\top y\}$$

  • PLS restricts $\beta$ to this "predictive" subspace
  • As $K$ increases to $\operatorname{rank}(X)$, PLS → OLS
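
A minimal sketch of PLS on simulated data, using scikit-learn's PLSRegression (a NIPALS-based implementation); the data-generating process and K = 3 are illustrative assumptions.

```python
# PLS regression: components chosen to covary with y rather than by X-variance alone.
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(5)
n, p = 300, 30
X = rng.normal(size=(n, p))
y = X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.5, size=n)

pls = PLSRegression(n_components=3, scale=True)    # scale=True standardizes X and y
scores = cross_val_score(pls, X, y, cv=5, scoring="r2")
print(f"PLS with K=3: mean CV R^2 = {scores.mean():.3f}")

pls.fit(X, y)
print("x-weights W shape:", pls.x_weights_.shape)  # p x K weight matrix
print("implied coefficient shape:", pls.coef_.shape)
```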

Financial Machine Learning · Lecture 2

PLS vs PCR

Geometry

  • PCR: choose axes by $\operatorname{Var}(Xw)$ (unsupervised)
  • PLS: choose axes by $\operatorname{Cov}(Xw, y)$ (supervised)
  • With $K$ small, PLS often captures predictive structure faster than PCR
  • With $K = \operatorname{rank}(X)$, both recover OLS (in noiseless algebra), but PLS reaches useful predictors earlier

Link to related methods:

  • PLS vs CCA: CCA maximizes correlation with constraints on both sides; PLS maximizes covariance and builds sequential predictive scores

Shrinkage profiles:

  • PCR: hard truncation of small singular values (directions are discarded)
  • Ridge: continuous shrinkage (each direction scaled by $d_j^2 / (d_j^2 + \lambda)$)
  • PLS: data‑adaptive shrinkage tied to $X^\top y$ (non‑monotone in general)

Implication:

  • PLS can retain some low‑variance yet predictive directions
  • PCR cannot, unless those directions appear among top‑variance PCs
Financial Machine Learning · Lecture 2

Practical Implications in Finance

  • Asset‑pricing factors:

    • PCR → structural factors (level/slope/curvature) for interpretation
    • PLS → predictive factors for returns/risk premia
  • Macro forecasting:

    • PCR → “big data” indices summarizing broad variation
    • PLS → indices targeted to forecast a specific target (e.g., inflation, excess returns)
  • Credit risk:

    • PLS often outperforms when many correlated ratios weakly predict PD
Financial Machine Learning · Lecture 2

Choosing $K$ (Both PCR and PLS)

  • Cross‑validation on forecasting loss (out‑of‑sample R², MSPE); a minimal cross‑validation sketch follows the preprocessing notes below
  • Information criteria adapted to components (e.g., BIC on the regression of $y$ on $T_K$)
  • Stability checks across subsamples/horizons
  • For time series: align $y_{t+h}$ with $X_t$; beware serial dependence in CV folds

Preprocessing:

  • Always standardize X (and y if multi‑response)
  • Consider de‑meaning by firm/time fixed effects when appropriate
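
A minimal sketch of choosing K by out-of-sample MSPE with time-ordered folds (TimeSeriesSplit); the simulated predictors, the target construction, and the candidate K grid are illustrative assumptions.

```python
# Choose K for PLS by out-of-sample MSPE with time-series cross-validation.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import TimeSeriesSplit, cross_val_score

rng = np.random.default_rng(6)
T, p = 400, 25
X = rng.normal(size=(T, p))                                  # predictors observed at t
y = X[:, :2].sum(axis=1) + rng.normal(scale=1.0, size=T)     # stand-in for y_{t+h}

tscv = TimeSeriesSplit(n_splits=5)                           # respects temporal ordering
for K in (1, 2, 3, 5, 8):
    model = make_pipeline(StandardScaler(), PLSRegression(n_components=K))
    mspe = -cross_val_score(model, X, y,
                            cv=tscv, scoring="neg_mean_squared_error").mean()
    print(f"K={K}: out-of-sample MSPE = {mspe:.3f}")
```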
Financial Machine Learning · Lecture 2

PLS – Intuition and Comparison with PCR


| Feature | PCR | PLS |
|---|---|---|
| Uses $y$ info to extract components | No | Yes |
| Objective | maximize $\operatorname{Var}(Xw)$ | maximize $\operatorname{Cov}(Xw, y)^2$ |
| Predictive power | Medium | Higher |
| Interpretability | High | Moderate |
| Typical goal | Data summarization | Forecasting |

In financial econometrics PLS often yields factors that better forecast returns or macro variables.

Financial Machine Learning · Lecture 2

PCR vs. PLS vs. Regularized Methods


| Method | Dimension Reduction Mechanism | Uses $y$ Info | Variable Selection | Interpretability | Prediction Accuracy |
|---|---|---|---|---|---|
| PCR | PCA on $X$ (top‑variance components) | No | No | High | Moderate |
| PLS | Components maximizing $\operatorname{Cov}(Xw, y)$ | Yes | No | Medium | High |
| LASSO | $\ell_1$ penalty (shrink & select) | Yes | Yes | Sparse | High |
| Ridge | $\ell_2$ penalty (shrink only) | Yes | No | Stable | Medium |
| Elastic Net | $\ell_1 + \ell_2$ combined | Yes | Yes | Medium | High |
Financial Machine Learning · Lecture 2

Empirical Guidelines for Method Selection

| Research Goal | Data Structure | Recommended Method | Reason |
|---|---|---|---|
| Explain structural relations | Moderate dimensionality | PCR | Captures underlying data structure |
| Forecast $y$ with $X$ | High collinearity | PLS | Uses $y$ to guide factor extraction |
| Feature selection / large $p$ | Sparse relevant signals | LASSO / Elastic Net | Automatic variable selection |
| Stable estimates / collinearity | $p \approx n$, large | Ridge / PLS | Shrinkage stabilizes estimates |
| Mixed objectives (explain + predict) | High‑dimensional, noisy $X$ | Hybrid PLS + Regularization | Emerging trend in finance |


Financial Machine Learning · Lecture 2

Summary — Core Ideas

  • PCR: project β onto top‑variance subspace of X; great for structure, risky for prediction if signal ≠ variance
  • PLS: sequentially extract X‑directions that maximize covariance with y; typically more predictive with small K
  • Both reduce variance via dimension control; PLS uses Y to steer the subspace

Rule of thumb:

  • Want interpretable structure → PCR
  • Want early predictive power with few components → PLS
  • If p ≫ n and selection matters, compare with LASSO/EN; hybrid PLS + regularization is a strong baseline
Financial Machine Learning · Lecture 2

Summary and Readings

  • Summary · Lecture 02

    | Topic | Essence | Financial Applications |
    |---|---|---|
    | Regression | Linear + regularized models for continuous targets | Return & risk prediction |
    | Classification | Binary decisions based on features | Credit scoring, fraud detection |
    | Tree‑Based Models | Ensemble methods (GBM, RF) for accuracy & interpretability | PD modeling, risk rating |
    | Unsupervised Learning | Clustering & PCA to find hidden patterns | Regime analysis, factor extraction |

    Shallow learning forms the foundation for later Deep Learning methods.

  • Recommended Readings

    • James, G., Witten, D., Hastie, T., Tibshirani, R., & Taylor, J. An Introduction to Statistical Learning with Applications in Python. Springer Cham, 2023.
    • Murphy, K. P. Probabilistic Machine Learning: An Introduction. MIT Press, 2022.
    • Gaillac, C., & L'Hour, J. Machine Learning for Econometrics. Oxford University Press, 2025.
Financial Machine Learning · Lecture 2

Final Takeaways

  • Shallow learning algorithms form the core foundation of financial prediction and decision‑making.
  • Regression, classification, tree‑based models, and unsupervised learning jointly support structured analysis and interpretable modeling.
  • Next lecture: Deep Learning and Representation Learning in Finance
      → using CNNs, RNNs, and autoencoders to handle time‑series and unstructured financial data.
Financial Machine Learning · Lecture 2

- Ridge tends to keep all variables but shrinks their coefficients toward zero.
- LASSO performs explicit feature selection.
- Elastic Net balances the bias–variance trade‑off between the two.

> Example: combining a macro‑theory model with a data‑driven model to improve forecast stability.

> Together, these methods form a **continuum** from *variance control* → *bias correction* → *model diversification*.

> Think of unsupervised learning as *asking data to tell its own story.*

> It answers the question: “Who looks like whom?”

- **K‑Means**: partitions space around centroids.
- **Hierarchical**: builds a dendrogram; you "cut" it at a chosen level.
- **DBSCAN**: groups dense points, labels sparse outliers as noise.