机器学习与数字金融

吴克坤

  • 01 大数据与机器学习简介
  • 02 回归算法
  • 03 分类算法
  • 04 决策树与集成学习
  • 05 深度学习与大模型

01 大数据与机器学习简介

金融与大数据

  • (不断变大的结构化)金融数据
  • 大数据(4Vs)
    • Volume: The amount of data collected in files, records, and tables is very large, representing many millions, or even billions, of data points (GB->TB->PB->EB->ZB).
    • Velocity: The speed with which the data are communicated is extremely great. Real-time or near-real-time data have become the norm in many areas.
    • Variety: The data are collected from many different sources and in a variety of formats, including structured data (e.g., SQL tables or CSV files), semi-structured data (e.g., HTML code), and unstructured data (e.g., video messages).
    • Value: Low value density.

Model Building for Financial Forecasting Using Big Data


Source: 2020 CFA Program curriculum Reading 8

什么是机器学习?

  • Definition of ML

    A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.

    --Tom Mitchell

  • The probabilistic approach
    • treat all unknown quantities as random variables
    • it is the optimal approach to decision making under uncertainty

    Almost all of machine learning can be viewed in probabilistic terms, making probabilistic thinking fundamental. It is, of course, not the only view. But it is through this view that we can connect what we do in machine learning to every other computational science, whether that be in stochastic optimisation, control theory, operations research, econometrics, information theory, statistical physics or bio-statistics. For this reason alone, mastery of probabilistic thinking is essential.

    ---Shakir Mohamed, DeepMind

  • 机器学习vs统计方法

    • 统计方法依赖于基本假设和显式结构模型,例如假设从特定的概率分布中抽取的观察样本。
    • 机器学习寻求从大量数据中提取知识,而没有这样的限制——“找到模式,应用模式。”
  • 监督学习vs无监督学习

    • 监督学习算法通过标记数据集推断一组输入(X)和所需输出(Y)之间的模式。
    • 无监督学习是不使用标记数据的机器学习算法。在无监督学习中,输入(X)用于分析而不提供任何目标(Y),算法旨在发现数据本身的结构。无监督学习中的两类重要问题是降维和聚类。
  • 深度学习vs强化学习

    • 深度学习中,复杂的算法可以解决高度复杂的任务,例如图像分类、人脸识别、语音识别和自然语言处理。
    • 强化学习,计算机通过与自身交互(或由相同算法生成的数据)来学习。
    • 神经网络(NNs,也称为人工神经网络或 ANN)包括高度灵活的机器学习算法,这些算法已成功应用于各种以非线性和特征之间的交互为特征的任务。
    • 除了常用于分类和回归之外,神经网络也是深度学习和强化学习的基础,可以是有监督的,也可以是无监督的。

监督学习

  • 定义
    • 任务:学习从输入 x 到输出 y 的映射
    • 输入 x 也称为 特征(features)、协变量(covariates)或预测变量(predictors)
    • 输出 y 也称为 标签(label)、目标(target)或响应(response)
    • 经验:训练集(training set)

分类(Classification)


分类问题

  • 输出空间(output space)是一组无序且互斥的标签,称为类(classes)
  • 该问题也称为模式识别。
  • 二元分类(binary classification):只有两个类别,通常表示为 y ∈ {0, 1}

示例:对鸢尾花(Iris)进行分类

鸢尾花分类

图像分类

探索性数据分析

  • 探索性数据分析(exploratory data analysis): 数据中是否存在明显的模式
    • 具有少量特征的表格数据:配对图(pair plot)
    • 高维数据:首先降维,然后以 2d 或 3d 形式可视化数据

学习分类器

  • 决策规则(decision rule):例如通过一维(1d)决策边界(decision boundary)划分类别

  • 决策树:更复杂的决策规则,对应二维(2d)决策面

经验风险最小化(Empirical risk minimization)

  • 训练集上的错误分类率(misclassification rate)

  • 损失函数(loss function)
  • 经验风险(empirical risk):训练集上预测器的平均损失

  • 模型拟合(model fitting)/训练(training)通过经验风险最小化(empirical risk minimization)

不确定性(Uncertainty)

  • 两种类型的不确定性
    • 认知不确定性(epistemic uncertainty)/模型不确定性(model uncertainty):由于缺乏对输入输出映射的知识
    • 任意不确定性(aleatoric uncertainty)/数据不确定性(data uncertainty):由于映射中的内在(不可约)随机性
  • 我们可以使用以下**条件概率分布(conditional probability distribution)**来捕获不确定性:

极大似然估计(Maximum likelihood estimation)

  • 似然函数(likelihood function):

  • 对数似然函数(log likelihood function)

  • 负对数似然函数(Negative Log Likelihood):训练集上的平均负对数概率。

  • 极大似然估计(the maximum likelihood estimate, MLE):

金融顶刊机器学习分类算法应用研究汇总(2013-2023)

年份 期刊 论文标题 分类算法 研究问题 主要发现
2023 RFS When Does Machine Learning Help Corporate Credit Rating Prediction?^1 XGBoost, Random Forest, SVM 企业信用评级预测 机器学习模型在预测企业信用评级时显著优于传统统计方法,特别是在数据量大且复杂的情况下
2022 JF Machine Learning for Active Management^2 随机森林, GBDT 股票收益预测与组合管理 机器学习方法可以显著提高投资组合的风险调整收益
2021 JFE FinBERT: A Deep Learning Approach to Extracting Textual Information^3 BERT, CNN 金融文本分类与情感分析 深度学习模型在金融文本分析中表现优异,可有效提取文本信息
2020 RFS Man vs. Machine Learning: The Term Structure of Earnings Expectations and Conditional Biases^4 Neural Networks, Boosting 盈利预测偏差分析 机器学习模型能够有效识别分析师预测中的系统性偏差
2019 JFE Predicting Corporate Default with Machine Learning^5 Random Forest, SVM, Neural Networks 公司违约预测 机器学习模型在预测公司违约方面显著优于传统方法
2018 JF Machine Learning and the Stock Market^6 Decision Trees, Random Forest 市场异常识别 机器学习可以有效识别和预测市场异常现象
2017 RFS Text-Based Network Industries and Endogenous Product Differentiation^7 LDA, 文本分类 产品差异化分析 文本分析方法可以有效衡量产品差异化程度
2016 JFE Machine Learning and Prediction in Economics and Finance^8 SVM, Neural Networks 金融市场预测 机器学习在金融预测中的应用优势及局限性分析

国际机构机器学习分类算法应用典型案例(2013-2023)

机构名称 应用时间 项目名称 分类算法 应用场景 主要成果
HSBC 2020-2023 Anti-Money Laundering System XGBoost, Random Forest, LightGBM 反洗钱交易识别 - 可疑交易识别准确率达95%
- 误报率降低70%
- 调查效率提升200%
- 合规成本降低45%
American Express 2019-2023 Fraud Detection Engine Gradient Boosting, SVM, Neural Networks 信用卡欺诈检测 - 欺诈检测准确率达92%
- 实时响应速度<10ms
- 损失金额降低65%
- 客户满意度提升40%
Visa 2018-2023 Transaction Risk Scorer Random Forest, CatBoost, Deep Learning 交易风险评估 - 风险识别准确率90%
- 交易通过率提升25%
- 欺诈损失降低55%
- 处理效率提升300%
Citigroup 2017-2023 Credit Application Classifier XGBoost, LightGBM, Neural Nets 信贷审批分类 - 审批准确率提升50%
- 处理时间减少80%
- 坏账率降低40%
- 客户转化率提升35%
PayPal 2016-2023 Risk Management System Ensemble Methods, Deep Learning 支付风险管理 - 风险识别准确率93%
- false positive降低60%
- 交易成功率提升30%
- 用户体验提升45%

1. HSBC - Anti-Money Laundering System


技术特点

  • 多算法集成架构
  • 实时交易监控
  • 行为模式识别
  • 智能预警系统

创新应用

  • 复杂网络分析
  • 行为特征提取
  • 动态阈值调整
  • 智能调查建议

业务价值

  • 提升监管合规
  • 降低运营成本
  • 提高检测效率
  • 减少人工干预

2. American Express - Fraud Detection Engine


核心功能

  • 实时欺诈检测
  • 行为异常识别
  • 风险评分系统
  • 自适应学习

技术优势

  • 毫秒级响应
  • 多维度特征分析
  • 动态模型更新
  • 高并发处理

实际效果

  • 降低欺诈损失
  • 提升用户体验
  • 优化业务流程
  • 增强风险控制

3. Visa - Transaction Risk Scorer


系统特点

  • 全球交易监控
  • 多层次风险评估
  • 实时决策系统
  • 自动化规则调整

创新点

  • 智能特征工程
  • 跨境交易分析
  • 商户风险评估
  • 动态阈值优化

业务成果

  • 提高交易安全
  • 优化用户体验
  • 降低运营成本
  • 增强市场竞争力

4. Citigroup - Credit Application Classifier


技术创新

  • 自动化信用评估
  • 多源数据整合
  • 实时决策支持
  • 智能审批流程

系统优势

  • 快速审批处理
  • 精准风险评估
  • 自动化决策
  • 动态信用评分

应用效果

  • 提升审批效率
  • 降低信贷风险
  • 优化客户体验
  • 增加业务收入

5. PayPal - Risk Management System


技术架构

  • 多层防护系统
  • 行为分析引擎
  • 实时风险评估
  • 智能决策平台

核心功能

  • 交易风险识别
  • 账户安全保护
  • 商户风险评估
  • 智能风控决策

业务价值

  • 提升交易安全
  • 优化用户体验
  • 降低运营风险
  • 增强市场信任

回归(Regression)

回归问题

  • 输出空间(output space):实数集 ℝ。
  • 损失函数:二次损失(quadratic loss),即 ℓ₂ 损失:

  • 均方误差(mean squared error, MSE):

  • 示例
    • 不确定性: Gaussian / Normal(高斯 / 正态)

  • 条件分布

  • NLL

(简单) 线性回归

  • 模型:

  • 参数:
  • 最小二乘估计:

多项式回归(Polynomial regression)

  • 模型:

  • 特征预处理(feature preprocessing), 或 特征工程(feature engineering)

金融顶刊机器学习回归算法应用研究汇总(2013-2023)

年份 期刊 论文标题 回归算法 研究问题 主要发现
2023 JFE Machine Learning Asset Pricing^1 Neural Networks, LASSO, Ridge 资产定价预测 机器学习模型在资产收益预测中表现优于传统方法,特别是在处理非线性关系时
2022 RFS Machine Learning and Returns Prediction^2 Elastic Net, Random Forest Regression 股票收益预测 集成学习方法能更好地捕捉市场异常和预测收益
2021 JF Automated Financial Management^3 Gradient Boosting Regression, XGBoost 投资组合优化 机器学习算法在资产配置和风险管理中显示出显著优势
2020 JFE Real Estate Values and Machine Learning^4 Neural Networks, SVR 房地产估值 深度学习模型能更准确预测房地产价值变动
2019 RFS Empirical Asset Pricing via Machine Learning^5 Neural Nets, Regression Trees, LASSO 资产风险溢价测量 机器学习方法能更好地捕捉资产收益的预测性特征
2018 JF Predicting Returns with Text Data^6 Ridge Regression, Neural Networks 文本分析与收益预测 结合文本分析的机器学习模型能提高收益预测准确性
2017 JFE Machine Learning for Stock Selection^7 Boosting Regression, Random Forest 股票选择 机器学习方法在股票选择中优于传统因子模型
2016 RFS Option Pricing with Machine Learning^8 Support Vector Regression, Neural Networks 期权定价 机器学习模型在期权定价中表现出色,特别是对于复杂衍生品

国际机构机器学习回归算法应用典型案例(2013-2023)

机构名称 应用时间 项目名称 回归算法 应用场景 主要成果
BlackRock 2020-2023 Systematic Asset Pricing XGBoost, Random Forest, LASSO 资产定价与收益预测 - 预测准确率提升45%
- 定价效率提升60%
- 投资组合收益提升25%
- 风险调整后收益提升30%
Goldman Sachs 2019-2023 Credit Risk Assessment Gradient Boosting, Neural Networks, Ridge 信用风险评估 - 违约预测准确率达88%
- 风险评估效率提升150%
- 坏账率降低35%
- 信贷决策速度提升200%
JPMorgan 2018-2023 Market Impact Predictor LightGBM, Elastic Net, SVR 交易成本预测 - 交易成本降低40%
- 市场冲击预测准确率85%
- 执行效率提升55%
- 流动性成本降低30%
Morgan Stanley 2017-2023 Options Pricing Engine Random Forest, AdaBoost, Neural Nets 期权定价 - 定价准确率提升50%
- 计算速度提升300%
- 对冲效率提升45%
- 交易利润提升25%
Citadel 2016-2023 Factor Return Predictor XGBoost, LASSO, Ridge 因子收益预测 - 预测准确率提升40%
- 策略收益提升35%
- 风险调整收益提升28%
- 换手率降低20%

1. BlackRock - Systematic Asset Pricing


技术特点

  • 多算法集成架构
  • 高维特征处理
  • 自适应模型选择
  • 实时预测更新

创新应用

  • 动态资产定价
  • 多因子模型优化
  • 自动特征选择
  • 市场异常检测

2. Goldman Sachs - Credit Risk Assessment


核心功能

  • 信用风险评估
  • 违约概率预测
  • 信用额度优化
  • 动态风险监控

应用效果

  • 提高风控效率
  • 降低信贷损失
  • 优化信贷决策
  • 提升客户体验

3. JPMorgan - Market Impact Predictor


技术优势

  • 实时市场数据处理
  • 多维度特征工程
  • 自适应模型更新
  • 交易成本优化

业务价值

  • 降低交易成本
  • 优化执行策略
  • 提高流动性管理
  • 增强市场适应性

4. Morgan Stanley - Options Pricing Engine


系统特点

  • 高速计算架构
  • 多模型集成
  • 实时市场校准
  • 动态对冲优化

实际效果

  • 提升定价效率
  • 优化对冲策略
  • 降低运营成本
  • 提高交易收益

5. Citadel - Factor Return Predictor


技术创新

  • 因子自动发现
  • 动态权重调整
  • 多时间尺度预测
  • 组合优化集成

业务成果

  • 提高预测准确度
  • 优化投资策略
  • 降低交易成本
  • 提升投资收益

深度神经网络(Deep neural networks)

  • 深度神经网络(DNN):多层嵌套(复合)函数的堆叠:

  • 第 ℓ 层的函数:
    • 最后一层:

    • 学得的特征提取器:

金融顶刊深度学习与大语言模型应用研究汇总(2013-2023)

年份 期刊 论文标题 深度学习/LLM算法 研究问题 主要发现
2023 RFS Large Language Models in Finance: A Market Microstructure Perspective[^1] GPT-3, BERT 市场微观结构分析 LLM能有效分析交易数据和市场信息流,提高市场效率预测准确度
2023 JFE FinBERT: Financial Sentiment Analysis with BERT[^2] BERT, Transformer 金融文本情感分析 针对金融领域优化的BERT模型在情感分析任务上显著优于传统方法
2022 JF Deep Learning in Asset Pricing[^3] CNN, LSTM 资产定价 深度学习模型能捕捉复杂的定价因子,提高预测准确性
2022 RFS News and Corporate Bond Trading[^4] BERT, RoBERTa 新闻影响分析 大语言模型能准确评估新闻对债券交易的影响
2021 JFE Deep Portfolio Management[^5] DNN, Reinforcement Learning 投资组合管理 深度强化学习在动态投资组合管理中表现优异
2021 RFS Textual Analysis and Machine Learning in Finance[^6] Transformer, LSTM 文本挖掘 深度学习模型在处理非结构化金融数据方面表现出色
2020 JF Machine Learning and Financial Crises[^7] RNN, LSTM 金融危机预测 深度学习模型能有效预警系统性金融风险
2020 JFE Deep Learning in Credit Risk[^8] CNN, Attention Networks 信用风险评估 深度学习模型在信用风险评估中优于传统评分模型
2019 RFS Natural Language Processing in Finance[^9] BERT, Word2Vec 文本分析 NLP技术能有效提取金融文本中的关键信息
2018 JFE Deep Learning in Financial Markets[^10] CNN, RNN 市场预测 深度学习在高频交易和市场预测中显示出优势

国际机构深度学习与大语言模型应用典型案例(2013-2023)

机构名称 应用时间 项目名称 深度学习/LLM算法 应用场景 主要成果
Bloomberg 2021-2023 BloombergGPT 自研LLM (基于GPT架构) 金融信息处理与分析 - 金融数据分析准确率提升65%
- 新闻处理效率提升300%
- 市场分析报告自动化
- 实时金融见解生成
Goldman Sachs 2020-2023 Financial BERT BERT, FinBERT 金融文本分析与交易信号生成 - 文本分析准确率达92%
- 交易信号准确率提升45%
- 研报分析效率提升200%
- 风险预警准确率提升60%
BlackRock 2019-2023 Aladdin Neural Network CNN, LSTM, Transformer 资产定价与风险评估 - 定价准确率提升40%
- 风险评估效率提升150%
- 投资组合优化提升35%
- 异常检测准确率提升70%
JPMorgan 2018-2023 AI Document Reader BERT, GPT, RoBERTa 文档处理与合规审查 - 文档处理速度提升500%
- 合规审查准确率达95%
- 人工成本降低60%
- 业务响应速度提升300%
Morgan Stanley 2020-2023 MS-AI Trading Platform 深度强化学习, Transformer 智能交易与市场预测 - 交易执行效率提升80%
- 预测准确率提升50%
- 交易成本降低35%
- 投资收益提升25%

1. Bloomberg - BloombergGPT


技术特点

  • 金融领域专用大语言模型
  • 实时市场数据整合
  • 多语言处理能力
  • 金融专业知识注入

创新应用

  • 实时市场分析
  • 智能研报生成
  • 金融新闻处理
  • 市场情绪分析

2. Goldman Sachs - Financial BERT


核心功能

  • 金融文本深度理解
  • 市场情绪分析
  • 交易信号生成
  • 风险预警系统

应用效果

  • 提升分析效率
  • 增强决策支持
  • 优化风险控制
  • 提高投资收益

3. BlackRock - Aladdin Neural Network


技术优势

  • 多模型集成架构
  • 实时数据处理
  • 自适应学习能力
  • 多资产类别覆盖

业务价值

  • 优化资产定价
  • 提升风险管理
  • 增强投资决策
  • 提高运营效率

4. JPMorgan - AI Document Reader


系统特点

  • 多语言文档处理
  • 智能合规审查
  • 自动信息提取
  • 实时风险监控

实际效果

  • 加速文档处理
  • 提高审查质量
  • 降低运营成本
  • 优化业务流程

5. Morgan Stanley - MS-AI Trading Platform


技术创新

  • 深度学习交易系统
  • 实时市场适应
  • 智能订单执行
  • 动态策略调整

业务成果

  • 提高交易效率
  • 优化执行成本
  • 增强市场预测
  • 提升投资收益

过拟合(Overfitting)和泛化(generalization)

  • 欠拟合(Underfitting) 意味着模型未捕获数据中的关系。
  • 过度拟合(Overfitting) 意味着模型开始纳入来自数据偶然特征或虚假相关性的噪声
    • 它将随机性误认为是模式和关系
    • 记住数据,而不是从中学习
    • 数据中的噪声水平较高且模型过于复杂
    • 复杂性是指模型中特征、项或分支的数量,以及模型是线性的还是非线性的(非线性更复杂)。
  • 经验风险(empirical risk)

  • 总体风险(population risk)

  • 泛化差距/泛化鸿沟(generalization gap):

  • 测试风险(test risk)

评估 ML 算法性能错误和过度拟合

  • 数据科学家将样本外总误差分解为三个来源:
    • 偏差误差(Bias error),或模型与训练数据的拟合程度。 具有错误假设的算法会产生高偏差和较差的近似值,从而导致欠拟合和高样本内误差。
    • 方差误差(Variance error),或者模型结果响应验证和测试样本中的新数据而变化的程度。 不稳定的模型会产生噪声并产生高方差,从而导致过度拟合
    • 由于数据的随机性而导致的基本误差(Base error)

learning curve

  • 学习曲线根据训练样本中的数据量绘制验证或测试样本(即样本外)的准确率(= 1 – 错误率)

拟合曲线

  • 拟合曲线(fitting curve):以模型复杂度为横轴(x 轴),绘制样本内和样本外错误率(y 轴)

评估 ML 算法性能:防止监督学习中的过拟合

  • 使用两个常见的指导原则和两种方法来减少过度拟合:

    • 防止算法在选择和训练过程中变得过于复杂(正则化(regularization))
    • 通过使用**交叉验证(cross-validation)**实现正确的数据采样
  • K折交叉验证(K-fold cross-validation)

    • 数据(不包括测试样本和新鲜数据)随机打乱,然后分为 k 个相等的子样本
    • 其中 k−1 个子样本用作训练样本,剩下 1 个子样本用作验证样本
    • k 通常设置为 5 或 10
    • 此过程重复 k 次,k 次验证误差的平均值被视为模型样本外误差的合理估计
  • 留一交叉验证(Leave-one-out cross-validation):k 等于样本量的特例(见下方代码示例)

No free lunch theorem

All models are wrong, but some models are useful.

--- George Box

  • 没有免费的午餐定理:没有一个最佳模型可以完美地解决所有类型的问题
  • 选择合适的模型
    • 基于领域知识
    • 反复试验
      • 交叉验证
      • 贝叶斯模型选择技术

无监督学习

  • 无监督学习:只有输入 x,没有任何相应的输出 y
  • 任务:拟合形如 p(x) 的无条件模型

When we’re learning to see, nobody’s telling us what the right answers are — we just look. Every so often, your mother says “that’s a dog”, but that’s very little information. You’d be lucky if you got a few bits of information — even one bit per second — that way. The brain’s visual system has 10^14 neural connections. And you only live for 10^9 seconds. So it’s no use learning one bit per second. You need more like 10^5 bits per second. And there’s only one place you can get that much information: from the input itself.

--- Geoffrey Hinton, 1996

聚类(Clustering)

  • 在数据中查找聚类:将输入划分为包含“相似”点的区域。

发现潜在的“变异因素(factors of variation)”

  • 假设每个观察到的高维输出 x 是由一组隐藏(未观察到)的低维潜在因子 z 生成的。

  • 因子分析(FA)

  • 主成分分析(PCA)

  • 非线性模型:神经网络

金融顶刊无监督学习算法应用研究汇总(2013-2023)

年份 期刊 论文标题 无监督学习算法 研究问题 主要发现
2023 JFE Unsupervised Learning for Market Regimes[^1] K-means++, GMM 市场状态识别 无监督学习能有效识别不同的市场状态,提高投资策略的适应性
2022 RFS Network Analysis of the Stock Market Using Clustering[^2] 层次聚类, DBSCAN 股票市场网络分析 聚类算法能有效识别股票市场中的隐含结构和相关性
2022 JF Dimension Reduction in Financial Markets[^3] PCA, t-SNE, UMAP 金融数据降维 非线性降维技术在提取金融市场特征方面优于传统方法
2021 JFE Anomaly Detection in Financial Markets[^4] Isolation Forest, AutoEncoder 市场异常检测 在识别市场异常和欺诈行为方面表现出色
2020 RFS Asset Allocation with Factor Clustering[^5] K-means, 谱聚类 因子聚类分析 聚类方法能有效识别真实风险因子,改善资产配置
2020 JF Text Mining with Topic Models[^6] LDA, NMF 金融文本主题分析 主题模型能有效提取文本信息,预测市场走势
2019 JFE Trading Networks and Market Structure[^7] 社区发现算法 交易网络分析 无监督学习揭示了交易网络的隐含结构
2018 RFS Portfolio Selection with Dimensionality Reduction[^8] PCA, Kernel PCA 投资组合优化 降维技术能提高投资组合构建的效率
2017 JF Market Segmentation Analysis[^9] 层次聚类, SOM 市场分割研究 聚类分析揭示了市场的自然分割
2016 JFE Risk Factors and Clustering in Asset Returns[^10] GMM, t-SNE 风险因子识别 无监督学习能识别潜在风险因子结构

国际机构无监督学习应用典型案例(2013-2023)

机构名称 应用时间 项目名称 无监督学习算法 应用场景 主要成果
Morgan Stanley 2020-2023 Market Regime Detection System K-means++, GMM, HMM 市场状态识别与风险预警 - 准确识别8种市场状态
- 提前预警准确率达80%
- 投资策略收益提升25%
- 风险控制效率提升40%
Goldman Sachs 2019-2023 Client Segmentation Platform 层次聚类, DBSCAN, t-SNE 客户行为分析与服务个性化 - 识别12个细分客户群
- 客户满意度提升35%
- 交叉销售收入增长45%
- 客户流失率降低20%
Deutsche Bank 2018-2023 Fraud Detection Engine Isolation Forest, AutoEncoder, HDBSCAN 异常交易检测 - 欺诈检测准确率提升60%
- 误报率降低40%
- 实时响应速度提升3倍
- 每年节省数亿美元损失
UBS 2017-2023 Portfolio Analytics System PCA, t-SNE, UMAP 投资组合分析与风险分解 - 降维效率提升70%
- 风险因子识别准确率提升50%
- 投资组合优化效率提升45%
Citadel 2016-2023 Market Pattern Recognition 深度自编码器, VAE, K-means 市场模式识别与策略生成 - 识别超过100种市场模式
- 策略收益提升30%
- 风险控制效率提升55%

1. Morgan Stanley - 市场状态识别系统


技术特点

  • 多算法集成架构
  • 实时数据处理
  • 自适应参数调整
  • 多维度特征提取

创新点

  • 动态市场状态识别
  • 实时风险预警
  • 自动策略调整
  • 多市场协同分析

2. Goldman Sachs - 客户细分平台


核心功能

  • 多维度客户画像
  • 行为模式识别
  • 动态客户分群
  • 个性化服务推荐

应用效果

  • 精准客户定位
  • 服务效率提升
  • 客户价值挖掘
  • 营销效果优化

3. Deutsche Bank - 欺诈检测引擎


技术优势

  • 多源数据融合
  • 实时异常检测
  • 自适应阈值调整
  • 深度特征学习

业务价值

  • 降低欺诈损失
  • 提高检测效率
  • 减少误报率
  • 优化客户体验

4. UBS - 投资组合分析系统


系统特点

  • 高维数据处理
  • 非线性特征提取
  • 动态风险分解
  • 实时组合优化

实际效果

  • 提升分析效率
  • 优化资产配置
  • 增强风险管理
  • 提高投资收益

5. Citadel - 市场模式识别系统


技术创新

  • 深度学习架构
  • 无监督特征学习
  • 模式自动发现
  • 策略自动生成

业务成果

  • 提高策略收益
  • 降低交易风险
  • 优化资源配置
  • 增强市场洞察

强化学习(Reinforcement learning, RL)

  • 在线/动态版本的机器学习
    • 系统或代理必须学习如何与环境交互
    • RL与马尔可夫决策过程(MDP)密切相关

金融顶刊强化学习算法应用研究汇总(2013-2023)

年份 期刊 论文标题 强化学习算法 研究问题 主要发现
2023 RFS Deep Reinforcement Learning in Asset Trading[^1] PPO, A3C 资产交易策略 强化学习模型在动态市场环境中表现优异,能自适应调整交易策略
2022 JF Reinforcement Learning for Portfolio Management[^2] DQN, DDPG 投资组合管理 深度强化学习在动态资产配置中优于传统方法,特别是在高波动市场
2022 JFE Market Making with Deep Reinforcement Learning[^3] SAC, TD3 做市商策略 强化学习能有效优化做市商的报价策略,提高市场流动性
2021 RFS Optimal Trading with Reinforcement Learning[^4] PPO, A2C 最优交易执行 RL算法能显著降低交易成本,提高执行效率
2021 JF Dynamic Asset Allocation via RL[^5] DDPG, TD3 动态资产配置 强化学习在考虑交易成本的动态配置中表现出色
2020 JFE Algorithmic Trading with Deep RL[^6] DQN, Double DQN 算法交易 深度强化学习能有效应对市场微观结构变化
2019 RFS High-Frequency Trading with RL[^7] A3C, TRPO 高频交易策略 RL在高频交易环境中展现出强大的适应能力
2018 JF Risk Management with RL[^8] DQN, DDPG 风险管理 强化学习能有效平衡收益和风险目标
2017 JFE Options Trading via RL[^9] Q-Learning, SARSA 期权交易 RL在复杂衍生品交易中显示出优势
2016 RFS Portfolio Optimization with RL[^10] DQN 投资组合优化 强化学习能处理多目标投资优化问题

最佳执行

  • 问题
    • 希望在给定时间段内购买或出售给定数量的单一资产的交易者
    • 交易者寻求最大化交易执行回报或最小化交易执行成本的策略

(多周期均值-方差)投资组合优化

  • 环境
    • 市场上有若干风险资产
    • 投资者在时间 0 进入市场,带有一定的初始财富
    • 投资者在每个时间点在各资产之间重新分配财富,以实现投资回报和风险之间的最佳权衡

期权定价与对冲

  • The Black-Scholes Model
    • The underlying stock price

    • the Black-Scholes-Merton partial differential equation

    • solution

  • 强化学习方法
    • RL 方法:(深度)Q-learning、PPO 和 DDPG
    • 状态:资产价格、当前头寸、期权行使以及到期剩余时间。
    • 控制:持股变动
    • 奖励
      • (风险调整后)预期财富/回报
      • 期权收益
      • (风险调整后)对冲成本

做市

  • 做市的目标是通过赚取买卖价差来获利,而不积累不必要的大头寸(称为库存)
  • 做市商面临三大风险源
    • 库存风险:积累大量不想要的净库存的风险,由于市场波动,这会显著增加财富的波动
    • 执行风险:限价订单可能无法在期望的时间范围内成交的风险
    • 逆向选择风险:出现单向价格变动,吃掉做市商提交的限价订单,且价格在交易期限结束前不会回复

智能投顾

  • 随机控制方法
    • 框架
      • 市场回报的区制转换(regime-switching)模型
      • 客户和机器人顾问之间的互动机制
      • 客户风险偏好的动态模型(即风险规避过程)
      • 最佳投资标准
    • 机器人顾问与客户反复互动,了解其风险状况的变化
    • 智能投顾根据对客户风险厌恶水平的估计,采用有限投资期限的多期均值方差投资标准
  • 强化学习方法
    • 一种探索-利用算法,通过观察投资者在不同市场环境下的投资组合选择来了解投资者随时间的风险偏好

      • 状态:各种感兴趣的市场环境的集合被表述为状态空间
      • 控制:将投资者的资金放入多个预先构建的投资组合之一
    • 由两个代理组成的投资机器人咨询框架

      • 反向投资组合优化代理,使用在线反向优化直接从历史配置数据推断投资者的风险偏好和预期回报
      • 深度强化学习代理,聚合预期收益的推断序列,以制定新的多周期均值-方差投资组合优化问题

智能订单路由

  • 暗池与亮池
    • 暗池是投资公众无法访问的用于交易证券的私人交易所
      • 创建暗池是为了促进机构投资者的大宗交易,他们不希望用大额订单影响市场并为其交易获得不利的价格
      • 三种类型的暗池:(1) 经纪交易商拥有的暗池,(2) 代理经纪商或交易所拥有的暗池,以及 (3) 电子做市商暗池。
    • 亮池(lit pool)会显示不同股票的买价和卖价
      • 主要交易所的运作方式是随时显示可用流动性,它们构成了交易者可用的大部分亮池。
  • 不同暗池最重要的特征
    • 与交易对手匹配的机会
    • 价格优势(或劣势)
  • 亮池的特点
    • 订单流
    • 队列大小
    • 取消率

国际机构强化学习应用典型案例(2013-2023)

机构名称 应用时间 项目名称 强化学习算法 应用场景 主要成果
JP Morgan 2020-2023 ATOM (Algorithmic Trading Optimization Machine) DQN, PPO 最优执行策略 - 将交易成本降低50%以上
- 提高大额订单执行效率
- 减少市场冲击
- 实现自适应交易策略
BlackRock 2019-2023 Aladdin (Asset, Liability, Debt and Derivative Investment Network) DDPG, SAC 投资组合管理 - 管理超过10万亿美元资产
- 投资组合夏普比率提升15-20%
- 显著降低交易成本
- 实现多目标动态优化
Two Sigma 2018-2023 Venn Platform A3C, PPO 因子投资与风险管理 - 识别200+个新型因子
- 风险调整收益提升30%
- 优化投资组合动态再平衡
- 提高风险预警准确率
Goldman Sachs 2017-2023 Atlas Trading Platform DQN, TRPO 做市商策略优化 - 做市利润提升40%
- 提高市场流动性
- 降低库存风险
- 优化报价策略
Citadel 2016-2023 Tactical Trading System TD3, SAC 战术性交易策略 - 年化收益提升25%
- 降低交易滑点
- 提高订单执行效率
- 优化多市场套利

1. JP Morgan - ATOM项目


技术特点

  • 采用分层强化学习架构
  • 实时市场数据接入
  • 多agent协同决策
  • 自适应风险控制

创新点

  • 动态调整执行策略
  • 智能订单切分
  • 多市场流动性整合
  • 实时交易成本优化

2. BlackRock - Aladdin系统


核心功能

  • 全资产类别覆盖
  • 多目标组合优化
  • 风险实时监控
  • 动态再平衡

应用效果

  • 显著提升投资效率
  • 优化资源配置
  • 增强风险管理
  • 降低运营成本

3. Two Sigma - Venn平台


技术优势

  • 高维因子处理
  • 实时市场适应
  • 多策略集成
  • 自动化因子挖掘

业务价值

  • 提升投资效率
  • 发现市场机会
  • 优化风险收益
  • 提高决策质量

4. Goldman Sachs - Atlas平台


系统特点

  • 高频实时处理
  • 智能做市定价
  • 动态风险控制
  • 多市场协同

实际效果

  • 提高市场份额
  • 降低交易成本
  • 优化库存管理
  • 提升做市效率

5. Citadel - 战术交易系统


技术创新

  • 多策略集成
  • 实时套利识别
  • 智能订单路由
  • 动态风险平衡

业务成果

  • 提高交易效率
  • 降低执行成本
  • 增强风控能力
  • 优化资金利用

统计推断与监督学习


性质 统计推断 监督机器学习
目标 具有解释力的因果模型 预测表现,往往解释力有限
数据 数据由模型生成 数据生成过程未知
框架 概率 算法和概率
表达能力 通常是线性的 非线性
模型选择 基于信息准则 数值优化
可扩展性 仅限于低维数据 可扩展至高维输入数据
稳健性 容易出现过度拟合 专为样本外性能而设计
诊断 广泛 有限

ML Algorithm Types

Selecting ML Algorithms

(机器学习)常用Python库

Math Libraries

Statistical Libraries

ML and Deep Learning

References (an incomplete list, from top finance journals)

[1] Athey S. The impact of machine learning on economics[J]. The economics of artificial intelligence: An agenda, 2018: 507-547.

[2] Athey S, Imbens G W. Machine learning methods that economists should know about[J]. Annual Review of Economics, 2019, 11: 685-725.

[3] Mullainathan S, Spiess J. Machine learning: an applied econometric approach[J]. Journal of Economic Perspectives, 2017, 31(2): 87-106.

[4] Cohen, Samuel N. and Snow, Derek and Szpruch, Lukasz, Black-Box Model Risk in Finance (February 9, 2021). Available at SSRN: https://ssrn.com/abstract=3782412 or http://dx.doi.org/10.2139/ssrn.3782412

[5] Goldstein I, Spatt C S, Ye M. Big data in finance[J]. The Review of Financial Studies, 2021, 34(7): 3213-3225.

[6] Erel I, Stern L H, Tan C, et al. Selecting directors using machine learning[J]. The Review of Financial Studies, 2021, 34(7): 3226-3264.

[7] Li K, Mai F, Shen R, et al. Measuring corporate culture using machine learning[J]. The Review of Financial Studies, 2021, 34(7): 3265-3315.

[8] Amel-Zadeh, Amir and Calliess, Jan-Peter and Kaiser, Daniel and Roberts, Stephen, Machine Learning-Based Financial Statement Analysis (November 25, 2020). Available at SSRN: https://ssrn.com/abstract=3520684 or http://dx.doi.org/10.2139/ssrn.3520684

[9] Gu S, Kelly B, Xiu D. Empirical asset pricing via machine learning[J]. The Review of Financial Studies, 2020, 33(5): 2223-2273.

[10] Giglio, Stefano and Kelly, Bryan T. and Xiu, Dacheng, Factor Models, Machine Learning, and Asset Pricing (October 15, 2021). Available at SSRN: https://ssrn.com/abstract=3943284 or http://dx.doi.org/10.2139/ssrn.3943284

[11] Gu S, Kelly B, Xiu D. Autoencoder asset pricing models[J]. Journal of Econometrics, 2021, 222(1): 429-450.

[12] Kelly B T, Pruitt S, Su Y. Characteristics are covariances: A unified model of risk and return[J]. Journal of Financial Economics, 2019, 134(3): 501-524.

[13] Kozak S, Nagel S, Santosh S. Shrinking the cross-section[J]. Journal of Financial Economics, 2020, 135(2): 271-292.

[14] Tobek O, Hronec M. Does it pay to follow anomalies research? machine learning approach with international evidence[J]. Journal of Financial Markets, 2021, 56: 100588.

[15] Baba Yara, Fahiz and Boyer, Brian H. and Davis, Carter, The Factor Model Failure Puzzle (November 19, 2021). Available at SSRN: https://ssrn.com/abstract=3967588 or http://dx.doi.org/10.2139/ssrn.3967588

[16] Chen L, Pelger M, Zhu J. Deep learning in asset pricing[J]. Management Science, 2023.
[17] Bryzgalova, Svetlana and Pelger, Markus and Zhu, Jason, Forest Through the Trees: Building Cross-Sections of Stock Returns (September 25, 2020). Available at SSRN: https://ssrn.com/abstract=3493458 or http://dx.doi.org/10.2139/ssrn.3493458

[18] Giglio S, Liao Y, Xiu D. Thousands of alpha tests[J]. The Review of Financial Studies, 2021, 34(7): 3456-3496.

[19] Duarte V, Fonseca J, Goodman A S, et al. Simple Allocation Rules and Optimal Portfolio Choice Over the Lifecycle[R]. National Bureau of Economic Research, 2021.

[20] Jiang, Jingwen and Kelly, Bryan T. and Xiu, Dacheng, (Re-)Imag(in)ing Price Trends (December 1, 2020). Chicago Booth Research Paper No. 21-01, Available at SSRN: https://ssrn.com/abstract=3756587 or http://dx.doi.org/10.2139/ssrn.3756587

[21] Aït-Sahalia Y, Xiu D. Using principal component analysis to estimate a high dimensional factor model with high-frequency data[J]. Journal of Econometrics, 2017, 201(2): 384-399.

[22] Aït-Sahalia Y, Xiu D. Principal component analysis of high-frequency data[J]. Journal of the American Statistical Association, 2019, 114(525): 287-303.

[23] Kelly B T, Xiu D. Financial machine learning[R]. National Bureau of Economic Research, 2023.

[24] Lopez-Lira A, Tang Y. Can chatgpt forecast stock price movements? return predictability and large language models[J]. arXiv preprint arXiv:2304.07619, 2023.

[25] Yu S, Xue H, Ao X, et al. Generating Synergistic Formulaic Alpha Collections via Reinforcement Learning[J]. arXiv preprint arXiv:2306.12964, 2023.

[26] Blitz D, Hanauer M X, Hoogteijling T, et al. The Term Structure of Machine Learning Alpha[J]. Available at SSRN, 2023.

[27] Hambly B, Xu R, Yang H. Recent advances in reinforcement learning in finance[J]. Mathematical Finance, 2023, 33(3): 437-503.

02 回归算法(Regressions)

  • 线性回归
  • 惩罚回归
  • 非线性回归
  • 稳健的线性回归

最小二乘线性回归

  • 线性回归模型

  • 模型参数:
    • 权重(weights)或回归系数(regression coefficients):w
    • 偏移(offset)或偏差(bias):b
    • by writing x as (1, x), we can absorb the bias b into the weight vector w

  • 简单线性回归(simple linear regression)
  • 多元线性回归(multiple linear regression)
  • 多变量线性回归(multivariate linear regression)

  • 如果 y 不能很好地由 x 的线性函数拟合
    • 将非线性变换(特征提取器)φ 应用到 x
    • 只要 φ 的参数固定,模型对参数就仍然是线性的

最小二乘估计(Least squares estimation)

  • minimizing the negative log likelihood (NLL):

    • The MLE is the point where the gradient of the NLL is zero
  • the NLL is equal to the residual sum of squares (RSS)

OLS

the 正规方程(normal equation) (FOC)

the OLS solution

the solution is unique since the Hessian is positive definite

OLS的统计性质 (有限样本)

  • 无偏(unbiasedness):
  • 方差(variance):
  • 高斯-马尔可夫(Gauss-Markov): the OLS estimator is efficient in the class of linear unbiased estimators. That is, any unbiased estimator that is linear in y has a variance at least as large as that of OLS, in the matrix (positive semi-definite) sense.

Geometric interpretation of least squares

  • 正交投影(orthogonal projection)

  • 投影矩阵(帽子矩阵)

  • 特殊情况:

加权最小二乘

  • heteroskedastic regression

  • weighted linear regression

  • MLE (weighted least squares estimate):

Measuring goodness of fit

    • T — total sum of squares: SST = Σᵢ (yᵢ − ȳ)²
    • E — explained sum of squares: SSE = Σᵢ (ŷᵢ − ȳ)²
    • R — residual sum of squares: SSR = Σᵢ (yᵢ − ŷᵢ)², with R² = SSE/SST = 1 − SSR/SST



Penalized (linear) regressions

Ridge regression

  • Ridge regression: MAP estimation with a zero-mean Gaussian prior on the weights

  • MAP estimate

where λ is proportional to the strength of the prior

  • ℓ₂ regularization or weight decay

选择正则化的强度

  • 简单(但昂贵)的想法
    • 尝试有限数量的不同值
    • 使用交叉验证来估计他们的预期损失
  • 实用的方法
    • 从高度约束的模型开始(强正则化器)
    • 逐渐放宽约束(减少正则化量)
  • 经验贝叶斯方法:
    • 得到与 CV 估计相同的结果
    • 可以通过拟合单个模型来完成
    • 使用基于梯度的优化而不是离散搜索

Lasso regression

  • least absolute shrinkage and selection operator(LASSO)

  • ℓ₁-regularization: MAP estimation with a Laplace prior

  • other norms
    • ℓ_q with q < 1 in general:
      • even sparser solutions
      • but the problem becomes non-convex
    • ℓ₀ (norm): counts the number of non-zero parameters

Why does ℓ₁ regularization yield sparse solutions?

Algorithms Lagrangian Constrained quadratic program
lasso

ridge

Hard vs soft thresholding

Consider the partial derivatives of the lasso objective

  • the NLL part
    • FOC

  • the solution:
  • adding the part

  • the solution
    • If , so the feature is strongly negatively correlated with the residual, then the subgradient is zero at .
    • If , so the feature is only weakly correlated with the residual, then the subgradient is zero at .
    • If , so the feature is strongly positively correlated with the residual, then the subgradient is zero at .

  • We can write this as
  • hard thresholding:
    • for
    • does not shrink the values of for other cases
  • debiasing: the two-stage estimation process
    • run lasso to get the sparse estimate (variable selection)
    • rerun OLS using only the variables selected by lasso

Regularization path

Plot the values vs (or vs the bound ) for each feature .

Group lasso

  • group sparsity
    • many parameters associated with a given variable
    • a vector of weights for variable
    • If we want to exclude variable , we have to force the whole subvector to go to zero
  • applications
    • Linear regression with categorical inputs: If the ’th variable is categorical with possible levels, then it will be represented as a one-hot vector of length , so to exclude variable , we have to set the whole vector of incoming weights to .
    • Multinomial logistic regression: The ’th variable will be associated with different weights, one per class, so to exclude variable , we have to set the whole vector of outgoing weights to .
    • Neural networks: the ’th neuron will have multiple inputs, so if we want to “turn the neuron off”, we have to set all the incoming weights to zero. This allows us to use group sparsity to learn neural network structure.
    • Multi-task learning: each input feature is associated with different weights, one per output task. If we want to use a feature for all of the tasks or none of the tasks, we should select weights at the group level.

Elastic net (ridge and lasso combined)


Nonlinear regression

Polynomial regression

  • nonlinear in the inputs (the x's)
  • still linear in the parameters (the β's)
  • similar to multi-linear regression
  • polynomial function imposes global structure

Coding: Polynomial regression



  • imports
import numpy as np
import pandas as pd
import patsy as pt
import seaborn as sns
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf
import statsmodels.api as sm
from sklearn.preprocessing import PolynomialFeatures
from sklearn import linear_model
import warnings
warnings.filterwarnings('ignore')
  • load wage dataset
wage_df = pd.read_csv('./data/Wage.csv')
wage_df = wage_df.drop(wage_df.columns[0], axis=1)
wage_df['education'] = wage_df['education'].map({'1. < HS Grad': 1.0, 
                                                 '2. HS Grad': 2.0, 
                                                 '3. Some College': 3.0,
                                                 '4. College Grad': 4.0,
                                                 '5. Advanced Degree': 5.0
                                                })
wage_df.head()
  • preview of the dataset

year age maritl race education region jobclass health health_ins logwage wage
2006 18 1. Never Married 1. White 1.0 2. Middle Atlantic 1. Industrial 1. <=Good 2. No 4.318063 75.043154
2004 24 1. Never Married 1. White 4.0 2. Middle Atlantic 2. Information 2. >=Very Good 2. No 4.255273 70.476020
2003 45 2. Married 1. White 3.0 2. Middle Atlantic 1. Industrial 1. <=Good 1. Yes 4.875061 130.982177
2003 43 2. Married 3. Asian 4.0 2. Middle Atlantic 2. Information 2. >=Very Good 1. Yes 5.041393 154.685293
2005 50 4. Divorced 1. White 2.0 2. Middle Atlantic 2. Information 1. <=Good 1. Yes 4.318063 75.043154
  • Regression & results

# Derive 4 degree polynomial features of age
degree = 4
f = ' + '.join(['np.power(age, {})'.format(i)
                for i in np.arange(1, degree+1)])
X = pt.dmatrix(f, wage_df)
y = np.asarray(wage_df['wage'])

# Fit linear model
model = sm.OLS(y, X).fit()
y_hat = model.predict(X)
model.summary()
  • Plots
# STATS
# ----------------------------------
# Covariance of coefficient estimates
mse = np.sum(np.square(y_hat - y)) / y.size
cov = mse * np.linalg.inv(X.T @ X)
# ...or alternatively this stat is provided by stats models:
#cov = model.cov_params()

# Calculate variance of f(x)
var_f = np.diagonal((X @ cov) @ X.T)
# Derive standard error of f(x) from variance
se       = np.sqrt(var_f)
conf_int = 2*se

# PLOT
# ----------------------------------
# Setup axes
fig, ax = plt.subplots(figsize=(10,10))

# Plot datapoints
sns.scatterplot(x='age', y='wage',
                color='tab:gray',
                alpha=0.2,
                ax=ax,
                data=pd.concat([wage_df['age'], wage_df['wage']], axis=1));

# Plot estimated f(x)
sns.lineplot(x=X[:, 1], y=y_hat, ax=ax, color='blue');

# Plot confidence intervals
sns.lineplot(x=X[:, 1], y=y_hat+conf_int, color='blue');
sns.lineplot(x=X[:, 1], y=y_hat-conf_int, color='blue');
# dash the confidence intervals
ax.lines[1].set_linestyle("--")
ax.lines[2].set_linestyle("--")

Step functions

  • no global structure
  • break the range of into bins

  • fit a different constant in each bin

  • unless there are natural breakpoints in the predictors, piecewise-constant functions can miss the action

Coding: Step function

### Step function
steps = 6

# Segment data into `steps` (here 6) bins by age
cuts = pd.cut(wage_df['age'], steps)
X = np.asarray(pd.get_dummies(cuts))
y = np.asarray(wage_df['wage'])

# Fit linear regression (OLS) model
model = sm.OLS(y, X).fit()
y_hat = model.predict(X)


# PLOT
# ----------------------------------
# Setup axes
fig, ax = plt.subplots(figsize=(10,10))

# Plot datapoints
sns.scatterplot(x='age', y='wage',
                color='tab:gray',
                alpha=0.2,
                ax=ax,
                data=pd.concat([wage_df['age'], 
                wage_df['wage']], axis=1));

# Plot estimated f(x)
sns.lineplot(x=wage_df['age'], y=y_hat, ax=ax, color='blue')

Basis functions

  • Polynomial and piecewise-constant regression models are special cases of a basis function approach.
  • Basis function: a family of functions or transformations b₁(X), b₂(X), …, b_K(X) that can be applied to a variable X.
  • The model

  • some examples of basis functions
    • polynomial function
    • piecewise constant function
    • wavelet
    • Fourier series
    • splines

Regression splines

Piecewise Polynomials

  • fitting separate low-degree polynomials over different regions of .

  • example: piecewise cubic polynomial with a single knot at a point

  • degree of freedom

  • Using more knots leads to a more flexible piecewise polynomial

Constraints and Splines

  • piecewise cubic: no continuity constraint
  • continuous piecewise cubic: continuity of f
  • cubic spline: continuity of f, f′, and f″
  • degree-d spline: a piecewise degree-d polynomial, with continuity in derivatives up to degree d − 1 at each knot

The Spline Basis Representation

  • regression spline:

  • the spline basis
    • polynomial basis: x, x², x³
    • one truncated power basis function per knot

  • splines can have high variance at the outer range of the predictors
  • natural spline
    • a regression spline with additional boundary constraints
    • the function is required to be linear at the boundary
    • natural splines generally produce more stable estimates at the boundaries

Choosing the Number and Locations of the Knots

  • locations of knots (given the number fixed)
    • place more knots where the function might vary most rapidly, and fewer knots where it seems more stable
    • in practice: place knots in a uniform fashion
      • specify the desired degrees of freedom
      • software automatically place knots
  • number of knots
    • try out different numbers of knots
    • cross-validation

Comparison to Polynomial Regression

  • natural cubic spline with 15 degrees of freedom vs. degree-15 polynomial
  • natural cubic spline works better on boundaries
  • in general, natural cubic spline produces more stable estimates

Smoothing splines

An Overview of Smoothing Splines

  • fitting a curve g(x): we want the RSS to be small
  • but g should also be a smooth function (WHY? & HOW?)
  • smoothing spline minimize the following objective

  • the smoothing spline is a natural cubic spline with knots at the unique values of x₁, …, xₙ
    • piecewise cubic polynomial with knots at the unique values of x₁, …, xₙ
    • continuous first and second derivatives at each knot
    • linear in the region outside of the extreme knots
    • it is a shrunken version of such a natural cubic spline

Coding: Regression spline

# Putting confidence interval calcs into function for convenience.
def confidence_interval(X, y, y_hat):
    """Compute 5% confidence interval for linear regression"""
    # STATS
    # ----------------------------------    
    # Covariance of coefficient estimates
    mse = np.sum(np.square(y_hat - y)) / y.size
    cov = mse * np.linalg.inv(X.T @ X)
    # ...or alternatively this stat is provided by stats models:
    #cov = model.cov_params()
    
    # Calculate variance of f(x)
    var_f = np.diagonal((X @ cov) @ X.T)
    # Derive standard error of f(x) from variance
    se       = np.sqrt(var_f)
    conf_int = 2*se
    return conf_int

# Fit a cubic spline with seven degrees of freedom

# Use patsy to generate entire matrix of basis functions
X = pt.dmatrix('bs(age, df=7, degree=3, include_intercept=True)', wage_df)
y = np.asarray(wage_df['wage'])

# Fit linear regression (OLS) model
model = sm.OLS(y, X).fit()
y_hat = model.predict(X)
conf_int = confidence_interval(X, y, y_hat)

# PLOT
# ----------------------------------
# Setup axes
fig, ax = plt.subplots(figsize=(10,10))

# Plot datapoints
sns.scatterplot(x='age', y='wage',
                color='tab:gray',
                alpha=0.2,
                ax=ax,
                data=pd.concat([wage_df['age'], wage_df['wage']], axis=1));

# Plot estimated f(x)
sns.lineplot(x=wage_df['age'], y=y_hat, ax=ax, color='blue');

# Plot confidence intervals
sns.lineplot(x=wage_df['age'], y=y_hat+conf_int, color='blue');
sns.lineplot(x=wage_df['age'], y=y_hat-conf_int, color='blue');
# dash the confidence intervals
ax.lines[1].set_linestyle("--")
ax.lines[2].set_linestyle("--")

Coding: Natural spline

# Fit a natural spline with seven degrees of freedom

# Use patsy to generate entire matrix of basis functions
X = pt.dmatrix('cr(age, df=7)', wage_df)     
# REVISION NOTE: Something funky happens when df=6

y = np.asarray(wage_df['wage'])

# Fit linear regression (OLS) model
model = sm.OLS(y, X).fit()
y_hat = model.predict(X)
conf_int = confidence_interval(X, y, y_hat)

# PLOT
# ----------------------------------
# Setup axes
fig, ax = plt.subplots(figsize=(10,10))

# Plot datapoints
sns.scatterplot(x='age', y='wage',
                color='tab:gray',
                alpha=0.2,
                ax=ax,
                data=pd.concat([wage_df['age'], wage_df['wage']], axis=1));

# Plot estimated f(x)
sns.lineplot(x=wage_df['age'], y=y_hat, ax=ax, color='blue');

# Plot confidence intervals
sns.lineplot(x=wage_df['age'], y=y_hat+conf_int, color='blue');
sns.lineplot(x=wage_df['age'], y=y_hat-conf_int, color='blue');
# dash the confidence intervals
ax.lines[1].set_linestyle("--")
ax.lines[2].set_linestyle("--")

Local regression


computing the fit at a target point using only the nearby training observations

Algorithm

Algorithm: Local Regression at x₀
1. Gather the fraction s = k/n of training points whose xᵢ are closest to x₀.
2. Assign a weight K_{i0} = K(xᵢ, x₀) to each point in this neighborhood, so that the point furthest from x₀ has weight zero, and the closest has the highest weight. All but these k nearest neighbors get weight zero.
3. Fit a weighted least squares regression of the yᵢ on the xᵢ using the aforementioned weights, by finding the intercept and slope that minimize the weighted residual sum of squares.
4. The fitted value at x₀ is given by the fitted line evaluated at x₀.

Local linear regression

Generalized additive models

GAMs for Regression Problems

  • the multiple linear regression model

  • GAM

An Example (using natural spline)

  • year and age are quantitative variables
  • education is qualitative with five levels: <HS, HS, <Coll, Coll, >Coll
  • natural splines

An Example (using smoothing spline)

Pros and Cons of GAMs

  • Pros
    • GAMs automatically model non-linear relationships that standard linear regression will miss.

    • The non-linear fits can potentially make more accurate predictions for the response Y.

    • We can examine the effect of each Xⱼ on Y individually while holding all of the other variables fixed.

    • The smoothness of the function fⱼ for the variable Xⱼ can be summarized via degrees of freedom.

  • Cons: the model is restricted to be additive.

Coding: GAM


# Use patsy to generate entire matrix of basis functions
X = pt.dmatrix('cr(year, df=4)+cr(age, df=5) + education', wage_df)
y = np.asarray(wage_df['wage'])

# Fit linear regression (OLS) model
model = sm.OLS(y, X).fit()
y_hat = model.predict(X)
conf_int = confidence_interval(X, y, y_hat)
# Plot estimated f(year)
sns.lineplot(x=wage_df['year'], y=y_hat)
# Plot estimated f(age)
sns.lineplot(x=wage_df['age'], y=y_hat);
  • Comparing GAM configurations with ANOVA
# Model 1
X = pt.dmatrix('cr(age, df=5) + education', wage_df)
y = np.asarray(wage_df['wage'])
model1 = sm.OLS(y, X).fit(disp=0)
# Model 2
X = pt.dmatrix('year+cr(age, df=5) + education', wage_df)
y = np.asarray(wage_df['wage'])
model2 = sm.OLS(y, X).fit(disp=0)
# Model 3
X = pt.dmatrix('cr(year, df=4)+cr(age, df=5) + education', wage_df)
y = np.asarray(wage_df['wage'])
model3 = sm.OLS(y, X).fit(disp=0)

# Compare models with ANOVA
display(sm.stats.anova_lm(model1, model2, model3))
df_resid ssr df_diff ss_diff F Pr(>F)
0 2994.0 3.750437e+06 0.0 NaN NaN
1 2993.0 3.732809e+06 1.0 17627.473318 14.129318
2 2991.0 3.731516e+06 2.0 1293.696286 0.518482
display(model3.summary())
  • Local regression GAM

x = np.asarray(wage_df['age'])
y = np.asarray(wage_df['wage'])
# Create lowess feature for age
wage_df['age_lowess'] = sm.nonparametric.lowess(
    y, x, frac=.7, return_sorted=False)

# Fit linear regression (OLS) model
X = pt.dmatrix('cr(year, df=4) + age_lowess + education', wage_df)
y = np.asarray(wage_df['wage'])
model = sm.OLS(y, X).fit()
model.summary()

Robust linear regression

  • Gaussian noise assumption

    • OLS estimator = MLE
    • Poor performance in the presence of outliers
  • Robust regression: replace the Gaussian distribution for the response
    variable with a distribution that has heavy tails

Likelihood Prior Posterior Name
Gaussian Uniform Point Least squares
Student Uniform Point Robust regression
Laplace Uniform Point Robust regression
Gaussian Gaussian Point Ridge
Gaussian Laplace Point Lasso
Gaussian Gauss-Gamma Gauss-Gamma Bayesian linear regression

Laplace likelihood

Student-t likelihood

  • We can fit this model using SGD or EM

Huber loss

  • An alternative to minimizing the NLL using a Laplace or Student likelihood is to use the Huber loss:

  • It is equivalent to the ℓ₂ loss for errors smaller than δ, and to the ℓ₁ loss for larger errors.

  • Huber loss function is everywhere differentiable.

  • optimizing the Huber loss is much faster than using the Laplace likelihood

  • δ controls the degree of robustness

    • set by hand
    • or by cross-validation (a small sketch follows below)

03 分类算法(Classification)

  • 逻辑回归
  • 分类生成模型(LDA、QDA、朴素贝叶斯)
  • 广义可加模型(GAM)
  • 支持向量机

Examples of classification problems

  • A person arrives at the emergency room with a set of symptoms that could possibly be attributed to one of three medical conditions. Which of the three conditions does the individual have?
  • An online banking service must be able to determine whether or not a transaction being performed on the site is fraudulent, on the basis of the user’s IP address, past transaction history, and so forth.
  • On the basis of DNA sequence data for a number of patients with and without a given disease, a biologist would like to figure out which DNA mutations are deleterious (disease-causing) and which are not.

Regression is not appropriate for classification tasks

  • regression methods cannot accommodate a qualitative response with more than two classes

  • regression methods cannot provide meaningful estimates of the conditional probabilities of the response classes

The default dataset

default student balance income
1 No No 729.5264952 44361.62507
2 No Yes 817.1804066 12106.1347
3 No No 1073.549164 31767.13895
4 No No 529.2506047 35704.49394
5 No No 785.6558829 38463.49588

Logistic regression

The logistic model

  • The probability of default

  • Linear regression

  • logistic function

Estimation and Predictions

  • the likelihood function:

  • prediction: for an individual with a balance of is

  • prediction: for student status

Coefficient Std. error z-statistic p-value
Intercept −10.6513 0.3612 −29.5 <0.0001
balance 0.0055 0.0002 24.9 <0.0001

Coefficient Std. error z-statistic p-value
Intercept −3.5041 0.0707 −49.55 <0.0001
student[Yes] 0.4049 0.1150 3.52 0.0004

Multiple logistic regression

  • the model of odds

  • the model of

Coefficient Std. error z-statistic p-value
Intercept −10.8690 0.4923 −22.08 <0.0001
balance 0.0057 0.0002 24.74 <0.0001
income 0.0030 0.0082 0.37 0.7115
student[Yes] −0.6468 0.2362 −2.74 0.0062
  • prediction
  • A student with a credit card balance of $1,500 and an income of $40,000 has an estimated probability of default of

  • A non-student with the same balance and income has an estimated probability of default of

Multinomial logistic regression

  • classify a response variable that has more than two classes

  • the model

    • for k = K: the baseline class

  • for k = 1, …, K − 1

  • the log odds of class k versus the baseline class K, for k = 1, …, K − 1 (a multinomial sketch follows below)

Coding: Logistic Regression


  • imports
import numpy as np
import pandas as pd
from matplotlib.pyplot import subplots
import statsmodels.api as sm
from ISLP import load_data
from ISLP.models import (ModelSpec as MS, summarize)
from ISLP import confusion_table
from ISLP.models import contrast
from sklearn.discriminant_analysis import \
     (LinearDiscriminantAnalysis as LDA,
      QuadraticDiscriminantAnalysis as QDA)
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
  • load data
Smarket = load_data('Smarket')
Year Lag1 Lag2 Lag3 Lag4 Lag5 Volume Today Direction
0 2001 0.381 -0.192 -2.624 -1.055 5.010 1.19130 0.959 Up
1 2001 0.959 0.381 -0.192 -2.624 -1.055 1.29650 1.032 Up
2 2001 1.032 0.959 0.381 -0.192 -2.624 1.41120 -0.623 Down
3 2001 -0.623 1.032 0.959 0.381 -0.192 1.27600 0.614 Up
... ... ... ... ... ... ... ... ... ...
1249 2005 -0.298 0.130 -0.955 0.043 0.422 1.38254 -0.489 Down

  • fitting
allvars = Smarket.columns.drop(['Today', 'Direction', 'Year'])
design = MS(allvars)
X = design.fit_transform(Smarket)
y = Smarket.Direction == 'Up'
glm = sm.GLM(y,
             X,
             family=sm.families.Binomial())
results = glm.fit()
summarize(results)
coef std err z P>|z|
intercept -0.1260 0.241 -0.523 0.601
Lag1 -0.0731 0.050 -1.457 0.145
Lag2 -0.0423 0.050 -0.845 0.398
Lag3 0.0111 0.050 0.222 0.824
Lag4 0.0094 0.050 0.187 0.851
Lag5 0.0103 0.050 0.208 0.835
Volume 0.1354 0.158 0.855 0.392
  • predicting
probs = results.predict()
labels = np.array(['Down']*1250)
labels[probs>0.5] = 'Up'
confusion_table(labels, Smarket.Direction)
Truth Down Up
Predicted
Down 145 141
Up 457 507

Generative models for classification

  • The big idea of generative models for classification
    • model the distribution of the predictors separately in each of the response classes
    • use Bayes' theorem to flip these around into estimates for
  • Why do we need the generative models for classification
    • When there is substantial separation between the two classes, the parameter estimates for the logistic regression model are surprisingly unstable
    • If the distribution of the predictors is approximately normal in each of the classes and the sample size is small, then the approaches in this section may be more accurate than logistic regression
    • The methods in this section can be naturally extended to the case of more than two response classes
  • Suppose the qualitative response variable Y can take on K possible distinct and unordered values.
  • Let π_k represent the overall or prior probability that a randomly chosen observation comes from the kth class.
  • Let f_k(X) denote the density function of X for an observation that comes from the kth class.
  • the posterior probability Pr(Y = k | X = x) (by Bayes' theorem)

  • estimation
    • instead of directly computing the posterior probability, we can simply plug in estimates of π_k and f_k(X)
    • π_k: simply compute the fraction of the training observations that belong to the kth class
    • f_k(X): estimating the density is much more challenging

Linear discriminant analysis (LDA) for p = 1

  • assumptions
    • only one predictor: p = 1
    • f_k(x) is normal / Gaussian

  • the posterior

  • prediction
    • classify an observation to the class for which the posterior probability is greatest
    • an equivalent rule: assign the observation to the class for which the discriminant function is maximized

An Example

  • and
  • assigns an observation to class 1 if , and to class 2 otherwise
  • The Bayes decision boundary

LDA for p > 1

  • The multivariate Gaussian distribution
  • jointed density


  • LHS: uncorrelated predictors with equal variances

  • RHS: correlated predictors / unequal variances
  • the observations in the th class are drawn from a multivariate Gaussian distribution
    • is a class-specific mean vector
    • is a covariance matrix that is common to all classes
  • the Bayes classifier assigns an observation to the class for which maximizes

  • The Bayes decision boundary solves

Coding: LDA


  • fitting
lda = LDA(store_covariance=True)
X_train, X_test = [M.drop(columns=['intercept'])
                   for M in [X_train, X_test]]
lda.fit(X_train, L_train)
  • prediction
lda_pred = lda.predict(X_test)
confusion_table(lda_pred, L_test)
Truth Down Up
Predicted
Down 35 35
Up 76 106

Quadratic discriminant analysis (QDA)

  • each class has its own covariance matrix

  • the Bayes classifier assigns an observation to the class for which maximizes

Coding: QDA


  • fitting
qda = QDA(store_covariance=True)
qda.fit(X_train, L_train)
  • prediction
qda_pred = qda.predict(X_test)
confusion_table(qda_pred, L_test)
Truth Down Up
Predicted
Down 30 20
Up 81 121

Naive Bayes

  • Assumption: Within the kth class, the p predictors are independent

  • the posterior probability

Estimating the one-dimensional density function using training data

  • If X_j is quantitative, then we can assume that X_j | Y = k follows a univariate normal distribution
  • If X_j is quantitative, we can instead use a non-parametric estimate of the density
    • making a histogram for the observations of the jth predictor within each class
    • kernel density estimator
  • If X_j is qualitative, count the proportion of training observations for the jth predictor corresponding to each class

Coding: Naive Bayes


  • fitting
NB = GaussianNB()
NB.fit(X_train, L_train)
  • prediction
nb_labels = NB.predict(X_test)
confusion_table(nb_labels, L_test)
Truth Down Up
Predicted
Down 29 20
Up 82 121

Generalized additive models

Model the log odds as a generalized additive model:

Support vector machine

  • developed in the 1990s
  • perform well in a variety of settings
  • often considered one of the best "out of the box" classifiers.

Maximal Margin Classifier

Hyperplane

  • In a p-dimensional space: a flat affine subspace of dimension p − 1

    • In 2-d: a line
    • In 3-d: a plane
    • In p-d (p > 3): hard to visualize
  • The mathematical definition (for the p-d setting)

  • the 2-d example:

Classification Using a Separating Hyperplane

for a separating hyperplane

and

Equivalently, a separating hyperplane has the property that

for all i.

Separating Hyperplanes

  • we classify the test observation x* based on the sign of f(x*)

    • f(x*) > 0: class 1
    • f(x*) < 0: class -1
  • the magnitude of f(x*) measures how far x* lies from the hyperplane, i.e. how confident we are about the class assignment

The Maximal Margin Classifier

  • margin
  • maximal margin hyperplane (a.k.a. the optimal separating hyperplane)
  • Construction of the Maximal Margin Classifier

The Non-separable Case & Noisy Data

  • sometimes the data are non-separable
  • sometimes the maximal margin classifier is very sensitive to noisy data

Support Vector Classifiers

  • ε_i = 0: the i-th obs is on the correct side of the margin
  • ε_i > 0: the i-th obs is on the wrong side of the margin
  • ε_i > 1: the i-th obs is on the wrong side of the hyperplane

Parameter C

  • C is the budget for the amount that the margin can be violated by the observations
    • C = 0: no budget for violations to the margin
    • C > 0: no more than C observations can be on the wrong side of the hyperplane
    • as C increases: the margin will widen
  • C controls the bias-variance trade-off
    • small C: low bias, high variance
    • large C: high bias, low variance
    • C is selected via CV
  • An observation: only observations that either lie on the margin or that violate the margin (the support vectors) affect the hyperplane

Coding: SVC

  • imports
import numpy as np
from matplotlib.pyplot import subplots, cm
import sklearn.model_selection as skm
from ISLP import load_data, confusion_table
from sklearn.svm import SVC
from ISLP.svm import plot as plot_svm
from sklearn.metrics import RocCurveDisplay
roc_curve = RocCurveDisplay.from_estimator # shorthand
  • create data
rng = np.random.default_rng(1)
X = rng.standard_normal((50, 2))
y = np.array([-1]*25+[1]*25)
X[y==1] += 1
fig, ax = subplots(figsize=(8,8))
ax.scatter(X[:,0],
           X[:,1],
           c=y,
           cmap=cm.coolwarm)
  • fitting
svm_linear = SVC(C=10, kernel='linear')
svm_linear.fit(X, y)
  • plotting
fig, ax = subplots(figsize=(8,8))
plot_svm(X,
         y,
         svm_linear,
         ax=ax)

You may try with other parameter values (e.g. ).

  • hyperparameter tuning
kfold = skm.KFold(5, 
                  random_state=0,
                  shuffle=True)
grid = skm.GridSearchCV(svm_linear,
                        {'C':[0.001,0.01,0.1,1,5,10,100]},
                        refit=True,
                        cv=kfold,
                        scoring='accuracy')
grid.fit(X, y)
grid.best_params_

{'C': 1}

  • printing results
grid.cv_results_[('mean_test_score')]

array([0.46, 0.46, 0.72, 0.74, 0.74, 0.74, 0.74])

  • generating testing sample
X_test = rng.standard_normal((20, 2))
y_test = np.array([-1]*10+[1]*10)
X_test[y_test==1] += 1
  • predicting
best_ = grid.best_estimator_
y_test_hat = best_.predict(X_test)
confusion_table(y_test_hat, y_test)


Truth -1 1
Predicted
-1 8 4
1 2 6

Support Vector Machines

The (linear) support vector classifier cannot handle nonlinear class boundaries.

What can we do?

Nonlinear Classifiers Utilizing Polynomial Features

  • The original features

  • The polynomial features

  • The SVM via Optimization

Kernel Functions

  • Definition:

  • κ(·, ·) is a kernel function if and only if the kernel (Gram) matrix, with entries K_ij = κ(x_i, x_j), is symmetric positive semi-definite for any set of data points.
  • Some examples of kernel functions
name function
Linear kernel
Polynomial kernel
Radial kernel
Gaussian kernel
Laplacian kernel
Sigmoid kernel

Suppose κ₁ and κ₂ are kernel functions:

  • Any non-negative linear combination of κ₁ and κ₂ is a kernel function

  • The (element-wise) product of κ₁ and κ₂ is a kernel function

  • For any function g, κ(x, z) = g(x) κ₁(x, z) g(z) is a kernel function (a numerical check follows below)

SVC vs. SVM

SVC
SVM
inner products / kernels

functional form

Coding: SVM


  • create data
X = rng.standard_normal((200, 2))
X[:100] += 2
X[100:150] -= 2
y = np.array([1]*150+[2]*50)
  • plotting
fig, ax = subplots(figsize=(8,8))
ax.scatter(X[:,0],
           X[:,1],
           c=y,
           cmap=cm.coolwarm)
  • fitting
(X_train, 
 X_test,
 y_train,
 y_test) = skm.train_test_split(X,
                                y,
                                test_size=0.5,
                                random_state=0)
svm_rbf = SVC(kernel="rbf", gamma=1, C=1)
svm_rbf.fit(X_train, y_train)
  • plotting
fig, ax = subplots(figsize=(8,8))
plot_svm(X_train,
         y_train,
         svm_rbf,
         ax=ax)

You may try with other parameter values (e.g. ).

  • hyperparameter tuning
kfold = skm.KFold(5, 
                  random_state=0,
                  shuffle=True)
grid = skm.GridSearchCV(svm_rbf,
                        {'C':[0.1,1,10,100,1000],
                         'gamma':[0.5,1,2,3,4]},
                        refit=True,
                        cv=kfold,
                        scoring='accuracy');
grid.fit(X_train, y_train)
grid.best_params_

{'C': 1, 'gamma': 0.5}

  • training with the best hyperparameters
best_svm = grid.best_estimator_
fig, ax = subplots(figsize=(8,8))
plot_svm(X_train,
         y_train,
         best_svm,
         ax=ax)
y_hat_test = best_svm.predict(X_test)
confusion_table(y_hat_test, y_test)
Truth -1 1
Predicted
-1 8 4
1 2 6

SVMs with More than Two Classes

  • One-Versus-One (OVO) Classification

    • a.k.a. all-pairs
    • constructs an SVM for each of the K(K − 1)/2 pairs of classes
    • classify a test obs using each of these SVMs
    • assign the obs to the class to which it was most frequently assigned in these pairwise classifications
  • One-Versus-All (OVA) Classification

    • a.k.a. one-versus-rest
    • fit K SVMs, one for each class (coded as "1") against the remaining classes (coded as "-1")
    • let the fitted coefficients of the kth SVM denote its parameters
    • assign the test observation to the class whose fitted decision function value is largest

Coding: SVM with Multiple Classes


  • data and plotting
rng = np.random.default_rng(123)
X = np.vstack([X, rng.standard_normal((50, 2))])
y = np.hstack([y, [0]*50])
X[y==0,1] += 2
fig, ax = subplots(figsize=(8,8))
ax.scatter(X[:,0], X[:,1], c=y, cmap=cm.coolwarm)
  • fitting
svm_rbf_3 = SVC(kernel="rbf",
                C=10,
                gamma=1,
                decision_function_shape='ovo');
svm_rbf_3.fit(X, y)
fig, ax = subplots(figsize=(8,8))
plot_svm(X,
         y,
         svm_rbf_3,
         scatter_cmap=cm.tab10,
         ax=ax)

Relationship to Logistic Regression

  • The hinge loss + penalty form of support-vector classifier optimization:

    • let
    • the optimization model

    • it is very similar to the loss used in logistic regression (the negative log-likelihood).
  • SVM vs. Logistic Regression

    • When the classes are (nearly) separable, the SVM tends to do better than logistic regression (LR); so does LDA.
    • When they are not, LR (with a ridge penalty) and the SVM give very similar results.
    • If you wish to estimate probabilities, LR is the natural choice.
    • For nonlinear boundaries, kernel SVMs are popular. Kernels can also be used with LR and LDA, but the computations are more expensive.
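A small illustrative sketch plotting the SVM hinge loss against the logistic (negative log-likelihood) loss as functions of the margin $y\,f(x)$, to visualize how similar the two are:

import numpy as np
from matplotlib.pyplot import subplots

m = np.linspace(-3, 3, 200)                 # margin y * f(x)
hinge = np.maximum(0, 1 - m)                # SVM hinge loss
logistic = np.log(1 + np.exp(-m))           # logistic regression loss

fig, ax = subplots(figsize=(6, 4))
ax.plot(m, hinge, label="hinge (SVM)")
ax.plot(m, logistic, label="logistic (LR)")
ax.set_xlabel("margin y f(x)")
ax.set_ylabel("loss")
ax.legend();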

04 决策树与集成学习

  • CART
  • 集成学习

Classification and regression trees (CART)

  • CART / Decision Tree
    • recursively partitioning the input space
    • a local model in each resulting region of input space

Model definition

  • regression trees: all inputs are real-valued.
    • The tree consists of a set of nested decision rules. At each node $i$, the feature dimension $d_i$ of the input vector $\mathbf{x}$ is compared to a threshold value $t_i$, and the input is then passed down the left or right branch, depending on whether it is below or above the threshold.
    • At the leaves of the tree, the model specifies the predicted output for any input that falls into that part of the input space.
  • An example:
    • the region of space:

  • the output for region $R_1$ (the mean response) can be estimated using

$$w_1 = \frac{\sum_{n=1}^{N} y_n\, \mathbb{I}(\mathbf{x}_n \in R_1)}{\sum_{n=1}^{N} \mathbb{I}(\mathbf{x}_n \in R_1)}$$

  • Formally, a regression tree can be defined by

$$f(\mathbf{x}; \boldsymbol{\theta}) = \sum_{j=1}^{J} w_j\, \mathbb{I}(\mathbf{x} \in R_j)$$

  • $R_j$ is the region specified by the $j$'th leaf node, $w_j$ is the predicted output for that node, and $\boldsymbol{\theta} = \{(R_j, w_j) : j = 1{:}J\}$
  • $J$ is the number of leaf nodes.
  • The regions are axis-parallel boxes:
    • e.g. $R_1 = \{\mathbf{x} : x_1 \le t_1,\ x_2 \le t_2\}$, $R_2 = \{\mathbf{x} : x_1 \le t_1,\ x_2 > t_2\}$, etc.
  • For categorical inputs, we can define the splits based on comparing feature to each of the possible values for that feature, rather than comparing to a numeric threshold.
  • For classification problems, the leaves contain a distribution over the class labels, rather than just the mean response.

Model fitting

  • minimizing the following loss:

$$\mathcal{L}(\boldsymbol{\theta}) = \sum_{n=1}^{N} \ell\big(y_n, f(\mathbf{x}_n; \boldsymbol{\theta})\big) = \sum_{j=1}^{J} \sum_{\mathbf{x}_n \in R_j} \ell(y_n, w_j)$$
  • it is not differentiable
  • finding the optimal partitioning of the data is NP-complete
  • the standard practice is to use a greedy procedure, in which we iteratively grow the tree one node at a time.
  • three popular implementations: CART, C4.5, and ID3.

The big idea of greedy algorithms

  • let $\mathcal{D}_i = \{(\mathbf{x}_n, y_n) : \mathbf{x}_n \text{ reaches node } i\}$ be the set of examples that reach node $i$.
  • If the $j$'th feature is a real-valued scalar
    • partition the data at node $i$ by comparing feature $j$ to a threshold $t$
    • the set of possible thresholds $\mathcal{T}_j$ for feature $j$ can be obtained by sorting the unique values of $\{x_{nj}\}$
    • example: $\mathcal{T}_1$ is simply the set of distinct values taken by feature 1 in $\mathcal{D}_i$. For each possible threshold $t$, we define the left and right splits $\mathcal{D}_i^{L}(j, t) = \{(\mathbf{x}_n, y_n) \in \mathcal{D}_i : x_{nj} \le t\}$ and $\mathcal{D}_i^{R}(j, t) = \{(\mathbf{x}_n, y_n) \in \mathcal{D}_i : x_{nj} > t\}$.
  • If the $j$'th feature is categorical with $K$ possible values
    • check if the feature is equal to each of those values or not.
    • this defines a set of $K$ possible binary splits: $\mathcal{D}_i^{L}(j, k) = \{(\mathbf{x}_n, y_n) \in \mathcal{D}_i : x_{nj} = k\}$ and $\mathcal{D}_i^{R}(j, k) = \{(\mathbf{x}_n, y_n) \in \mathcal{D}_i : x_{nj} \ne k\}$.
  • choose the best feature $j_i$ to split on, and the best value for that feature, $t_i$, as:

$$(j_i, t_i) = \arg\min_{j} \min_{t \in \mathcal{T}_j} \frac{|\mathcal{D}_i^{L}(j, t)|}{|\mathcal{D}_i|}\, c\big(\mathcal{D}_i^{L}(j, t)\big) + \frac{|\mathcal{D}_i^{R}(j, t)|}{|\mathcal{D}_i|}\, c\big(\mathcal{D}_i^{R}(j, t)\big)$$

  • $c(\cdot)$ is the cost function of a node (defined below).
  • For regression, we can use the mean squared error

$$c(\mathcal{D}_i) = \frac{1}{|\mathcal{D}_i|} \sum_{n \in \mathcal{D}_i} (y_n - \bar{y})^2$$

where $\bar{y}$ is the mean of the response variable for the examples reaching node $i$.

  • For classification, we first compute the empirical distribution over class labels for this node:

$$\hat{\pi}_{ic} = \frac{1}{|\mathcal{D}_i|} \sum_{n \in \mathcal{D}_i} \mathbb{I}(y_n = c)$$

Given this, we can then compute the Gini index

$$G_i = \sum_{c=1}^{C} \hat{\pi}_{ic} (1 - \hat{\pi}_{ic})$$

This is the expected error rate. To see this, note that $\hat{\pi}_{ic}$ is the probability that a random entry in the leaf belongs to class $c$, and $1 - \hat{\pi}_{ic}$ is the probability it would be misclassified.

  • Alternatively, we can define the cost as the entropy or deviance of the node:

$$H_i = -\sum_{c=1}^{C} \hat{\pi}_{ic} \log \hat{\pi}_{ic}$$

  • A node that is pure (i.e., only has examples of one class) will have 0 entropy. (A small numerical sketch of these node costs follows.)
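A small numpy sketch (toy labels, purely illustrative) computing the empirical class distribution, the Gini index, and the entropy for the examples reaching a node:

import numpy as np

y_node = np.array([0, 0, 0, 1, 1, 2])                 # labels of the examples reaching the node
classes, counts = np.unique(y_node, return_counts=True)
pi = counts / counts.sum()                            # empirical class distribution pi_c

gini = np.sum(pi * (1 - pi))                          # expected misclassification rate
entropy = -np.sum(pi * np.log(pi))                    # deviance of the node
print(pi, gini, entropy)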

Regularization

  • the danger of overfitting: If we let the tree become deep enough, it can achieve 0 error on the training set (assuming no label noise), by partitioning the input space into sufficiently small regions where the output is constant.
  • two main approaches against overfitting:
    • The first is to stop the tree growing process according to some heuristic, such as having too few examples at a node, or reaching a maximum depth.
    • The second approach is to grow the tree to its maximum depth, where no more splits are possible, and then to prune it back, by merging split subtrees back into their parent.
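A hedged sklearn sketch of the two regularization strategies on a toy dataset: stopping early via max_depth / min_samples_leaf, and growing a full tree and then pruning it with cost-complexity pruning (ccp_alpha); all parameter values are illustrative.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# (1) stop the tree growing process early
shallow = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5).fit(X_tr, y_tr)

# (2) grow fully, then prune back via cost-complexity pruning
pruned = DecisionTreeClassifier(ccp_alpha=0.01).fit(X_tr, y_tr)

print(shallow.score(X_te, y_te), pruned.score(X_te, y_te))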

Pros and cons

  • advantages:
    • They are easy to interpret.
    • They can easily handle mixed discrete and continuous inputs.
    • They are insensitive to monotone transformations of the inputs (because the split points are based on ranking the data points), so there is no need to standardize the data.
    • They perform automatic variable selection.
    • They are relatively robust to outliers.
    • They are fast to fit, and scale well to large data sets.
    • They can handle missing input features.
  • disadvantages:
    • they do not predict very accurately compared to other kinds of model (the greedy nature of the tree construction algorithm).
    • trees are unstable: small changes to the input data can have large effects on the structure of the tree, due to the hierarchical nature of the tree-growing process, causing errors at the top to affect the rest of the tree.

Ensemble learning

  • Ensemble learning: reduce variance by averaging multiple (regression) models (or taking a majority vote for classifiers)

  • is the 'th base model.
  • The ensemble will have similar bias to the base models, but lower variance, generally resulting in improved overall performance
  • For classifiers: take a majority vote of the outputs. (This is sometimes called a committee method.)
    • suppose each base model is a binary classifier with an accuracy of , and suppose class 1 is the correct class.
    • Let be the prediction for the 'th model
    • Let be the number of votes for class 1
    • We define the final predictor to be the majority vote, i.e., class 1 if and class 0 otherwise. The probability that the ensemble will pick class 1 is

Stacking

  • Stacking (stacked generalization): combine the base models by using a weighted combination of their predictions,

$$f(y \mid \mathbf{x}) = \sum_{m=1}^{M} w_m\, f_m(y \mid \mathbf{x})$$

  • the combination weights used by stacking need to be trained on a separate (held-out) dataset; otherwise they would put all their mass on the best-performing base model.

Ensembling is not Bayes model averaging

  • An ensemble considers a larger hypothesis class of the form

  • the BMA uses

  • The key difference
    • in the case of BMA, the weights sum to one
    • in the limit of infinite data, only a single model will be chosen (namely the MAP model). By contrast,
    • the ensemble weights are arbitrary, and don't collapse in this way to a single model.

Bagging

  • bagging ("bootstrap aggregating")
    • This is a simple form of ensemble learning in which we fit different base models to different randomly sampled versions of the data
    • this encourages the different models to make diverse predictions
    • The datasets are sampled with replacement (a technique known as bootstrap sampling)
    • a given example may appear multiple times, until we have a total of $N$ examples per model (where $N$ is the number of original data points).
  • The disadvantage of bootstrap

    • each base model only sees, on average, about $63\%$ ($\approx 1 - e^{-1}$) of the unique input examples.

    • The remaining $\approx 37\%$ of the training instances that are not used by a given base model are called out-of-bag (OOB) instances.

    • We can use the predicted performance of the base model on these oob instances as an estimate of test set performance.

    • This provides a useful alternative to cross validation.

  • The main advantage of bootstrap is that it prevents the ensemble from relying too much on any individual training example, which enhances robustness and generalization.

  • Bagging does not always improve performance. In particular, it relies on the base models being unstable estimators (such as decision trees), so that omitting some of the data significantly changes the resulting model fit. A short OOB example is sketched below.
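A hedged sklearn sketch of bagging decision trees with bootstrap sampling, using the out-of-bag (OOB) score as an estimate of test performance (toy data, illustrative parameters):

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

bag = BaggingClassifier(DecisionTreeClassifier(),     # unstable base learner
                        n_estimators=100,
                        bootstrap=True,               # sample with replacement
                        oob_score=True,               # evaluate on out-of-bag instances
                        random_state=0)
bag.fit(X, y)
print(bag.oob_score_)                                 # OOB estimate of generalization accuracy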

Random forests

Random forests: learning trees based on a randomly chosen subset of input variables (at each node of the tree), as well as a randomly chosen subset of data cases.

Empirically, random forests often work much better than bagged decision trees when many input features are irrelevant, because the random feature selection decorrelates the trees.
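A hedged sklearn sketch contrasting a random forest with plain bagging: the key change is that each split considers only a random subset of the features (max_features); the data and parameter values are illustrative.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20,
                           n_informative=5, random_state=0)   # many irrelevant features

rf = RandomForestClassifier(n_estimators=200,
                            max_features="sqrt",     # random subset of features at each split
                            oob_score=True,
                            random_state=0)
rf.fit(X, y)
print(rf.oob_score_)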

Boosting

  • Ensembles of trees, whether fit by bagging or the random forest algorithm, corresponding to a model of the form

  • is the 'th tree
  • is the corresponding weight, often set to .
  • additive model: generalize this by allowing the functions to be general function approximators, such as neural networks, not just trees.
  • We can think of this as a linear model with adaptive basis functions. The goal, as usual, is to minimize the empirical loss (with an optional regularizer):

  • Boosting is an algorithm for sequentially fitting additive models where each is a binary classifier that returns .

    • first fit on the original data
    • weight the data samples by the errors made by , so misclassified examples get more weight
    • fit to this weighted data set
    • keep repeating this process until we have fit the desired number of components.
  • as long as each has an accuracy that is better than chance (even on the weighted dataset), then the final ensemble of classifiers will have higher accuracy than any given component.

    • if $F_m$ is a weak learner (so its accuracy is only slightly better than chance, i.e. $50\%$ for binary classification)
    • we can boost its performance using the above procedure so that the final becomes a strong learner.
  • boosting, bagging and RF
    • boosting reduces the bias of the strong learner, by fitting trees that depend on each other
    • bagging and RF reduce the variance by fitting independent trees
    • In many cases, boosting can work better

Forward stagewise additive modeling

forward stagewise additive modeling: sequentially optimize the objective for general (differentiable) loss functions

We then set

Quadratic loss and least squares boosting

  • squared error loss
  • the 'th term in the objective at step becomes

  • is the residual of the current model on the 'th observation.
  • We can minimize the above objective by simply setting $\beta_m = 1$ and fitting $F_m$ to the residual errors. This is called least squares boosting.

Exponential loss and AdaBoost

  • binary classification, i.e., predicting .
  • assuming the weak learner computes

so returns half the log odds.

  • the negative log likelihood is given by

  • we can also use other loss functions. In this section, we consider the exponential loss

  • this is a smooth upper bound on the 0-1 loss.
  • In the population setting (with infinite sample size), the optimal solution to the exponential loss is the same as for loss. To see this, we can just set the derivative of the expected loss (for each ) to zero:

  • the exponential loss is easier to optimize in the boosting setting.

discrete AdaBoost

  • At step we have to minimize

where is a weight applied to datacase , and . We can rewrite this objective as follows:

  • the optimal function to add is

This can be found by applying the weak learner to a weighted version of the dataset, with weights .

  • Substituting the optimal $F_m$ into the objective and solving for $\beta_m$, we find

where

  • Therefore overall update becomes

  • the weights for the next iteration, as follows:

  • If , then , and if , then . Hence , so the update becomes

  • Since the is constant across all examples, it can be dropped. If we then define , the update becomes

Thus we see that we exponentially increase the weights of misclassified examples. The resulting algorithm is known as AdaBoost; a short sklearn sketch follows.
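A hedged sklearn sketch of AdaBoost with depth-1 trees ("stumps") as weak learners; the library handles the sample reweighting described above (toy data, illustrative parameters):

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

ada = AdaBoostClassifier(n_estimators=200, random_state=0)   # default weak learner: depth-1 tree
ada.fit(X_tr, y_tr)
print(ada.score(X_te, y_te))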

LogitBoost

  • exponential loss puts a lot of weight on misclassified examples
  • This makes the method very sensitive to outliers (mislabeled examples).
  • In addition, $e^{-\tilde{y} f}$ is not the logarithm of any pmf for binary variables $\tilde{y} \in \{-1, +1\}$; consequently we cannot recover probability estimates from $f(\mathbf{x})$.
  • A natural alternative is to use the log loss (logistic loss), which only penalizes mistakes linearly.
  • Furthermore, it means that we will be able to extract probabilities from the final learned function, using

$$p(\tilde{y} = 1 \mid \mathbf{x}) = \frac{e^{f(\mathbf{x})}}{e^{-f(\mathbf{x})} + e^{f(\mathbf{x})}} = \frac{1}{1 + e^{-2 f(\mathbf{x})}}$$
  • The goal is to minimze the expected log-loss, given by

Gradient boosting

  • to derive a generic version, known as gradient boosting. To explain this, imagine solving by performing gradient descent in the space of functions. Since functions are infinite dimensional objects, we will represent them by their values on the training set, . At step , let be the gradient of evaluated at :

  • make the update

where is the step length, chosen by

  • fitting a weak learner to approximate the negative gradient signal. That is, we use this update

Gradient tree boosting

  • In practice, gradient boosting nearly always assumes that the weak learner is a regression tree, which is a model of the form

  • To use this in gradient boosting, we first find good regions $R_{jm}$ for tree $m$ using standard regression tree learning on the residuals; we then (re)solve for the weight of each leaf by solving

$$\hat{w}_{jm} = \arg\min_{w} \sum_{\mathbf{x}_i \in R_{jm}} \ell\big(y_i,\, f_{m-1}(\mathbf{x}_i) + w\big)$$
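A minimal numpy/sklearn sketch of the squared-loss special case (least squares boosting with regression trees): each new tree is fit to the residuals of the current model and added with a small shrinkage factor. Data, depth, and learning rate are illustrative assumptions, not the book's algorithm verbatim.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + 0.3 * rng.standard_normal(300)

f = np.zeros_like(y)                     # current additive model, starts at 0
trees, lr = [], 0.1                      # lr = shrinkage (learning rate)
for m in range(100):
    r = y - f                            # residuals of the current model
    tree = DecisionTreeRegressor(max_depth=2).fit(X, r)
    trees.append(tree)
    f += lr * tree.predict(X)            # add the new weak learner

print(np.mean((y - f) ** 2))             # training MSE decreases as trees are added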

XGBoost

  • XGBoost (https://github.com/dmlc/xgboost), which stands for "extreme gradient boosting", is a very efficient and widely used implementation of gradient boosted trees:
    • it adds a regularizer on the tree complexity
    • it uses a second order approximation of the loss instead of just a linear approximation
    • it samples features at internal nodes (as in random forests)
    • it uses various computer science methods (such as handling out-of-core computation for large datasets) to ensure scalability.
  • In more detail, XGBoost optimizes the following regularized objective

  • the regularizer

$$\Omega(f) = \gamma J + \frac{1}{2} \lambda \sum_{j=1}^{J} w_j^2$$

where $J$ is the number of leaves, $w_j$ are the leaf weights, and $\gamma$ and $\lambda$ are regularization coefficients. (A short usage sketch follows.)
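A hedged sketch using the xgboost package (assumed installed); reg_lambda and gamma correspond to the λ and γ terms in the regularizer above, and all values are illustrative:

from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor          # pip install xgboost

X, y = make_regression(n_samples=2000, n_features=20, noise=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

xgb = XGBRegressor(n_estimators=300,
                   max_depth=4,
                   learning_rate=0.1,
                   reg_lambda=1.0,        # L2 penalty on leaf weights (lambda)
                   gamma=0.1,             # penalty per additional leaf (gamma)
                   subsample=0.8,
                   colsample_bytree=0.8)  # feature subsampling, as in random forests
xgb.fit(X_tr, y_tr)
print(xgb.score(X_te, y_te))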

05 深度学习与大模型

History of ANN

source: beamlab.org

Introduction

So far, we have learned:

  • linear input-output mapping

    • logistic regression:
      • multiclass case:
    • linear regression:
    • generalized linear models: Poisson dist. etc.
  • models/algorithms linear in parameters

    • feature transformation, by replacing with
    • polynomial transform (in ):

  • deep neural networks or DNNs
    • to endow the feature extractor with its own parameters,

  • and
  • repeat this process recursively, to create more and more complex functions

  • where is the function at layer

  • The term "DNN" actually encompasses a larger family of models, in which we compose differentiable functions into any kind of DAG (directed acyclic graph, 有向无环图), mapping input to output.

    • feedforward neural network (FFNN) or multilayer perceptron (MLP): the DAG is a chain.
    • A MLP is for "structured data" or "tabular data":
    • each column (feature) has a specific meaning
  • DNNs for "unstructured data"
    • unstructured data: images, text
      • the input data is variable sized
      • each individual element (e.g., pixel or word) is often meaningless on its own
    • DNNs
      • convolutional neural networks (CNNs) → images
      • recurrent neural networks (RNNs) and transformers → sequences
      • graph neural networks (GNNs) → graphs

前馈神经网络(Feedforward Neural Networks, FNNs)

多层感知机(Multilayer perceptrons, MLPs)

  • perceptron (感知机)
    • is a deterministic version of logistic regression
    • the functional form:

$$f(\mathbf{x}; \boldsymbol{\theta}) = \mathbb{I}(\mathbf{w}^\top \mathbf{x} + b \ge 0) = H(\mathbf{w}^\top \mathbf{x} + b)$$

    • $H(\cdot)$: the Heaviside step function, also known as a linear threshold function
    • perceptrons are very limited in what they can represent due to their linear decision boundaries

The XOR problem

  • to learn a function that computes the exclusive OR (异-或逻辑) of its two binary inputs
  • The truth table (真值表) for the XOR function $y = x_1 \oplus x_2$:

    $x_1$  $x_2$  $y$
      0      0    0
      0      1    1
      1      0    1
      1      1    0
  • It is clear that the data is not linearly separable, so a perceptron cannot represent this mapping.

  • this problem can be overcome by stacking multiple perceptrons on top of each other: multilayer perceptron (MLP)

  • first hidden unit (AND operation) computes $h_1 = x_1 \wedge x_2$:

    • functional form: $h_1 = H(\mathbf{w}_1^\top \mathbf{x} + b_1)$, e.g. with $\mathbf{w}_1 = [1, 1]$ and $b_1 = -1.5$

    • $\mathbf{w}_1$ are the weights
    • $b_1$ is the bias (偏置)
    • $h_1$ will fire iff $x_1$ and $x_2$ are both on
  • the second hidden unit (OR operation) computes $h_2 = x_1 \vee x_2$, e.g. $h_2 = H(x_1 + x_2 - 0.5)$

  • the third unit computes the output

    • $y = \bar{h}_1 \wedge h_2$, where $\bar{h} = \neg h$ is the NOT (logical negation) operation
    • the value can be computed as $y = H(-h_1 + h_2 - 0.5)$

  • This is equivalent to the XOR function $y = x_1 \oplus x_2$

An MLP can represent any logical function. However, we obviously want to avoid having to specify the weights and biases by hand; a tiny numerical check of the construction above is sketched below, and the rest of this section discusses how to learn these parameters from data.
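A tiny Python sketch of the hand-constructed XOR MLP, using the Heaviside step function; the specific weights and biases are one valid choice (an assumption), not the only one:

def H(a):
    # Heaviside step function
    return 1 if a >= 0 else 0

def xor_mlp(x1, x2):
    h1 = H(x1 + x2 - 1.5)        # AND gate: weights (1, 1), bias -1.5
    h2 = H(x1 + x2 - 0.5)        # OR gate:  weights (1, 1), bias -0.5
    return H(-h1 + h2 - 0.5)     # (NOT h1) AND h2  ==  XOR

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, xor_mlp(x1, x2))   # reproduces the truth table above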

Differentiable MLPs

  • MLPs are difficult to train
    • the heaviside function is non-differentiable
  • A natural solution: replace the Heaviside function with a differentiable one (i.e., an activation function $\varphi$)
  • the big idea:
    • define the hidden units at each layer to be a linear transformation of the hidden units at the previous layer passed elementwise through this activation function
      • functional form

      • in scalar form,

    • pre-activations: The quantity that is passed to the activation function

      • .
  • backpropagation (反向传播算法):
    • compose of these functions together
    • compute the gradient of the output wrt the parameters in each layer using the chain rule
    • pass the gradient to an optimizer to minimize some training objective

Activation functions (激活函数)


linear activation function

  • $\varphi(a) = a$: using the identity activation at every layer makes the whole model reduce to a regular linear model

nonlinear activation functions.

  • sigmoid (logistic) function
    • a smooth approximation to the Heaviside function used in a perceptron:

$$\sigma(a) = \frac{1}{1 + e^{-a}}$$
    • it saturates(饱和) at 1 for large positive inputs, and at 0 for large negative inputs
    • 因为Logistic函数的性质,使得装备了Logistic激活函数的神经元具有以下两点性质:
      • 其输出直接可以看作是概率分布,使得神经网络可以更好地和统计学习模型进行结合。
      • 其可以看作是一个软性门(Soft Gate),用来控制其它神经元输出信息的数量。
  • the tanh function(双曲正切函数)

$$\tanh(a) = \frac{e^{a} - e^{-a}}{e^{a} + e^{-a}} = 2\sigma(2a) - 1$$

    • Tanh 函数可以看作是放大并平移的 Logistic 函数,其值域是 $(-1, 1)$

  • has a similar shape
  • saturates at and
  • Tanh 函数的输出是零中心化的(Zero-Centered),而 Logistic 函数的输出恒大于 0。非零中心化的输出会使得其后一层的神经元的输入发生偏置偏移(Bias Shift),并进一步使得梯度下降的收敛速度变慢。
  • vanishing gradient problem
    • the gradient of the output wrt the input will be close to zero
    • any gradient signal from higher layers will not be able to propagate back to earlier layers
    • it makes it hard to train the model using gradient descent
  • rectified linear unit(修正线性单元)or ReLU: $\mathrm{ReLU}(a) = \max(a, 0)$
    • a non-saturating activation function
    • makes it possible to train very deep models
    • it "turns off" negative inputs and passes positive inputs unchanged (see the numpy sketch after this list)

Example models

MLPs can be used to perform classification and regression for many kinds of data. We give some examples below.

Try it for yourself via: https://playground.tensorflow.org

MLP for classifying data into 2 categories

an MLP with two hidden layers applied to a 2d input vector

  • the model:

  • is the final logit score, which is converted to a probability via the sigmoid (logistic) function.
  • layer 2 is computed by taking a nonlinear combination of the 4 hidden units in layer 1 , using
  • layer 1 is computed by taking a nonlinear combination of the 2 input units, using
  • By adjusting the parameters, , to minimize the negative log likelihood, we can fit the training data very well

MLP for image classification

  • "flatten" the 2 d input into 1 d vector: dimensional
  • NN structure: use 2 hidden layers with 128 units each, followed by a final 10 way softmax layer
  • training results
    • train for just two "epochs" (passes over the dataset)
    • test set accuracy of
    • the errors seem sensible, e.g., 9 is mistaken as a 3
    • Training for more epochs can further improve test accuracy

MLP for text classification

  • convert the variable-length sequence of words into a fixed dimensional vector
    • each is a one-hot vector of length
    • is the vocabulary size
  • the method
    • treat the input as an unordered bag of words
    • the first layer of the model is a embedding matrix , which converts each sparse -dimensional vector to a dense -dimensional embedding,
    • convert this set of -dimensional embeddings into a fixed-sized vector using global average pooling,
  • example:
    • a single hidden layer
    • a logistic output (for binary classification), we get

  • NN setting & the training result
    • vocabulary size:
    • embedding size of
    • hidden layer of size 16
    • we get on the validation set.

MLP for heteroskedastic regression

  • a model for heteroskedastic nonlinear regression
  • outputs:
    • .
  • The two heads:
    • the head, we use a linear activation,
    • the head, we use a softplus activation, .
  • linear heads and a nonlinear backbone, the overall model is given by

  • stochastic volatility model
    • properties
      • the mean grows linearly over time
      • seasonal oscillations
      • the variance increases quadratically
    • applications
      • financial data
      • global temperature of the earth

The importance of depth

  • an MLP with one hidden layer is a universal function approximator
  • deep networks work better than shallow ones
  • the benefit of learning via a compositional or hierarchical way
  • Example:
    • classify DNA strings
    • the positive class is associated with the regular expression AA??CGCG??AA
    • it will be easier to learn if the model first learns to detect the AA and CG “motifs” using the hidden units in layer 1
    • then uses these features to define a simple linear classifier in layer 2

The "deep learning revolution"

  • some successful stories about DNNs
    • automatic speech recognition (ASR)
    • ImageNet image classification benchmark: reducing the error rate from 26% to 16% in a single year
  • The “explosion” in the usage of DNNs
    • the availability of cheap GPUs (graphics processing units)
    • the growth in large labeled datasets
  • high quality open-source software libraries for DNNs

Connections with biology

  • McCulloch-Pitts model of the neuron (1943): , where

    • the inputs
    • the strength of the incoming connections
    • weighted (dendrites树突) sum of the inputs
    • threshold (action potential动作电位)
  • We can combine multiple such neurons together to make an artificial neural networks, ANNs

  • ANNs differs from biological brains in many ways, including the following:

  • Most ANNs use backpropagation to modify the strength of their connections while real brains do not use backprop

    • there is no way to send information backwards along an axon
    • they use local update rules for adjusting synaptic strengths
  • Most ANNs are strictly feedforward (前馈的), but real brains have many feedback connections

    • It is believed that this feedback acts like a prior
  • Most ANNs use simplified neurons consisting of a weighted sum passed through a nonlinearity, but real biological neurons have complex dendritic tree structures (see Figure 13.8), with complex spatio-temporal dynamics.

  • Most ANNs are smaller in size and number of connections than biological brains

  • Most ANNs are designed to model a single function while biological brains are very complex systems that implement different kinds of functions or behaviors

Backpropagation

  • backpropagation
    • simple linear chain of stacked layers: repeated applications of the chain rule of calculus
    • arbitrary directed acyclic graphs (DAGs): automatic differentiation or autodiff.

Forward vs reverse mode differentiation

  • Consider a mapping of the form
    • and
    • is defined as a composition of functions:

  • , and

  • The intermediate steps needed to compute are , and .

  • We can compute the Jacobian using the chain rule:

  • we only need to consider how to compute the Jacobian efficiently
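A hedged sketch using JAX (assumed installed) to compute the Jacobian of a composed function with forward-mode (jacfwd) and reverse-mode (jacrev) autodiff; reverse mode is the one that corresponds to backpropagation when the output is a scalar loss. The function itself is illustrative.

import jax.numpy as jnp
from jax import jacfwd, jacrev

def f(x):
    # a composition of simple differentiable maps, ending in a scalar (like a loss)
    x2 = jnp.sin(x)
    x3 = x2 ** 2
    return jnp.sum(x3)

x = jnp.array([0.1, 0.2, 0.3])
print(jacfwd(f)(x))           # forward-mode Jacobian (here a gradient vector)
print(jacrev(f)(x))           # reverse-mode Jacobian: same values, different cost profile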

Computation graphs

  • Modern DNNs can combine differentiable components in much more complex ways, to create a computation graph, analogous to how programmers combine elementary functions to make more complex ones.
  • The only restriction is that the resulting computation graph corresponds to a directed ayclic graph (DAG), where each node is a differentiable function of all its inputs.
  • example

  • We can compute this using the DAG in Figure 13.11, with the following intermediate functions:

    • we have numbered the nodes in topological order (parents before children)
    • During the backward pass, since the graph is no longer a chain, we may need to sum gradients along multiple paths. For example, since influences and , we have

    • We can avoid repeated computation by working in reverse topological order. For example,

    • In general, we use

where the sum is over all children of node , as shown in Figure . The gradient vector has already been computed for each child ; this quantity is called the adjoint. This gets multiplied by the Jacobian of each child.

Training neural networks

  • fit DNNs to data
  • The standard approach is to use maximum likelihood estimation, by minimizing the NLL:

Tuning the learning rate

It is important to tune the learning rate (step size), to ensure convergence to a good solution. (Section 8.4.3.)

Vanishing and exploding gradients

  • vanishing gradient problem(梯度消失): When training very deep models, the gradient become very small

  • exploding gradient problem(梯度爆炸): When training very deep models, the gradient become very large

  • consider the gradient of the loss wrt a node at layer :

    • is the Jacobian matrix
    • is the gradient at the next layer. If is constant across layers, it is clear that the contribution of the gradient from the final layer, , to layer will be . Thus the behavior of the system depends on the eigenvectors of .
  • The exploding gradient problem can be ameliorated by gradient clipping(梯度裁剪), in which we cap the magnitude of the gradient if it becomes too large, i.e., we use

$$\mathbf{g}' = \min\!\left(1, \frac{c}{\lVert \mathbf{g} \rVert}\right) \mathbf{g}$$

    • This way, the norm of $\mathbf{g}'$ can never exceed $c$, but the vector is always in the same direction as $\mathbf{g}$. (A one-line numpy version is sketched after this list.)
  • the vanishing gradient problem is more difficult to solve
    • Modify the the activation functions at each layer to prevent the gradient from becoming too large or too small
    • Modify the architecture so that the updates are additive rather than multiplicative
    • Modify the architecture to standardize the activations at each layer, so that the distribution of activations over the dataset remains constant during training
    • Carefully choose the initial values of the parameters
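A minimal numpy sketch of gradient clipping by norm, implementing the update above; the threshold c and the example gradient are illustrative:

import numpy as np

def clip_by_norm(g, c=1.0):
    # rescale g so that ||g|| never exceeds c, keeping its direction
    norm = np.linalg.norm(g)
    return g * min(1.0, c / norm) if norm > 0 else g

g = np.array([3.0, 4.0])                 # ||g|| = 5
print(clip_by_norm(g, c=1.0))            # -> [0.6, 0.8]: norm 1, same direction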

Non-saturating activation functions



reason for the vanishing gradient problem

  • setting: , where

  • for saturating activation functions

  • the gradient of the loss wrt the inputs (from an earlier layer)

  • the gradient of the loss wrt the inputs is

  • the gradient of the loss wrt the parameters is

  • if the activation output $\varphi(a)$ is near 0 or 1, the gradients will go to 0.
  • One of the keys to being able to train very deep models is to use non-saturating activation functions.
Name                      Definition                                   Range                 Reference
Sigmoid                   $\sigma(a) = \frac{1}{1 + e^{-a}}$           $[0, 1]$
Hyperbolic tangent        $\tanh(a) = 2\sigma(2a) - 1$                 $[-1, 1]$
Softplus                  $\sigma_{+}(a) = \log(1 + e^{a})$            $[0, \infty)$         [GBB11]
Rectified linear unit     $\mathrm{ReLU}(a) = \max(a, 0)$              $[0, \infty)$         [GBB11; KSH12]
Leaky ReLU                $\max(a, 0) + \alpha \min(a, 0)$             $(-\infty, \infty)$   [MHN13]
Exponential linear unit   $\max(a, 0) + \min(\alpha(e^{a} - 1), 0)$    $(-\infty, \infty)$   [CUH16]
Swish                     $a\,\sigma(a)$                               $(-\infty, \infty)$   [RZL17]
GELU                      $a\,\Phi(a)$                                 $(-\infty, \infty)$   [HG16]

ReLU

  • The most common is rectified linear unit (修正线性单元) or ReLU

  • The gradient has the following form:

  • the gradient will not vanish, as long as the pre-activation $a$ is positive.
    • suppose we use this in a layer to compute .
    • the gradient wrt the inputs has the form

  • the gradient wrt the parameters

  • the “dead ReLU” problem:
    • if the weights are initialized to be large and negative, then it becomes very easy for (some components of) to take on large negative values, and hence for to go to 0 .
    • This will cause the gradient for the weights to go to 0 .
    • The algorithm will never be able to escape this situation,
    • the hidden units (components of ) will stay permanently off.

Non-saturating ReLU

  • the leaky ReLU

$$\mathrm{LReLU}(a; \alpha) = \max(\alpha a,\, a), \quad 0 < \alpha < 1$$

    • The slope of this function is 1 for positive inputs and $\alpha$ for negative inputs, thus ensuring there is some signal passed back to earlier layers, even when the input is negative.
    • If we allow the parameter $\alpha$ to be learned, rather than fixed, the leaky ReLU is called the parametric ReLU.
  • the Exponential Linear Unit, ELU (指数线性单元)

    • This has the advantage over leaky ReLU of being a smooth function.
  • SELU (self-normalizing ELU): A slight variant of ELU

    • by setting and to carefully chosen values, this activation function is guaranteed to ensure that the output of each layer is standardized (provided the input is also standardized)
    • This can help with model fitting.
  • Softplus函数[Dugas et al., 2001]

    • 可以看作是 rectifier 函数的平滑版本,其定义为:

Other choices

  • swish (do well on some image classification benchmarks)

    • also called SiLU (for Sigmoid Linear Unit)
    • 可看作一种软性门控机制:
      • 接近1时,门处于“开”状态,激活函数的输出近似于 本身
      • 接近0时,门处于“关”状态,激活函数的输出近似于0
  • Maxout单元

Maxout 单元 [Goodfellow et al., 2013] 也是一种分段线性函数。其他激活函数的输入为上一层神经元的净输入,而 Maxout 单元的输入为上一层神经元的全部原始输入。每个 Maxout 单元有 $K$ 个权向量和偏置:

Maxout单元非线性函数定义为:

  • Gaussian Error Linear Unit, GELU

    • where is the cdf of a standard normal:

    • We can think of GELU as a "soft" version of ReLU, since it replaces the step function with the Gaussian cdf, .
    • the GELU can be motivated as an adaptive version of dropout, where we multiply the input by a binary scalar mask $m \sim \mathrm{Ber}(\Phi(a))$, so the probability of being dropped is $1 - \Phi(a)$. Thus the expected output is

$$\mathbb{E}[a\, m] = a\, \Phi(a)$$
    • We can approximate GELU using swish with a particular parameter setting, namely

Residual connections

  • residual network or ResNet (残差网络)
    • One solution to the vanishing gradient problem for DNNs
    • this is a feedforward model in which each layer has the form of a residual block, defined by

      • is a standard shallow nonlinear mapping (e.g., linear-activation-linear).
      • The inner function computes the residual term or delta that needs to be added to the input to generate the desired output
      • it is often easier to learn to generate a small perturbation to the input than to directly predict the output.
  • A model with residual connections has the same number of parameters as a model without residual connections, but it is easier to train
    • gradients can flow directly from the output to earlier layers (Figure 13.15b)
    • the activations at the output layer can be derived in terms of any previous layer using

    • the gradient of the loss wrt the parameters of the 'th layer:

  • Thus we see that the gradient at layer $l$ depends directly on the gradient at the final layer, in a way that is independent of the depth of the network. (A minimal residual block sketch follows.)
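A hedged PyTorch sketch (torch assumed installed) of a single residual block, output = input + F(input), where F is a shallow nonlinear mapping; layer sizes are illustrative:

import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        # F: a standard shallow nonlinear mapping (linear -> activation -> linear)
        self.F = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, z):
        return z + self.F(z)          # residual connection: add the predicted "delta" to the input

x = torch.randn(8, 64)
block = ResidualBlock()
print(block(x).shape)                 # torch.Size([8, 64])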

Regularization

Early stopping

  • the heuristic of stopping the training procedure when the error on the validation set starts to increase
  • This method works because we are restricting the ability of the optimization algorithm to transfer information from the training examples to the parameters

Weight decay

  • impose a prior on the parameters, and then use MAP estimation.
  • It is standard to use a Gaussian prior for the weights and biases, .
  • This is equivalent to regularization of the objective.
  • this is called weight decay, since it encourages small weights, and hence simpler models, as in ridge regression

Dropout

  • randomly (on a per-example basis) turn off all the outgoing connections from each neuron with probability
  • Dropout can dramatically reduce overfitting and is very widely used.
  • it prevents complex co-adaptation of the hidden units.

Bayesian neural networks

  • Modern DNNs are usually trained using a (penalized) maximum likelihood objective to find a single setting of parameters.
  • with large models, there are often many more parameters than data points
  • there may be multiple possible models which fit the training data equally well, yet which generalize in different ways.
  • It is often useful to capture the induced uncertainty in the posterior predictive distribution

  • Bayesian neural network or BNN.
    • It can be thought of as an infinite ensemble of differently weighted neural networks.
    • By marginalizing out the parameters, we can avoid overfitting.

卷积神经网络(Convolutional Neural Networks, CNNs)

全连接前馈神经网络vs卷积神经网络



  • 全连接前馈神经网络

    • 权重矩阵的参数非常多
  • 局部不变性特征

    • 自然图像中的物体都具有局部不变性特征
    • 尺度缩放、平移、旋转等操作不影响其语义信息。
    • 全连接前馈网络很难提取这些局部不变特征
  • 卷积神经网络(Convolutional Neural Networks,CNN)

    • 一种前馈神经网络
    • 受生物学上感受野(Receptive Field)的机制而提出的
      • 在视觉神经系统中,一个神经元的感受野是指视网膜上的特定区域,只有这个区域内的刺激才能够激活该神经元。
  • 卷积神经网络有三个结构上的特性:

    • 局部连接
    • 权重共享
    • 空间或时间上的次采样

卷积


  • 卷积经常用在信号处理中,用于计算信号的延迟累积。
  • 假设一个信号发生器每个时刻产生一个信号,其信息的衰减率为,即在个时间步长后,信息为原来的
    • 假设, ,
  • 时刻收到的信号为当前时刻产生的信息和以前时刻延迟信息的叠加

  • 滤波器(filter)或卷积核(convolution kernel)

给定一个输入信号序列和滤波器,卷积的输出为:

低频信息
高频信息

二阶微分

卷积扩展

步长(Stride) 是指卷积核在滑动时的时间间隔.图(a)给出了步长为2的卷积示例.

零填充(Zero Padding) 是在输入向量两端进行补零.图(b)给出了输入的两端各补一个零后的卷积示例.

卷积类型

  • 卷积的结果按输出长度不同可以分为三类:

    • 窄卷积:步长,两端不补零,卷积后输出长度为
    • 宽卷积:步长,两端补零,卷积后输出长度
    • 等宽卷积:步长,两端补零,卷积后输出长度
  • 早期的文献中,卷积一般默认为窄卷积

  • 目前的文献中,卷积一般默认为等宽卷积

二维卷积

  • 在图像处理中,图像是以二维矩阵的形式输入到神经网络中,因此我们需要二维卷积。
  • 一个输入信息和滤波器的二维卷积定义为:

卷积作为特征提取器

互相关

  • 计算卷积需要进行卷积核翻转。
  • 卷积操作的目标:
    • 提取特征
    • 翻转是不必要的
  • 互相关(cross-correlation):不翻转卷积核,直接计算滑动窗口内的加权和;深度学习框架中的"卷积"实际上是互相关。
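A small numpy sketch (illustrative signal and filter) showing the difference between convolution, which flips the kernel, and cross-correlation, which does not:

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])        # input signal
w = np.array([1.0, 0.0, -1.0])                 # filter / kernel

conv = np.convolve(x, w, mode="valid")         # convolution: kernel is flipped -> [2. 2. 2.]
corr = np.correlate(x, w, mode="valid")        # cross-correlation: no flip -> [-2. -2. -2.]
print(conv)
print(corr)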

二维卷积的扩展

步长1,零填充0

步长2,零填充0

步长1,零填充1

步长2,零填充1

卷积神经网络 = 卷积层+汇聚层+全连接层

  • 卷积层代替全连接层

  • 卷积层的两个重要性质
    • 局部连接
    • 权重共享 ,

卷积层的典型结构

  • 输入特征映射组为三维张量,其中每个切片矩阵为一个输入特征映射,
  • 输出特征映射组为三维张量,其中每个切片矩阵为一个输出特征映射,
  • 卷积核为四维张量,其中每个切片矩阵为一个二维卷积核,

卷积层的映射关系

汇聚层



  • 卷积层虽然可以显著减少连接的个数,但是每一个特征映射的神经元个数并没有显著减少。

  • 汇聚层(Pooling Layer)也叫子采样层(Subsampling Layer),其作用是进行特征选择,降低特征数量,从而减少参数数量.

  • 常用的汇聚函数:

    • 最大汇聚(Maximum Pooling或Max Pooling):对于一个区域选择这个区域内所有神经元的最大活性值作为这个区域的表示

    • 平均汇聚(Mean Pooling):一般是取区域内所有神经元活性值的平均
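A small numpy sketch (illustrative) of max pooling and mean pooling over non-overlapping 2x2 regions of a feature map:

import numpy as np

fmap = np.arange(16, dtype=float).reshape(4, 4)        # a 4x4 feature map

# pool each non-overlapping 2x2 region
max_pool = np.array([[fmap[i:i+2, j:j+2].max() for j in (0, 2)] for i in (0, 2)])    # 最大汇聚
mean_pool = np.array([[fmap[i:i+2, j:j+2].mean() for j in (0, 2)] for i in (0, 2)])  # 平均汇聚
print(max_pool)
print(mean_pool)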

卷积网络结构

  • 卷积网络是由卷积层、汇聚层、全连接层交叉堆叠而成。
    • 趋向于小卷积、大深度
    • 趋向于全卷积
  • 典型结构
    • 一个卷积块为连续个卷积层和个汇聚层(通常设置为)。一个卷积网络中可以堆叠个连续的卷积块,然后在接着个全连接层(的取值区间比较大,比如或者更大;一般为)。

其它卷积种类

  • 转置卷积(Transposed Convolution)/反卷积(Deconvolution)
    • 低维特征映射到高维特征
  • 微步卷积(Fractionally-Strided Convolution)
    • 低维特征映射到高维特征
  • 空洞卷积(Atrous Convolution)/膨胀卷积(Dilated Convolution)
    • 通过给卷积核插入“空洞”来变相地增加其大小。

典型的卷积网络:LeNet-5

  • LeNet-5 是一个非常成功的神经网络模型。
    • 基于 LeNet-5 的手写数字识别系统在 90 年代被美国很多银行使用,用来识别支票上面的手写数字。
    • LeNet-5 共有 7 层。

典型的卷积网络:AlexNet

  • 2012 ILSVRC winner(top 5 error of 16% compared to runner-up with 26% error)
    • 第一个现代深度卷积网络模型
    • 首次使用了很多现代深度卷积网络的一些技术方法
      • 使用GPU进行并行训练,采用了ReLU作为非线性激活函数,使用Dropout防止过拟合,使用数据增强
    • 5个卷积层、3个汇聚层和3个全连接层

典型的卷积网络:Inception网络(GoogLeNet)

  • 2014 ILSVRC winner (22层)
    • 参数:GoogLeNet:4M VS AlexNet:60M
    • 错误率:6.7%
    • Inception网络是由有多个inception模块和少量的汇聚层堆叠而成。

典型的卷积网络:Inception模块 v1

  • 在卷积网络中,如何设置卷积层的卷积核大小是一个十分关键的问题。
  • 在Inception网络中,一个卷积层包含多个不同大小的卷积操作,称为Inception模块。
  • Inception模块同时使用等不同大小的卷积核,并将得到的特征映射在深度上拼接(堆叠)起来作为输出特征映射。

典型的卷积网络:Inception模块 v3

  • 用多层小卷积核替换大卷积核,以减少计算量和参数量。
  • 使用两层 3x3 的卷积来替换 v1 中的 5x5 卷积
  • 使用连续的 1xn 和 nx1 卷积来替换 nxn 的卷积。

典型的卷积网络:残差网络

  • 残差网络(Residual Network,ResNet)是通过给非线性的卷积层增加直连边的方式来提高信息的传播效率。
    • 假设在一个深度网络中,我们期望一个非线性单元(可以为一层或多层的卷积层)去逼近一个目标函数为
    • 将目标函数拆分成两部分:恒等函数和残差函数

  • 残差单元

典型的卷积网络:ResNet

  • 2015 ILSVRC winner (152层)
  • 错误率:3.57%

循环神经网络(Recurrent Neural Networks, RNNs)

前馈网络

  • 连接存在层与层之间,每层的节点之间是无连接的。(无循环)
  • 输入和输出的维数都是固定的,不能任意改变。无法处理变长的序列数据。
  • 假设每次输入都是独立的,也就是说每次网络的输出只依赖于当前的输入。

可计算问题

  • 有限状态自动机(Finite Automata)
  • 图灵机

给网络增加记忆能力:延时神经网络(Time Delay Neural Network,TDNN)

  • 建立一个额外的延时单元,用来存储网络的历史信息(可以包括输入、输出、隐状态等)


  • 这样,前馈网络就具有了短期记忆的能力。

给网络增加记忆能力:自回归模型(Autoregressive Model,AR)

  • 自回归模型:一类时间序列模型,用变量的历史信息来预测自己

    • 为第t个时刻的噪声
  • 有外部输入的非线性自回归模型(Nonlinear Autoregressive with Exogenous Inputs Model,NARX)

    • 其中表示非线性函数,可以是一个前馈网络,为超参数.

循环神经网络( Recurrent Neural Network ,RNN )

  • 循环神经网络通过使用带自反馈的神经元,能够处理任意长度的时序数据。

  • 循环神经网络比前馈神经网络更加符合生物神经网络的结构。
  • 循环神经网络已经被广泛应用在语音识别、语言模型以及自然语言生成等任务上
  • 作用:
    • 输入-输出映射:机器学习模型
    • 存储器:联想记忆模型

按时间展开

简单循环网络( Simple Recurrent Network , SRN )

  • 状态更新:

$$\mathbf{h}_t = f(\mathbf{U}\mathbf{h}_{t-1} + \mathbf{W}\mathbf{x}_t + \mathbf{b})$$
  • 一个完全连接的循环网络是任何非线性动力系统的近似器 。
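A minimal numpy sketch of the simple recurrent network state update above, unrolled over a short input sequence; all shapes, the tanh nonlinearity, and the random values are illustrative assumptions:

import numpy as np

rng = np.random.default_rng(0)
input_dim, hidden_dim, T = 3, 4, 5                 # toy sizes and sequence length

U = rng.standard_normal((hidden_dim, hidden_dim))  # recurrent (hidden-to-hidden) weights
W = rng.standard_normal((hidden_dim, input_dim))   # input-to-hidden weights
b = np.zeros(hidden_dim)
xs = rng.standard_normal((T, input_dim))           # input sequence x_1, ..., x_T

h = np.zeros(hidden_dim)                           # initial state h_0
for x_t in xs:
    h = np.tanh(U @ h + W @ x_t + b)               # state update (f = tanh)
print(h)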

图灵完备


  • 图灵完备(Turing Completeness)是指一种数据操作规则,比如一种计算机编程语言,可以实现图灵机的所有功能,解决所有的可计算问题。
  • 一个完全连接的循环神经网络可以近似解决所有的可计算问题。

参数学习

  • 机器学习
    • 给定一个训练样本,其中
      • 为长度是的输入序列,
      • 是长度为的标签序列。
  • 时刻t的瞬时损失函数为

  • 总损失函数

随时间反向传播算法






梯度消失/爆炸与长程依赖

  • 梯度

  • 其中

  • 长程依赖问题:由于梯度爆炸或消失问题,实际上只能学习到短周期的依赖关系。

  • 改进梯度爆炸问题

    • 权重衰减
    • 梯度截断
  • 改进梯度消失问题

    • 改进模型

长程依赖问题的改进方法

  • 循环边改为线性依赖关系

  • 增加非线性

长短期记忆神经网络(Long Short-Term Memory, LSTM )

LSTM的各种变体

  • 没有遗忘门

  • 耦合输入门和遗忘门

  • peephole连接

门控循环单元(Gated Recurrent Unit, GRU)

  • 重置门

  • 更新门

深层模型

堆叠循环神经网络

  • 将多个循环网络堆叠起来

双向循环神经网络

  • 增加一个按照时间的逆序来传递信息的网络层,来增强网络的能力.

循环网络应用

  • 传统统计机器翻译

    • 源语言:
    • 目标语言:
      • 模型:

      • : 翻译模型
      • : 语言模型
  • 基于序列到序列的机器翻译

    • 一个RNN用来编码
    • 另一个RNN用来解码
  • 看图说话
