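| Property | Statistical inference | Supervised machine learning |
|---|---|---|
| Goal | Causal models with explanatory power | Predictive performance, often with limited explanatory power |
| Data | The data are generated by a model | The data-generating process is unknown |
| Framework | Probabilistic | Algorithmic and probabilistic |
| Expressiveness | Typically linear | Nonlinear |
| Model selection | Based on information criteria | Numerical optimization |
| Scalability | Limited to low-dimensional data | Scales to high-dimensional input data |
| Robustness | Prone to overfitting | Designed for out-of-sample performance |
| Diagnostics | Extensive | Limited |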

- **The probabilistic approach**
  - treat all unknown quantities as random variables
  - it is the optimal approach to decision making under uncertainty

> Almost all of machine learning can be viewed in probabilistic terms, making probabilistic thinking fundamental. It is, of course, not the only view. But it is through this view that <font color="red">we can connect what we do in machine learning to every other computational science, whether that be in stochastic optimisation, control theory, operations research, econometrics, information theory, statistical physics or bio-statistics.</font> For this reason alone, mastery of probabilistic thinking is essential.
> <p align="right">---Shakir Mohamed, DeepMind</p>

---

<div align="center">
<table rules="none">
<tr>
<td>
<div style="width: 300">

### Clustering

- Find **clusters** in the data: partition the inputs into regions that contain "similar" points.

</div>
</td>
<td>
<div style="width: 500pt">
<img align="center" style="padding-right:10px;" width=100% src="../myfig/L01/L01-clustering.png">
</div>
</td>
</tr>
</table>
</div>
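
To make the idea of partitioning inputs into regions of "similar" points concrete, here is a minimal sketch using k-means, one common clustering algorithm chosen purely for illustration (the slide does not name a specific method). The two synthetic blobs, the variable names, and the choice of two clusters are all assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data (an assumption for illustration): two well-separated blobs.
rng = np.random.default_rng(0)
blob_a = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(50, 2))
blob_b = rng.normal(loc=[5.0, 5.0], scale=0.5, size=(50, 2))
X_demo = np.vstack([blob_a, blob_b])

# Partition the inputs into two groups of "similar" (nearby) points.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X_demo)

# Cluster sizes and the estimated cluster centres.
print(np.bincount(labels))
print(kmeans.cluster_centers_)
```

No labels are used at any point: the grouping comes only from distances between the inputs, which is what distinguishes clustering from the supervised methods in the rest of the deck.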

---

## 7. Python Implementation Example: Basic Implementation

```python
# Basic PCR implementation
# Assumes X_train, X_test and y_train are already defined (e.g. via train_test_split).
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
import numpy as np
import matplotlib.pyplot as plt

# Build the pipeline
pcr_model = make_pipeline(
    StandardScaler(),
    PCA(n_components=5),  # number of principal components to keep
    LinearRegression()
)

# Fit the model
pcr_model.fit(X_train, y_train)

# Predict
predictions = pcr_model.predict(X_test)
```

---

## 7. Python Implementation Example: Component Selection

```python
# Use cross-validation to choose the best number of principal components
# Assumes X and y are already loaded, plus the imports from the previous slide.
from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_squared_error

# Try every possible number of principal components
n_components = range(1, X.shape[1] + 1)
cv_scores = []

for n in n_components:
    pcr_model = make_pipeline(
        StandardScaler(),
        PCA(n_components=n),
        LinearRegression()
    )
    # Evaluate performance with 5-fold cross-validation (sign flipped to get MSE)
    scores = -cross_val_score(pcr_model, X, y, cv=5,
                              scoring='neg_mean_squared_error')
    cv_scores.append(scores.mean())

# Plot the results
plt.figure(figsize=(10, 6))
plt.plot(n_components, cv_scores, marker='o')
plt.xlabel('Number of principal components')
plt.ylabel('MSE')
plt.title('PCR: MSE vs. number of principal components')
plt.show()
```

---

## 7. Python Implementation Example: Complete Example

```python
# Complete PCR example
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import mean_squared_error, r2_score

# 1. Load the data and split it (assumes X and y are already loaded)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# 2. Standardize (fit the scaler on the training set only)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# 3. PCA with all components
pca = PCA()
X_train_pca = pca.fit_transform(X_train_scaled)

# 4. Inspect the cumulative explained variance
explained_variance = pca.explained_variance_ratio_
cumulative_variance = np.cumsum(explained_variance)
optimal_n = np.argmax(cumulative_variance >= 0.95) + 1  # components needed to explain 95% of the variance

# 5. Rebuild the PCA with the selected number of components
pca = PCA(n_components=optimal_n)
X_train_pca = pca.fit_transform(X_train_scaled)
X_test_pca = pca.transform(X_test_scaled)

# 6. Fit the regression model
regressor = LinearRegression()
regressor.fit(X_train_pca, y_train)

# 7. Predict
y_train_pred = regressor.predict(X_train_pca)
y_test_pred = regressor.predict(X_test_pca)

# 8. Evaluate the model
train_mse = mean_squared_error(y_train, y_train_pred)
test_mse = mean_squared_error(y_test, y_test_pred)
test_r2 = r2_score(y_test, y_test_pred)

print(f"Number of principal components used: {optimal_n}")
print(f"Training MSE: {train_mse:.4f}")
print(f"Test MSE: {test_mse:.4f}")
print(f"Test R²: {test_r2:.4f}")
```

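| | SVC | SVM |
|---|---|---|
| Inner product / kernel | $\langle x_i, x_{i'} \rangle$ | $K(x_i, x_{i'})$ |
| Functional form | $f(x) = \beta_0 + \sum_{i \in \mathcal{S}} \alpha_i \langle x, x_i \rangle$ | $f(x) = \beta_0 + \sum_{i \in \mathcal{S}} \alpha_i K(x, x_i)$ |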
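
As a rough check of how the two methods differ in functional form, the sketch below fits scikit-learn's `SVC` once with a linear kernel (the support vector classifier) and once with an RBF kernel (one possible kernel choice for an SVM). It then recomputes each decision function by hand as an intercept plus a weighted sum, over the support vectors, of inner products or kernel values. The `make_moons` data, `gamma = 0.5`, and the helper `manual_decision` are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

# Toy binary classification data (an assumption for illustration).
X_demo, y_demo = make_moons(n_samples=200, noise=0.2, random_state=0)

gamma = 0.5
linear_svc = SVC(kernel="linear", C=1.0).fit(X_demo, y_demo)
rbf_svm = SVC(kernel="rbf", gamma=gamma, C=1.0).fit(X_demo, y_demo)


def manual_decision(model, kernel_matrix):
    """Intercept plus weighted sum of kernel values over the support vectors."""
    return kernel_matrix @ model.dual_coef_.ravel() + model.intercept_[0]


X_new = X_demo[:5]

# SVC: similarity is the plain inner product <x, x_i>.
f_linear = manual_decision(linear_svc, X_new @ linear_svc.support_vectors_.T)

# SVM: the inner product is replaced by a kernel K(x, x_i), here the RBF kernel.
f_rbf = manual_decision(
    rbf_svm, rbf_kernel(X_new, rbf_svm.support_vectors_, gamma=gamma)
)

print(np.allclose(f_linear, linear_svc.decision_function(X_new)))
print(np.allclose(f_rbf, rbf_svm.decision_function(X_new)))
```

Only the similarity measure changes between the two fits; the rest of the expression is the same, which is why the kernel can be swapped without changing the fitting procedure.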

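| Method | Key characteristic | Uses Y in the transformation? | Best suited for |
|---|---|---|---|
| PCR | Regression after PCA | No | Many correlated predictors |
| PLS | Builds components that are related to Y | Yes | When Y should guide the dimension reduction |
| Ridge regression | Shrinks all coefficients | No | When all predictors should be retained |
| Lasso | Shrinks some coefficients exactly to zero | No | When feature selection is needed |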

- Sometimes the data are not separable.
- Sometimes the maximal margin classifier is very sensitive to noisy data.
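
Both problems are usually handled by allowing some margin violations (a soft margin). The sketch below assumes scikit-learn's `SVC` with a linear kernel, where the parameter `C` penalizes violations: a small `C` tolerates many violations and gives a wide margin, while a large `C` tolerates few. The overlapping blobs and the grid of `C` values are assumptions for illustration.

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two overlapping classes (assumed data), so no separating hyperplane exists.
X_demo, y_demo = make_blobs(
    n_samples=200, centers=[[0.0, 0.0], [2.0, 2.0]],
    cluster_std=1.2, random_state=0,
)

n_support = {}
for C in [0.01, 1.0, 100.0]:
    clf = SVC(kernel="linear", C=C).fit(X_demo, y_demo)
    # Observations on or inside the margin become support vectors.
    n_support[C] = int(clf.n_support_.sum())
    accuracy = clf.score(X_demo, y_demo)
    print(f"C={C}: {n_support[C]} support vectors, training accuracy {accuracy:.3f}")
```

With a small `C` many observations end up as support vectors, so the fitted boundary depends on many points and is less sensitive to any single noisy one.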

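01 Frontiers of Digital Technology and Financial Engineering

Machine Learning and Financial Engineering (Shallow Learning)

Outline

What is machine learning?

General workflow for using machine learning models

Overfitting and generalization

Evaluating ML algorithm performance: error and overfitting

Learning curve

Fitted curves

Evaluating ML algorithm performance: preventing overfitting in supervised learning

No free lunch theorem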

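Supervised learning

Statistical inference and supervised learning

Regression

(Simple) linear regression

Goodness of fit

Ridge regression

Lasso regression

Elastic net

Polynomial regression

Spline regression: step functions

Spline regression: basis functions

Spline regression: piecewise polynomials

Spline regression: constraints and splines

Spline regression: spline basis representation

Comparison with polynomial regression

Smoothing splines

Local regression

Algorithm

Generalized additive models (GAM)

Robust linear regression

Overview of partial least squares (PLS)

How PLS works

PLS compared with other methods

Applications of PLS in economics, management, and finance

PLS: pros, cons, and implementation advice

Recommendations for choosing a regression algorithm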

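Classification

Example: classifying Iris flowers

Exploratory data analysis

Learning a classifier

Empirical risk minimization

Uncertainty

Maximum likelihood estimation

Logistic regression

Estimation and prediction

Multiple logistic regression

Prediction

Multinomial logistic regression

Generative classification models

Generative classification models: linear discriminant analysis (LDA)

Example

Generative classification models: LDA

Generative classification models: quadratic discriminant analysis (QDA)

Generative classification models: naive Bayes

Generalized additive models (GAM)

Support vector machines: hyperplanes

Classification using a separating hyperplane

Separating hyperplanes

Maximal margin classifier

The non-separable case and noisy data

Support vector classifier (SVC)

Parameters

Support vector machine (SVM)

Nonlinear classifiers using polynomial features

Kernel functions

SVC vs. SVM

Multiclass SVM

Relationship to logistic regression

Recommendations for choosing a classification algorithm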

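Tree-based models

Model definition

Model fitting

Regularization

Pros and cons of CART

Pros

Cons

Ensemble learning

Stacking

Ensemble learning vs. Bayesian model averaging (BMA)

Principal component regression (PCR): pros and cons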