Lecture 05

Big Data in Finance: Text and Image Analytics

In the age of big data, the analyst who can extract signal from noise—whether in words or pixels—wins.

Financial Machine Learning · Lecture 05

Outline

Part 1 · Introduction: Big Data and Finance Overview

  • Big Data Concepts and Financial Data Landscape
  • Alternative Data Types: Focus on Text and Images
  • Opportunities, Risks, and Governance

Big Data: The 5V Framework

Five dimensions of big data:

  • Volume — Terabytes to petabytes of data
  • Velocity — Real-time streaming and updates
  • Variety — Structured, unstructured, semi-structured
  • Veracity — Data quality and reliability
  • Value — Actionable insights for decisions

Figure (pie chart): Financial Data Mix. Structured 20, Text 45, Images 25, Other 10 (shares of 100).

Structured vs Unstructured Data

Type | Examples | Challenges
Structured | Prices, volumes, financials | Limited information scope
Unstructured text | News, filings, social media | Ambiguity, context-dependence
Unstructured images | Satellite, documents, charts | High dimensionality, noise

Key insight: ~80% of financial data is unstructured.
Source: Consensus estimates from IDC, Gartner, and Merrill Lynch.


Overall Framework of Financial Big Data Analysis

  • Complete Data Analysis Workflow

    Problem Definition → Data Acquisition → Data Preprocessing → Feature Engineering → Model Building → Result Evaluation → Decision Support (Result Evaluation also feeds back to Problem Definition)
  • Typical Application Scenarios in Finance

    Application Scenario | Data Type | Common Methods
    Risk Management | Market data, text, images | Predictive modeling, anomaly detection
    Investment Decisions | Financial reports, news, social media | Sentiment analysis, topic modeling
    Fraud Detection | Transaction records, behavioral data | Graph neural networks, time-series analysis
    Market Forecasting | Price data, macro indicators, text | Deep learning, reinforcement learning


Alternative Data Types

Text Data Sources:

  • Financial news and wire services
  • SEC filings (10-K, 10-Q, 8-K)
  • Earnings call transcripts
  • Social media (Twitter, Reddit)
  • Central bank communications

Image Data Sources:

  • Satellite and aerial imagery
  • Scanned financial documents
  • Trading interface screenshots
  • Street view and consumer scenes
  • Chart and graph images

Why Text and Images Carry Alpha

Information advantage from alternative data:

  • Forward-looking signals — Sentiment precedes price moves
  • Non-priced information — Not yet in market consensus
  • Behavioral insights — Reveal investor psychology
  • Real-time updates — Faster than official releases

Example: Hedge funds combine news sentiment + satellite data for macro nowcasting (Source: Katona et al., 2025)


Opportunities, Risks, and Governance

Opportunities:

  • Enhanced return prediction
  • Improved risk management
  • Automated compliance monitoring
  • Real-time market surveillance

Risks & Challenges:

  • Model risk — Overfitting, non-stationarity
  • Data risk — Bias, leakage, quality issues
  • Regulatory — Privacy, explainability
  • Ethical — Fairness, transparency

Regulatory Landscape for AI in Finance

Key regulatory considerations:

  • EU AI Act — Risk-based classification of AI systems
  • SEC guidance — Disclosure of AI in trading strategies
  • ESMA — Algorithmic trading requirements
  • Model Risk Management — SR 11-7 guidelines

Best practices:

  • Document model development and validation
  • Ensure human oversight in critical decisions
  • Monitor for drift and performance degradation

Part 2 · Text Analytics in Finance

  • Landscape of Financial Text Data and NLP Pipeline
  • Bag of Words, TF-IDF, and Text Regression
  • Word Embeddings, Topics, and Beyond
  • Financial Applications: Asset Pricing, Risk, and Policy

Financial Text Data: Types, Sources, and Analytical Value

Main categories of financial text data:

  • Corporate disclosures and reports: Annual & quarterly reports, earnings forecasts
  • Regulatory documents: SEC filings, policies, regulations
  • Professional news and analysis: Bloomberg, Reuters, financial media
  • Social media content: Twitter, Xueqiu, Reddit, etc.
  • Central bank communications: Monetary policy reports, FOMC minutes

Analytical value of financial text data:

  • Extract market sentiment and investor expectations
  • Identify potential risks and opportunities
  • Quantify qualitative information for investment decisions
  • Predict stock price movements and market trends

Key data sources and applications:

Source | Update Frequency | Key Use Cases
News wires | Real-time | Sentiment analysis, event detection
SEC filings | Quarterly / Annual | Risk factor extraction, MD&A analysis
Earnings calls | Quarterly | Tone analysis, management guidance
Social media | Continuous | Retail sentiment, rumor tracking
Central bank communications | Scheduled | Policy expectation analysis

NLP 1.0: The Three-Step Roadmap

Classic NLP pipeline for financial text:

Raw Text → Numerical Representation → Information Retrieval → Analysis
  1. Numerical representation — Transform documents to vectors
  2. Information retrieval — Dimensionality reduction, selection
  3. Causal/predictive analysis — Regression, classification

This pipeline underlies most econometric text analysis (Source: Gaillac & L'Hour, 2024)

Method taxonomy (diagram): Financial text analysis
  • Basic processing methods: tokenization and part-of-speech tagging, stop-word filtering, lemmatization, entity recognition
  • Representation learning methods: bag-of-words / TF-IDF, word embeddings (Word2Vec, GloVe), pre-trained language models (BERT, FinBERT, large language models)
  • Advanced analysis methods: sentiment analysis, topic models, event extraction, causal relationship analysis

Text Preprocessing Techniques and Pipeline

1. Text Cleaning (Normalization)

  • Remove HTML tags, special characters, numbers
  • Standardize case, detect URLs
  • Spelling correction and normalization

2. Tokenization

  • English: Split by spaces & punctuation
  • Chinese: Dictionary / statistical segmentation (e.g., jieba, THULAC)
  • Handle domain-specific terms (“CPI”, “Quantitative Easing”)

3. Stop-word Removal

  • Generic stop words (“the”, “is”, “and”)
  • Domain-specific financial stop words (“company”, “share”)

4. Stemming and Lemmatization

  • Stemming:
    • "running" → "run", "jumped" → "jump"
    • Algorithms: Porter, Snowball
  • Lemmatization:
    • "better" → "good", "mice" → "mouse", "is/are/was" → "be"
    • Dictionary‑based reduction

Pipeline Summary

Step | Description | Example
Normalization | Case & format standardization | "Apple" → "apple"
Tokenization | Split text into tokens | "stock price" → ["stock","price"]
Stop-word Removal | Filter uninformative words | Remove "the", "is", "and"
Stemming / Lemmatization | Reduce to base form | "running" → "run"

Common Tools: NLTK, spaCy, scikit‑learn
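
The four steps in the summary table can be sketched with the Python standard library alone. This is a toy illustration rather than a production pipeline: the stop-word list and the suffix-stripping `toy_stem` helper are hypothetical stand-ins for what NLTK or spaCy provide, and a real Porter stemmer applies many more rules.

```python
import re

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"the", "is", "and", "a", "of", "in", "to", "are"}

def normalize(text: str) -> str:
    """Strip HTML tags, lowercase, and replace non-letters with spaces."""
    text = re.sub(r"<[^>]+>", " ", text)    # remove HTML tags
    text = text.lower()                      # standardize case
    return re.sub(r"[^a-z\s]", " ", text)    # drop digits, punctuation, symbols

def tokenize(text: str) -> list[str]:
    """Whitespace tokenization (adequate for English after normalization)."""
    return text.split()

def toy_stem(token: str) -> str:
    """Crude suffix stripping; a stand-in for a Porter/Snowball stemmer."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def preprocess(text: str) -> list[str]:
    tokens = tokenize(normalize(text))
    tokens = [t for t in tokens if t not in STOP_WORDS]   # stop-word removal
    return [toy_stem(t) for t in tokens]                  # reduce to base form

print(preprocess("<p>The stock price is rising and profits jumped 12%.</p>"))
# ['stock', 'price', 'ris', 'profit', 'jump']
```

The output "ris" for "rising" shows why crude stemming can hurt readability; lemmatization would return "rise" at the cost of needing a dictionary.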


Document-Term Matrix (DTM)

Bag-of-words representation:

  • Assumes word order doesn't matter
  • Each document = vector of word counts
  • Matrix dimension: (documents × vocabulary)

Example DTM:

Document | risk | growth | profit | loss
Doc 1 | 3 | 1 | 0 | 2
Doc 2 | 0 | 4 | 2 | 0
Doc 3 | 5 | 0 | 1 | 3

Challenge: Extremely sparse and high-dimensional
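
As a minimal sketch, a document-term matrix can be built from tokenized text with `collections.Counter` and NumPy. The three toy documents below are invented for illustration and are not the documents behind the example table.

```python
from collections import Counter

import numpy as np

# Hypothetical toy corpus (already cleaned and lowercased).
docs = [
    "risk risk loss growth",
    "growth profit growth",
    "loss risk profit",
]

tokenized = [d.split() for d in docs]
vocab = sorted({tok for doc in tokenized for tok in doc})   # column order
index = {term: j for j, term in enumerate(vocab)}

# Rows = documents, columns = vocabulary terms, entries = word counts.
dtm = np.zeros((len(docs), len(vocab)), dtype=int)
for i, doc in enumerate(tokenized):
    for term, count in Counter(doc).items():
        dtm[i, index[term]] = count

sparsity = (dtm == 0).mean()   # share of zero entries
print(vocab)
print(dtm)
print(f"share of zero entries: {sparsity:.2f}")
```

With a realistic vocabulary of tens of thousands of terms, the share of zero entries is typically far higher, which is why sparse matrix formats are normally used in practice.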


TF-IDF Weighting

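Term Frequency-Inverse Document Frequency:

$$\text{tf-idf}_{t,d} = \text{tf}_{t,d} \times \log\frac{N}{\text{df}_t}$$

Where:

  • $\text{tf}_{t,d}$ = frequency of term $t$ in document $d$
  • $N$ = total number of documents
  • $\text{df}_t$ = number of documents containing term $t$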

Effect: Upweights rare, discriminative words; downweights common words
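
A short NumPy sketch of the unsmoothed weighting tf × log(N / df), applied to the example DTM from the previous slide. The fifth column, "company", is a hypothetical term added here so that one word appears in every document and receives zero weight.

```python
import numpy as np

terms = ["risk", "growth", "profit", "loss", "company"]
# First four columns: the example DTM; last column: hypothetical ubiquitous term.
counts = np.array([
    [3, 1, 0, 2, 4],
    [0, 4, 2, 0, 6],
    [5, 0, 1, 3, 5],
], dtype=float)

n_docs = counts.shape[0]
df = (counts > 0).sum(axis=0)    # number of documents containing each term
idf = np.log(n_docs / df)        # inverse document frequency
tfidf = counts * idf             # broadcast idf across rows

for term, weight in zip(terms, idf):
    print(f"{term:8s} idf = {weight:.3f}")
print(np.round(tfidf, 3))
```

Libraries often use smoothed variants; scikit-learn's `TfidfVectorizer`, for instance, adds one to the counts inside the logarithm and one to the result by default, so its numbers will differ from this sketch.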


Cosine Similarity

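Measuring document similarity:

$$\cos(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a} \cdot \mathbf{b}}{\|\mathbf{a}\| \, \|\mathbf{b}\|}$$

Properties:

  • Range: $[-1, 1]$ (and $[0, 1]$ for non-negative count or TF-IDF vectors)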
  • Aligned vectors → score = 1
  • Orthogonal vectors → score = 0
  • Opposite vectors → score = −1

Why preferred in NLP? Focuses on direction (topic), not magnitude (length)
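
The properties listed above can be checked with a few lines of NumPy. The vectors are arbitrary illustrations.

```python
import numpy as np

def cosine_similarity(a, b) -> float:
    """Dot product divided by the product of Euclidean norms."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

doc = np.array([3, 1, 0, 2])
print(cosine_similarity(doc, 10 * doc))      # same direction, ten times longer
print(cosine_similarity([1, 0], [0, 1]))     # orthogonal
print(cosine_similarity([1, 2], [-1, -2]))   # opposite
```

Scaling a document vector (for example, repeating the same text ten times) leaves the score unchanged, which is the length-invariance the slide refers to.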


Financial Sentiment Dictionaries

Domain-specific lexicons for finance:

Dictionary | Description | Example Words
Loughran-McDonald | Finance-specific sentiment | "loss", "impairment", "litigation" (−)
Harvard GI | General sentiment | "good" (+), "bad" (−)
VADER | Social media optimized | Handles emojis, slang

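Key insight: General dictionaries misclassify financial terms; for example, Harvard GI flags "liability", "tax", and "cost" as negative, although they are usually neutral in 10-K filings.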
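
A dictionary-based tone score is simple to compute once a lexicon is chosen. The sketch below uses two tiny, hypothetical word sets for illustration only; they are not the actual Loughran-McDonald lists, which contain thousands of entries.

```python
import re

# Hypothetical mini-lexicons (illustration only, not the published lists).
NEGATIVE = {"loss", "losses", "impairment", "litigation", "adverse"}
POSITIVE = {"gain", "gains", "improved", "strong", "profitable"}

def tone(text: str) -> float:
    """Net tone = (positive count - negative count) / total tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return 0.0
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    return (pos - neg) / len(tokens)

print(tone("Strong gains and improved margins offset a small impairment"))
print(tone("Total liability increased due to litigation"))
```

In the second sentence only "litigation" counts as negative; "liability" is treated as neutral, which is the kind of distinction a finance-specific lexicon is meant to capture.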

Loughran & McDonald (2011)

Reference: When Is a Liability Not a Liability? Textual Analysis, Dictionaries, and 10-Ks

Text Regression Framework

Using text features for prediction:

$$y_i = \mathbf{x}_i^\top \boldsymbol{\beta} + \varepsilon_i$$

Where:

  • $y_i$ = outcome (returns, risk, default)
  • $\mathbf{x}_i$ = document-term vector for firm/period $i$
  • High-dimensional: $p \gg n$ explanatory variables

Solutions to dimensionality:

  • Penalized regression (LASSO, Ridge, Elastic Net)
  • Random forests for variable selection
  • Sentiment aggregation using dictionaries
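
To make the penalized-regression option concrete, here is a minimal LASSO fitted by cyclic coordinate descent in NumPy, applied to simulated count-like features with more columns than observations. The data-generating process, the penalty level `lam=0.1`, and the helper names are assumptions made for this sketch; in practice a library implementation with cross-validated penalty selection would be the usual choice.

```python
import numpy as np

def soft_threshold(z, lam):
    """Shrink z toward zero by lam; values within [-lam, lam] become 0."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Minimize (1/2n)||y - Xb||^2 + lam * ||b||_1 by cyclic coordinate descent."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    resid = y - X @ beta
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0:
                continue                      # skip constant (all-zero) columns
            resid += X[:, j] * beta[j]        # remove feature j's contribution
            rho = X[:, j] @ resid / n         # partial correlation with residual
            beta[j] = soft_threshold(rho, lam) / col_sq[j]
            resid -= X[:, j] * beta[j]        # add the updated contribution back
    return beta

# Simulated data: n = 50 documents, p = 200 word-count features, 3 true signals.
rng = np.random.default_rng(0)
n, p = 50, 200
X = rng.poisson(0.3, size=(n, p)).astype(float)
X -= X.mean(axis=0)                           # center columns (no intercept)
true_beta = np.zeros(p)
true_beta[:3] = [1.0, -1.0, 0.5]
y = X @ true_beta + 0.1 * rng.standard_normal(n)

beta_hat = lasso_cd(X, y, lam=0.1)
print("non-zero coefficients:", np.count_nonzero(beta_hat), "of", p)
print("first three estimates:", np.round(beta_hat[:3], 2))
```

The L1 penalty sets most coefficients exactly to zero, which is what makes the fitted model readable as a short list of selected words.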

High‑Dimensional Text / Factor Regression: General Pipeline

  • Start from a corpus of documents and a vocabulary of tokens; pre‑process (normalization, tokenization, stop‑words, stemming/lemmatization) so each document becomes a sequence of tokens
  • Build a high‑dimensional representation:
    • Document–term matrix with counts or TF–IDF
    • Or lower‑dimensional factors from SVD, topics, or embeddings
  • Optionally compress: select columns (dictionary, sentiment index) or average token embeddings into document‑level vectors
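  • Stack text features with structured covariates to form a feature vector $X_i$ or, for causal questions, a treatment and control pair $(D_i, X_i)$
  • Estimate a predictive or causal model, e.g.
    • $y_i = f(X_i) + \varepsilon_i$ or $y_i = \theta D_i + g(X_i) + \varepsilon_i$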
    • Fit with high‑dimensional ML (penalized regression, trees, nets), evaluate out‑of‑sample, then interpret coefficients or factors.

High‑Dimensional Sparse Modeling and Cross‑Fitting

  • In text/factor regressions, the feature dimension $p$ (words, n‑grams, topics, embeddings) often exceeds sample size $n$; the document–term matrix is high‑dimensional and sparse
  • OLS is ill‑posed, so we use sparse modeling:
    • Lasso / elastic net to select a small set of informative features
    • Penalized likelihood (e.g., penalized logit) for classification
    • Regularization controls variance and overfitting when $p \gg n$
  • Dimensionality reduction acts as structured sparsity:
    • Low‑rank factor models (SVD) or topic models (LDA) factor the document–term matrix into a few latent factors
  • For causal parameters, combine ML with cross‑fitting:
    • Split data into folds; estimate nuisance components (text‑based controls, propensities, outcome models) on other folds
    • Plug predictions into orthogonal / debiased estimating equations on the held‑out fold
    • Aggregating across folds delivers valid inference in big‑data settings.
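
The cross-fitting recipe above can be sketched for a partially linear model $y = \theta d + g(x) + \varepsilon$. Everything specific here is an assumption for illustration: the simulated data, the true effect of 2.0, and the use of ridge regression as the nuisance learner (any ML method could take its place).

```python
import numpy as np

def ridge_fit(X, y, alpha=1.0):
    """Closed-form ridge coefficients (no intercept)."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(p), X.T @ y)

def cross_fit_theta(y, d, X, n_folds=5, alpha=1.0, seed=0):
    """Residual-on-residual estimate of theta with cross-fitted nuisances."""
    n = len(y)
    folds = np.random.default_rng(seed).permutation(n) % n_folds
    y_res, d_res = np.empty(n), np.empty(n)
    for k in range(n_folds):
        test, train = folds == k, folds != k
        # Nuisance models are estimated without the held-out fold.
        b_y = ridge_fit(X[train], y[train], alpha)
        b_d = ridge_fit(X[train], d[train], alpha)
        y_res[test] = y[test] - X[test] @ b_y
        d_res[test] = d[test] - X[test] @ b_d
    # Orthogonal (partialling-out) moment: regress y residuals on d residuals.
    theta = (d_res @ y_res) / (d_res @ d_res)
    eps = y_res - theta * d_res
    se = np.sqrt(np.sum(d_res ** 2 * eps ** 2)) / (d_res @ d_res)
    return theta, se

# Simulated example: X plays the role of text-based controls.
rng = np.random.default_rng(1)
n, p = 500, 50
X = rng.standard_normal((n, p))
gamma = np.zeros(p)
gamma[:5] = 0.5
d = X @ gamma + rng.standard_normal(n)             # treatment depends on controls
y = 2.0 * d + X @ gamma + rng.standard_normal(n)   # true effect is 2.0

theta_hat, se = cross_fit_theta(y, d, X)
print(f"theta_hat = {theta_hat:.3f} (se = {se:.3f})")
```

Because each observation's residuals come from models that never saw that observation, overfitting in the nuisance step does not leak into the final estimate.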

Prediction vs Causality with Big, High‑Dimensional Data

  • The pipeline is naturally predictive: turn text into features and estimate with flexible ML, assessing performance by out‑of‑sample loss (MSE, AUC, etc.)
  • Causal questions instead target structural effects (policy, treatment, latent concept). Text can serve as:
    • High‑dimensional controls for confounding
    • Proxies for latent variables (tone, ideology, uncertainty)
  • High‑dimensional ML helps approximate nuisance functions but does not create identification; assumptions still come from design (unconfoundedness, instruments, natural experiments)
  • Risks of confusing prediction and causality:
    • Highly predictive text features may capture selection or anticipation, not causal channels
    • Overfitting can generate spurious “important words”
  • A principled workflow separates:
    • ML for prediction/adjustment (controls, propensities)
    • Econometric methods for causal parameters, with orthogonal moments and cross‑fitting to reduce bias.

Case: News Sentiment and Stock Returns

Research finding:

Real-time sentiment indices from news and social media predict:

  • Intraday volatility
  • Short-term return reversals
  • Post-earnings drift

Effect strongest during market stress periods.

Trading application:

Sentiment-based strategies outperform around earnings announcements.


Limitations of One-Hot Encoding

Problems with sparse word vectors:

  1. Inefficient storage — Dimension $|V|$ (vocabulary size) per document
  2. No semantic similarity — All distinct words equidistant
  3. Curse of dimensionality — Joint probability estimation fails

Solution: Learn dense, low-dimensional word embeddings
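
The "equidistant" problem is easy to verify numerically. In this small sketch, with an arbitrary four-word vocabulary, every pair of distinct one-hot vectors has the same distance and zero dot product, so "profit" is no closer to "earnings" than to "satellite".

```python
import numpy as np

vocab = ["profit", "earnings", "loss", "satellite"]
one_hot = np.eye(len(vocab))              # row i is the one-hot vector of word i

# Pairwise Euclidean distances between all word vectors.
diff = one_hot[:, None, :] - one_hot[None, :, :]
dist = np.sqrt((diff ** 2).sum(axis=-1))

# Pairwise dot products (equal to cosine similarity, since all norms are 1).
gram = one_hot @ one_hot.T

print(np.round(dist, 3))
print(gram)
```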


Word Embeddings: Distributional Hypothesis

Core idea:

"You shall know a word by the company it keeps" — J.R. Firth

Word embeddings:

  • Dense vectors in low-dimensional space ($d \ll |V|$, typically a few hundred dimensions)
  • Similar words → similar vectors
  • Mathematical relationships ≈ linguistic meaning

Famous example: $\text{vec}(\text{king}) - \text{vec}(\text{man}) + \text{vec}(\text{woman}) \approx \text{vec}(\text{queen})$

Word2Vec: Skip-gram Model

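Predict context from target word:

$$P(w_{t+j} \mid w_t) = \frac{\exp(\mathbf{u}_{w_{t+j}}^\top \mathbf{v}_{w_t})}{\sum_{w \in V} \exp(\mathbf{u}_w^\top \mathbf{v}_{w_t})}$$

Here $\mathbf{v}$ denotes input (target) vectors and $\mathbf{u}$ output (context) vectors.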

Training objective: Maximize probability of observed context words

Context: "The [___] reported quarterly earnings"
Target: "company"

Computational trick: Negative sampling avoids full vocabulary sum
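
Skip-gram training data consists of (target, context) pairs taken from a sliding window. A minimal sketch of the pair generation for the example sentence, assuming a window of two words on each side:

```python
def skipgram_pairs(tokens, window=2):
    """Return (target, context) pairs for every position in the sentence."""
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:                       # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs

sentence = ["the", "company", "reported", "quarterly", "earnings"]
pairs = skipgram_pairs(sentence, window=2)
company_pairs = [context for target, context in pairs if target == "company"]

print(len(pairs))        # 14
print(company_pairs)     # ['the', 'reported', 'quarterly']
```

With negative sampling, each observed pair is contrasted with a handful of randomly drawn words rather than with the whole vocabulary.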


Word2Vec: CBOW Model

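Predict target from context (inverse of Skip-gram):

$$P(w_t \mid \text{context}) = \frac{\exp(\mathbf{u}_{w_t}^\top \bar{\mathbf{v}})}{\sum_{w \in V} \exp(\mathbf{u}_w^\top \bar{\mathbf{v}})}$$

Where $\bar{\mathbf{v}}$ = average of context word vectors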

Comparison:

Model | Approach | Best For
Skip-gram | Target → Context | Rare words, small data
CBOW | Context → Target | Frequent words, fast training
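
For intuition, the CBOW forward pass is an average followed by a softmax. The sketch below uses randomly initialized (untrained) vectors with an arbitrary dimension of 4, so the predicted distribution is close to uniform; training would adjust both matrices to lower the loss.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["the", "company", "reported", "quarterly", "earnings"]
dim = 4
V_in = rng.normal(scale=0.1, size=(len(vocab), dim))    # context (input) vectors
U_out = rng.normal(scale=0.1, size=(len(vocab), dim))   # target (output) vectors

def cbow_probs(context_ids):
    """Softmax over the vocabulary given the averaged context vector."""
    v_bar = V_in[context_ids].mean(axis=0)
    scores = U_out @ v_bar
    scores = scores - scores.max()           # numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()

context = [vocab.index(w) for w in ["the", "reported", "quarterly", "earnings"]]
probs = cbow_probs(context)
loss = -np.log(probs[vocab.index("company")])   # negative log-likelihood of target

print(np.round(probs, 3))
print(f"loss = {loss:.3f}")
```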

Word2Vec Example

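import matplotlib.pyplot as plt
from gensim.models import Word2Vec

# Toy training corpus of pre-tokenized Chinese sentences. The tokens are left
# untranslated so that the sample output below still corresponds to the code.
# Glossary: 机器学习 = machine learning, 人工智能 = artificial intelligence,
# 深度学习 = deep learning, 神经网络 = neural network, 数据科学 = data science,
# 技术 = technology, 发展 = development, 是 = is, 的 = of
sentences = [
    ['机器学习', '是', '人工智能', '的', '重要', '分支'],
    ['深度学习', '是', '机器学习', '的', '高级', '方法'],
    ['神经网络', '是', '深度学习', '的', '基础', '架构'],
    ['人工智能', '正在', '快速', '发展'],
    ['数据科学', '依赖', '机器学习', '技术'],
]

# Train the Word2Vec model
model = Word2Vec(
    sentences,
    vector_size=5,    # low dimension, for illustration only
    window=3,         # context window size
    min_count=1,      # keep words that appear only once
    epochs=100        # number of training passes
)

# Inspect a word vector
print("Vector for '机器学习':")
print(model.wv['机器学习'])

# Word similarity
print("\nWords most similar to '机器学习':")
similar_words = model.wv.most_similar('机器学习', topn=5)
for word, score in similar_words:
    print(f"{word}: {score}")

# Collect all words and their vectors
words = list(model.wv.key_to_index.keys())
vectors = [model.wv[word] for word in words]

# Simple 2D scatter plot using only the first two dimensions
# (rendering the Chinese labels requires a CJK-capable matplotlib font)
plt.figure(figsize=(10, 8))
x = [v[0] for v in vectors]
y = [v[1] for v in vectors]
plt.scatter(x, y)

# Label each point with its word
for i, word in enumerate(words):
    plt.annotate(word, (x[i], y[i]))

plt.title('Simple word-vector visualization')
plt.xlabel('Dimension 1')
plt.ylabel('Dimension 2')
plt.show()

# Word-vector arithmetic (analogy query)
try:
    result = model.wv.most_similar(
        positive=['人工智能', '技术'],
        negative=['机器学习']
    )
    print("\nSemantic analogy:")
    for word, score in result:
        print(f"{word}: {score}")
except KeyError:
    print("Semantic analogy may require a larger corpus")

Sample output (exact values vary across runs and library versions):

Vector for '机器学习':
[-0.01202206  0.00593786  0.10435627  0.17965294 -0.18674973]

Words most similar to '机器学习':
数据科学: 0.9527133703231812
的: 0.4597879648208618
发展: 0.33608755469322205
是: 0.21078188717365265
人工智能: 0.11742815375328064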

Mikolov et al. (2013)

Reference: Efficient Estimation of Word Representations in Vector Space

Topic Models: Latent Dirichlet Allocation

Discovering latent themes in document collections:

  • Each document = mixture of topics
  • Each topic = distribution over words
  • Unsupervised — no labeled data needed

Generative process:

  1. For each document, draw topic proportions
  2. For each word position, sample topic
  3. Sample word from topic-word distribution

Estimation: Gibbs sampling or Variational EM
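
The three-step generative process can be simulated directly, which helps clarify what estimation later tries to invert. The vocabulary, the number of topics, and the Dirichlet hyperparameters (0.5) are arbitrary choices for this sketch.

```python
import numpy as np

rng = np.random.default_rng(42)
vocab = np.array(["rate", "inflation", "policy", "profit", "revenue", "margin"])
n_topics, doc_len = 2, 12
alpha = np.full(n_topics, 0.5)     # prior over a document's topic proportions
beta = np.full(len(vocab), 0.5)    # prior over a topic's word distribution

# Each row is one topic's distribution over the vocabulary.
topic_word = rng.dirichlet(beta, size=n_topics)

def generate_document():
    theta = rng.dirichlet(alpha)                              # 1. topic proportions
    z = rng.choice(n_topics, size=doc_len, p=theta)           # 2. topic per position
    words = [rng.choice(vocab, p=topic_word[k]) for k in z]   # 3. word given topic
    return theta, z, words

theta, z, words = generate_document()
print("topic proportions:", np.round(theta, 2))
print("document:", " ".join(words))
```

Gibbs sampling or variational EM work in the opposite direction: given only the words, they infer plausible values for the topic proportions and the topic-word distributions.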


Topic Modeling Techniques for Financial Texts

Hybrid Topic Models

  • Correlated Topic Model (CTM)
    • Feature: captures correlations among topics
    • Financial use: analyzing interrelations among risk factors
  • Structural Topic Model (STM)
    • Feature: integrates metadata effects (e.g., time, source)
    • Financial use: studying topic variation across market phases
  • Neural Topic Models
    • Combine neural networks with traditional topic modeling
    • Examples: Neural-LDA, ProdLDA

Financial Applications

  • Identifying and tracking central bank policy themes
  • Analyzing risk factors in corporate annual reports
  • ESG topic evolution analysis
  • Public sentiment monitoring and investor mood quantification
  • Extracting and quantifying themes from analyst reports

BERTopic Code Example

import pandas as pd
from bertopic import BERTopic

# Load financial news data (the file name and 'content' column are placeholders)
df = pd.read_csv("financial_news.csv")
docs = df['content'].dropna().astype(str).tolist()

# Create and train a BERTopic model with a sentence-transformers embedding model
topic_model = BERTopic(embedding_model="all-MiniLM-L6-v2")
topics, probs = topic_model.fit_transform(docs)

# Retrieve topic representations
topic_info = topic_model.get_topic_info()
print(topic_info.head())          # topic sizes and labels
print(topic_model.get_topic(0))   # keywords of topic 0

Reference: Business News and Business Cycles

Case: FOMC Transparency Study

Research question: Does transparency change Fed deliberations?

Method:

  • LDA topic model on 46,169 FOMC documents
  • Natural experiment: 1993 tape revelation

Finding:

  • After transparency increased, committee members showed conformist behavior
  • Less diverse opinions relative to chairman's view

Implication: Transparency may reduce deliberation quality

Source: Hansen, McMahon & Prat, 2017

Hansen, McMahon & Prat (2017)

Reference: Transparency and Deliberation Within the FOMC: A Computational Linguistics Approach

Embedding Applications in Finance

Financial research using word embeddings:

Study | Method | Finding
Hoberg & Phillips (2016) | 10-K cosine similarity | Data-driven industry definitions
Kozlowski et al. (2019) | Cultural embeddings | Gender/class associations in text
Ash et al. (2025) | Judge embeddings | Judicial sexism measurement

Key insight: Embeddings capture latent concepts not explicit in text


Financial Text Applications Overview

Asset Pricing:

  • News sentiment → return prediction
  • Earnings call tone → post-EA drift
  • Analyst report language → recommendations

Risk Management:

  • 10-K risk factors → volatility forecasting
  • Loan applications → credit scoring
  • Social media → fraud detection

Kozlowski et al. (2019)

Reference: The Geometry of Culture: Analyzing the Meanings of Class through Word Embeddings

Ash et al. (2025)

Reference: Ideas Have Consequences: The Impact of Law and Economics on American Justice

Case: Earnings Calls and Firm Uncertainty

Building uncertainty indices from transcripts:

  1. Extract management presentation and Q&A sections from earnings call transcripts
  2. Identify uncertainty language (hedging, modal verbs)
  3. Construct firm-level uncertainty index
  4. Validate against realized volatility

Results:

  • Textual uncertainty predicts future stock volatility
  • Incremental to standard risk measures
  • Useful for options pricing and risk management

FinBERT: Domain-Specific Language Model

BERT fine-tuned on financial texts:

  • Training corpus: News, SEC filings, analyst reports
  • 5-10% accuracy improvement over general BERT

Variants:

Model | Specialization
FinBERT-tone | Sentiment analysis
FinBERT-SEC | Regulatory filings
FinBERT-ESG | ESG disclosure analysis

Note: Transformer details in subsequent lectures


Practical Challenges in Financial NLP

Key implementation issues:

  • Labeling — Expert annotation is expensive
  • Domain adaptation — General models underperform
  • Language drift — Financial terminology evolves
  • Evaluation — Ground truth often unavailable

Best practices:

  • Use domain-specific preprocessing
  • Validate on out-of-sample financial data
  • Monitor model performance over time

NLP 2.0: The LLM Revolution in Finance

From task‑specific models → general financial intelligence

Core paradigm shift:

  • Pre‑train + Fine‑tune on massive financial corpora
  • Prompt Engineering replaces feature engineering
  • Generalization across tasks without retraining
  • Retrieval‑Augmented Generation (RAG) links text + data

NLP 1.0 vs NLP 2.0

Dimension | NLP 1.0 | NLP 2.0 (LLMs)
Architecture | Bag‑of‑Words / LSTM | Transformer
Training | Supervised per task | Self‑supervised pre‑train
Input | Hand‑crafted features | Raw text + prompts
Output | Single‑task prediction | Multi‑task generation
Knowledge | Task‑specific | World + domain knowledge

Financial LLM Use Cases:

  • Regulatory analysis
    Prompt: Summarize risk factors in 10‑K filing
  • Earnings call Q&A
    Prompt: Extract bearish tone from Q3 transcript
  • ESG report drafting
    Prompt: Generate sustainability section from metrics

Examples:
BloombergGPT (2023), FinGPT (2023), DeepFinLLM 2.0 (2025)


Part 3 · Image Analytics in Finance

  • Foundations of Image Data and Computer Vision
  • Remote Sensing and Satellite Imagery
  • Document Image Analysis and OCR
  • Image-Based Risk, Fraud, and ESG Analytics

What is an Image? Data Perspective

Image as numerical array:

  • Pixels — Basic unit of image information
  • Channels — RGB (3), Grayscale (1), Multispectral (N)
  • Resolution — Width × Height in pixels; the stored array has shape Height × Width × Channels

Example: 1024×768 RGB image = 2.36 million values
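
The count in the example follows directly from the array shape. A small NumPy sketch, using an all-black placeholder image:

```python
import numpy as np

height, width, channels = 768, 1024, 3
image = np.zeros((height, width, channels), dtype=np.uint8)   # placeholder RGB image

print(image.shape)    # (768, 1024, 3)
print(image.size)     # 2359296 values, about 2.36 million

# Averaging over the channel axis gives a single-channel (grayscale-like) array.
gray = image.mean(axis=2)
print(gray.shape)     # (768, 1024)
```

Note that NumPy and most imaging libraries index arrays as (row, column, channel), so height comes before width even though resolutions are usually quoted as width × height.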

Financial image types:

  • Satellite imagery (multispectral)
  • Scanned documents (grayscale/binary)
  • Charts and graphs (RGB)

Core Computer Vision Tasks

Task | Description | Financial Application
Classification | Assign image to category | Document type identification
Detection | Locate objects in image | Car counting in parking lots
Segmentation | Pixel-level labeling | Chart region extraction
Recognition | Identify specific instances | Face verification for KYC

Convolutional Neural Networks (CNNs)

Key components:

  1. Convolutional layers — Extract local features with filters
  2. Pooling layers — Reduce spatial dimensions
  3. Activation (ReLU) — Introduce nonlinearity
  4. Fully connected — Final classification/regression

Advantages:

  • Local receptive field — Captures spatial patterns
  • Weight sharing — Parameter efficiency
  • Hierarchical features — Low → high level abstraction
  • Translation invariance — Position-independent detection
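
The first three components can be written out in plain NumPy to show what each one does. This is a didactic sketch with a hand-set 1×2 edge filter on a synthetic 6×6 image; in a real CNN the filter weights are learned and a deep-learning framework handles the computation.

```python
import numpy as np

def conv2d(image, kernel):
    """'Valid' cross-correlation, which deep-learning libraries call convolution."""
    kh, kw = kernel.shape
    out_h, out_w = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # The same kernel is applied at every location (weight sharing).
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def relu(x):
    return np.maximum(x, 0.0)

def max_pool(x, size=2):
    """Non-overlapping max pooling; trailing rows/columns that do not fit are dropped."""
    h, w = x.shape[0] // size * size, x.shape[1] // size * size
    blocks = x[:h, :w].reshape(h // size, size, w // size, size)
    return blocks.max(axis=(1, 3))

image = np.zeros((6, 6))
image[:, 3:] = 1.0                       # dark left half, bright right half
edge_kernel = np.array([[-1.0, 1.0]])    # fires where brightness increases left to right

features = max_pool(relu(conv2d(image, edge_kernel)))
print(features)
```

The pooled map is non-zero only in the column that contains the vertical edge, and it would stay the same if the edge shifted by one pixel within a pooling block, which is the sense in which pooling gives approximate translation invariance.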

CNN Architecture Intuition

Input Image → [Conv → ReLU → Pool] × N → Flatten → FC → Output

Layer progression:

Layer | Learns
Early conv | Edges, textures
Middle conv | Shapes, patterns
Late conv | Objects, scenes
FC layers | Task-specific decisions

Popular architectures: VGG, ResNet, EfficientNet, Inception


Transfer Learning for Financial Images

Leveraging pretrained models:

  1. Feature extraction — Freeze pretrained layers, train new classifier
  2. Fine-tuning — Unfreeze top layers, retrain on financial data

Why transfer learning?

  • Financial image datasets are small
  • ImageNet features generalize to many domains
  • Dramatically reduces training time and data needs

Challenge: Financial images differ from natural images

  • Solution: Gradual unfreezing of layers

Financial Image Data Types

Market & Trading:

  • K-line/candlestick charts
  • Heatmaps and treemaps
  • Order book visualizations

Documents:

  • Financial statements
  • Invoices and receipts
  • Contracts and agreements

Remote Sensing:

  • Satellite imagery
  • Aerial/drone photos
  • Night light data

Biometric & Security:

  • ID documents
  • Face images for KYC
  • Signature verification

Satellite Imagery for Economic Signals

Capturing real economic activity from space:

Data Source | Economic Indicator
Night lights (VIIRS, DMSP) | GDP, urbanization
Parking lots | Retail sales, foot traffic
Oil tank shadows | Crude inventory levels
Shipping traffic | Trade flows, supply chain
Agricultural land | Crop yields, commodity prices

Advantage: Near-real-time, global coverage that does not depend on company or government reporting


Case: Parking Lot Car Counting

Method:

  1. Acquire daily satellite images
  2. Apply object detection (cars)
  3. Aggregate counts by retailer
  4. Predict quarterly sales

Results:

  • Signals arrive 2-4 weeks early
  • Predicts earnings surprises
  • Significant abnormal returns

Trading strategy:

Hedge funds analyze 4.8M images across 67,000 retail stores.

Accurate sales prediction enables profitable positioning.

Source: Katona et al., 2025

Katona et al. (2025)

Reference: On the Capital Market Consequences of Big Data: Evidence from Outer Space

Satellite Image Analysis Pipeline

End-to-end workflow:

Image Acquisition → Tiling → Preprocessing → Feature Extraction → Aggregation

Steps:

  1. Acquisition — Commercial providers (Planet, Maxar)
  2. Tiling — Split large images into manageable patches
  3. Preprocessing — Cloud removal, normalization
  4. Feature extraction — CNN or manual features
  5. Aggregation — Temporal and spatial aggregation

Challenges: Weather effects, acquisition frequency, spatial alignment
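
The tiling and normalization steps are straightforward array operations. A minimal sketch on a random stand-in scene; the 256-pixel tile size and the choice to crop leftover edges (rather than pad them) are assumptions of this example.

```python
import numpy as np

def tile_image(image, tile=256):
    """Split an H x W x C array into non-overlapping tile x tile patches (edges cropped)."""
    h, w = image.shape[:2]
    tiles = []
    for top in range(0, h - tile + 1, tile):
        for left in range(0, w - tile + 1, tile):
            tiles.append(image[top:top + tile, left:left + tile])
    return np.stack(tiles)

# Random pixels standing in for a satellite scene.
scene = np.random.default_rng(0).integers(0, 256, size=(1000, 1300, 3), dtype=np.uint8)
patches = tile_image(scene, tile=256)
print(patches.shape)   # (15, 256, 256, 3): 3 rows x 5 columns of tiles

# Per-channel standardization across all patches.
mean = patches.mean(axis=(0, 1, 2))
std = patches.std(axis=(0, 1, 2))
normalized = (patches - mean) / std
```

Each patch can then be passed to a feature extractor, and the per-patch outputs aggregated by location and date.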


Case: Oil Tank Inventory Monitoring

Predicting crude oil inventories:

  • Floating-roof tanks cast measurable shadows
  • Shadow length indicates fill level
  • Daily imagery → continuous inventory estimates

Application:

  • Predict EIA weekly inventory reports
  • Trade crude oil futures ahead of announcements
  • Monitor geopolitical supply disruptions

Lead time: 2-3 days ahead of official data
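
The link between shadows and fill level follows from simple geometry. The sketch below is a simplified model that is not taken from the lecture's sources: it assumes a vertical tank wall, flat ground, and shadow lengths measured along the sun's direction. Under those assumptions the exterior shadow scales with the full wall height, the interior shadow (cast by the rim onto the floating roof) scales with the empty height above the roof, and their ratio gives the fill fraction without needing the sun angle.

```python
import math

def shadow_length(height, sun_elevation_deg):
    """Length of the shadow cast by a vertical edge of the given height."""
    return height / math.tan(math.radians(sun_elevation_deg))

def fill_fraction(interior_shadow, exterior_shadow):
    """Fill level as a share of tank height: 1 - interior / exterior."""
    if exterior_shadow <= 0:
        raise ValueError("exterior shadow length must be positive")
    return 1.0 - interior_shadow / exterior_shadow

# Hypothetical tank: 20 m wall, floating roof at 15 m, sun 35 degrees above horizon.
tank_height, roof_height, sun_elevation = 20.0, 15.0, 35.0
exterior = shadow_length(tank_height, sun_elevation)
interior = shadow_length(tank_height - roof_height, sun_elevation)

estimate = fill_fraction(interior, exterior)
print(f"estimated fill level: {estimate:.0%}")   # 75%
```

Multiplying the fill fraction by each tank's capacity and summing across a storage hub gives an inventory estimate that can be compared with official reports.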


Document Image Analysis Overview

Processing scanned financial documents:

Stage | Task | Methods
Acquisition | Scanning, photography | Mobile capture, bulk scanners
Preprocessing | Deskew, denoise, binarize | Image processing techniques
OCR | Text extraction | Tesseract, cloud APIs
Layout analysis | Structure understanding | Deep learning models
Field extraction | Key-value pairs | NER, template matching

OCR in Financial Operations

Optical Character Recognition applications:

  • KYC/AML — ID document verification
  • Credit underwriting — Extract income from tax forms
  • Accounts payable — Invoice processing automation
  • Audit — Contract and receipt digitization

Pipeline:

Scan → Deskew → OCR → Field Extraction → Validation → Integration

Benefits: Cost reduction, speed, error minimization
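
The field-extraction and validation stages can be illustrated on text that an OCR engine has already produced. The invoice text, field names, and regular expressions below are invented for this sketch; real documents need layout-aware extraction and far more tolerant patterns.

```python
import re
from datetime import datetime

# Hypothetical OCR output for a single invoice.
ocr_text = """
INVOICE No: INV-2024-0193
Date: 2024-03-15
Vendor: Acme Supplies Ltd
Total Due: USD 12,450.00
"""

# Template-style patterns: one capture group per field.
PATTERNS = {
    "invoice_number": r"INVOICE\s+No[:.]?\s*([A-Z0-9-]+)",
    "date": r"Date[:.]?\s*(\d{4}-\d{2}-\d{2})",
    "total": r"Total\s+Due[:.]?\s*[A-Z]{3}\s*([\d,]+\.\d{2})",
}

def extract_fields(text):
    """Return a dict of field -> matched string (None if not found)."""
    fields = {}
    for name, pattern in PATTERNS.items():
        match = re.search(pattern, text, flags=re.IGNORECASE)
        fields[name] = match.group(1) if match else None
    return fields

def validate(fields):
    """List the fields that are missing or fail a basic format check."""
    errors = [name for name, value in fields.items() if value is None]
    if fields.get("date"):
        try:
            datetime.strptime(fields["date"], "%Y-%m-%d")
        except ValueError:
            errors.append("date")
    return errors

fields = extract_fields(ocr_text)
amount = float(fields["total"].replace(",", ""))
print(fields)
print("amount:", amount, "| validation errors:", validate(fields))
```

Records with a non-empty error list would be routed to manual review instead of flowing straight into downstream systems.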


Case: Automated Loan Application Processing

SME lending automation:

  1. Upload scanned financial statements, tax documents
  2. OCR extracts text and tables
  3. NLP parses financial metrics
  4. Validation cross-checks extracted data
  5. Scoring feeds into credit model

Results:

  • Processing time: days → minutes
  • Manual review reduced by 70%
  • Error rate decreased significantly

Financial Chart Recognition

Automated extraction from chart images:

Tasks:

  • Chart type classification — Line, bar, candlestick, pie
  • Axis detection — Hough transform, edge detection
  • Data extraction — Point/bar measurement
  • Pattern recognition — Technical analysis patterns

Technical patterns detected:

  • Head and shoulders
  • Double tops/bottoms
  • Triangles, flags, wedges

ML on stock price charts:

Method:

  • Train CNN on labeled price chart images
  • Learn predictive visual patterns (not predefined)
  • Extract features that forecast returns

Key findings:

  • Patterns yield more accurate predictions than traditional factors
  • Short-term patterns work on longer scales
  • US-learned patterns effective internationally

Source: Jiang, Kelly & Xiu, JF 2023


Image-Based Fraud Detection

Visual anomaly detection in finance:

Application | Method | Target
Check fraud | Signature verification | Forged endorsements
ID verification | Face matching + liveness | Synthetic identities
Document tampering | Pixel analysis | Altered invoices
Counterfeit detection | Texture analysis | Fake documents

Models: CNN-transformer hybrids for anomaly detection


Biometric Authentication in Finance

Identity verification workflow:

  1. Capture — Photo of face and ID document
  2. Extraction — Face detection, document parsing
  3. Matching — Compare live face to ID photo
  4. Liveness — Detect presentation attacks
  5. Decision — Accept/reject/manual review

Considerations:

  • Privacy — Biometric data protection regulations
  • Bias — Demographic performance disparities
  • Explainability — Audit trail for decisions

Image-Based Property and Climate Risk

Insurance and real estate applications:

  • Property valuation — Aerial imagery for condition assessment
  • Catastrophe modeling — Pre/post disaster comparison
  • Climate risk — Flood, fire, storm exposure mapping
  • Claims processing — Automated damage assessment

Example:
Insurer uses drone imagery for post-hurricane claims.
Pre-event images enable accurate loss estimation.


Ethics in Image-Based Finance

Key ethical considerations:

  • Privacy — Consent for image collection and use
  • Surveillance — Balancing security vs. civil liberties
  • Bias — Demographic disparities in recognition accuracy
  • Transparency — Explainable AI for regulated decisions

Best practices:

  • Regular bias audits across demographics
  • Clear disclosure of AI-based decisions
  • Human oversight for high-stakes applications

Part 4 · Integration, Implementation, and Outlook

  • Combining Text, Images, and Structured Data
  • Practical Project Workflow
  • Limitations, Ethics, and Research Frontiers

Multimodal Learning in Finance

Combining multiple data modalities:

Text Features ─┐
               ├─→ Fusion Layer → Prediction
Image Features ┘

Fusion strategies:

Strategy | Description
Early fusion | Concatenate raw features
Late fusion | Combine model predictions
Attention fusion | Learn modality importance

Example: Combine news sentiment + satellite signals + fundamentals
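
Early and late fusion differ only in where the modalities are combined. The sketch below contrasts them on simulated data, with plain least squares standing in for each model; the feature dimensions, the data-generating process, and the stacking step used to learn late-fusion weights are all assumptions of this example.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
text_feat = rng.standard_normal((n, 3))     # stand-in for text features (e.g., sentiment)
image_feat = rng.standard_normal((n, 2))    # stand-in for image features (e.g., car counts)
y = text_feat[:, 0] + 0.5 * image_feat[:, 0] + 0.1 * rng.standard_normal(n)

def ols_fit(X, y):
    """Least-squares coefficients with an intercept."""
    Xb = np.column_stack([np.ones(len(X)), X])
    return np.linalg.lstsq(Xb, y, rcond=None)[0]

def ols_predict(coef, X):
    return np.column_stack([np.ones(len(X)), X]) @ coef

def mse(pred, target):
    return float(np.mean((target - pred) ** 2))

train, test = slice(0, 150), slice(150, n)

# Early fusion: concatenate features, then fit one model.
early_X = np.hstack([text_feat, image_feat])
early_pred = ols_predict(ols_fit(early_X[train], y[train]), early_X[test])

# Late fusion: one model per modality, then learn weights on their predictions.
coef_text = ols_fit(text_feat[train], y[train])
coef_image = ols_fit(image_feat[train], y[train])
stack_train = np.column_stack([ols_predict(coef_text, text_feat[train]),
                               ols_predict(coef_image, image_feat[train])])
stack_test = np.column_stack([ols_predict(coef_text, text_feat[test]),
                              ols_predict(coef_image, image_feat[test])])
late_pred = ols_predict(ols_fit(stack_train, y[train]), stack_test)

print("early fusion MSE:", round(mse(early_pred, y[test]), 4))
print("late fusion MSE: ", round(mse(late_pred, y[test]), 4))
```

Late fusion keeps each modality's pipeline independent, which simplifies maintenance when one data source is delayed or unavailable, while early fusion can capture interactions between modalities if the model is flexible enough.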


Case: Multi-Signal Equity Model

Integrated alternative data approach:

Inputs:

  • Analyst reports (text sentiment)
  • Web traffic data (consumer interest)
  • Satellite parking lot data (sales proxy)
  • Traditional fundamentals (structured)

Model:

  • Feature extraction per modality
  • Late fusion with learned weights
  • Output: Stock selection signals

Benefit: Diversified signal sources reduce model risk


Practical Project Workflow

From idea to deployment:

| Phase | Key Activities |
| --- | --- |
| 1. Problem framing | Define business question, success metrics |
| 2. Data collection | Source, clean, validate datasets |
| 3. Labeling | Expert annotation or weak supervision |
| 4. Modeling | Feature engineering, model selection |
| 5. Evaluation | Backtest, out-of-sample validation |
| 6. Deployment | Integration, monitoring, maintenance |
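
For the evaluation phase, out-of-sample validation on time-ordered data is often done with walk-forward splits, where each test window comes strictly after its training window. The generator below is a minimal sketch; its name and the window sizes are hypothetical.

```python
def walk_forward_splits(n_obs, train_size, test_size):
    """Yield (train_indices, test_indices) pairs that respect time order."""
    start = 0
    while start + train_size + test_size <= n_obs:
        train_end = start + train_size
        train_idx = list(range(start, train_end))
        test_idx = list(range(train_end, train_end + test_size))
        yield train_idx, test_idx
        start += test_size  # roll the window forward by one test block


splits = list(walk_forward_splits(n_obs=10, train_size=4, test_size=2))
for train_idx, test_idx in splits:
    print(train_idx, "->", test_idx)
# [0, 1, 2, 3] -> [4, 5]
# [2, 3, 4, 5] -> [6, 7]
# [4, 5, 6, 7] -> [8, 9]
```

Random shuffling of observations, as in ordinary cross-validation, would leak future information into training and inflate backtest results.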

Best Practices Checklist

Documentation and reproducibility:

  • [ ] Version control for code and data
  • [ ] Experiment tracking (MLflow, W&B)
  • [ ] Clear data lineage documentation
  • [ ] Model cards with performance metrics

Team collaboration:

  • Quants ↔ Domain experts ↔ Engineers ↔ Risk managers
  • Regular model review meetings
  • Clear ownership and escalation paths

Student Project Ideas

Feasible term paper / capstone projects:

| Project | Data | Methods |
| --- | --- | --- |
| News sentiment analysis | Financial news API | TF-IDF, VADER, FinBERT |
| Earnings call tone | SEC EDGAR transcripts | Sentiment, topic modeling |
| Invoice OCR system | Synthetic invoices | Tesseract + field extraction |
| Chart pattern detector | Yahoo Finance charts | CNN classification |

Tools: Python, scikit-learn, PyTorch, spaCy, Tesseract
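
As a starting point for the news sentiment project, TF-IDF can be computed from scratch in a few lines. This sketch uses the unsmoothed textbook form (term frequency times log of inverse document frequency); library implementations such as scikit-learn's apply smoothing and normalization, so their numbers will differ. The headlines are made up.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: tf-idf weight} dict per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


# Hypothetical headlines for illustration.
headlines = [
    "profit beats forecast",
    "profit misses forecast",
    "regulator fines bank",
]
vectors = tfidf(headlines)
# "beats" appears in one headline, "profit" in two, so "beats" scores higher.
print(vectors[0])
```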

Limitations of Text and Image Analytics

Key challenges:

  • Data quality — Noise, missing data, inconsistency
  • Sample selection — Survivorship bias, coverage gaps
  • Non-stationarity — Markets and language evolve
  • Model robustness — Adversarial attacks, distribution shift

Mitigation:

  • Rigorous out-of-sample testing
  • Ensemble methods for stability
  • Continuous monitoring and retraining
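
Continuous monitoring for distribution shift can be sketched with the population stability index (PSI), which compares binned feature counts from a baseline period with counts from a recent period. The function name, the bin counts, and the `eps` floor for empty bins are assumptions made for this example.

```python
import math


def population_stability_index(expected_counts, actual_counts, eps=1e-6):
    """PSI = sum over bins of (actual - expected) * ln(actual / expected)."""
    expected_total = sum(expected_counts)
    actual_total = sum(actual_counts)
    psi = 0.0
    for expected, actual in zip(expected_counts, actual_counts):
        # Floor the shares at eps so empty bins do not cause log(0).
        e_share = max(expected / expected_total, eps)
        a_share = max(actual / actual_total, eps)
        psi += (a_share - e_share) * math.log(a_share / e_share)
    return psi


# Hypothetical sentiment-score histogram: training period vs. last month.
baseline = [25, 25, 25, 25]
current = [10, 20, 30, 40]
print(round(population_stability_index(baseline, current), 4))
print(population_stability_index(baseline, baseline))  # 0.0
```

A commonly quoted rule of thumb treats PSI above roughly 0.1 as worth investigating and above roughly 0.25 as a significant shift; any threshold should be calibrated to the feature at hand.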

Responsible AI in finance:

| Issue | Consideration |
| --- | --- |
| Privacy | Data minimization, consent management |
| Fairness | Demographic parity, equal opportunity |
| Transparency | Model explainability, audit trails |
| Accountability | Clear ownership, human oversight |

Regulatory trends:

  • Increasing scrutiny of AI in lending decisions
  • Requirements for algorithmic impact assessments
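
The two fairness criteria named in the table can be made concrete for a lending model. Demographic parity compares approval rates across groups. Equal opportunity compares approval rates among applicants who would actually repay. The sketch below uses hypothetical helper names and toy data, and assumes every group has at least one applicant and one repaying applicant.

```python
def approval_rate(y_pred, groups, group):
    """Share of applicants in `group` that the model approves."""
    preds = [p for p, g in zip(y_pred, groups) if g == group]
    return sum(preds) / len(preds)


def true_positive_rate(y_true, y_pred, groups, group):
    """Approval rate among applicants in `group` who actually repay."""
    preds = [p for t, p, g in zip(y_true, y_pred, groups) if g == group and t == 1]
    return sum(preds) / len(preds)


def demographic_parity_diff(y_pred, groups, a, b):
    return approval_rate(y_pred, groups, a) - approval_rate(y_pred, groups, b)


def equal_opportunity_diff(y_true, y_pred, groups, a, b):
    return (true_positive_rate(y_true, y_pred, groups, a)
            - true_positive_rate(y_true, y_pred, groups, b))


# Hypothetical data: 1 = repaid (y_true) or approved (y_pred).
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]

print(demographic_parity_diff(y_pred, groups, "A", "B"))         # 0.5
print(equal_opportunity_diff(y_true, y_pred, groups, "A", "B"))  # 0.5
```

A value of zero means the two groups are treated alike under that criterion. The two criteria can disagree, so which one to target is a policy choice rather than a purely technical one.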

Research Frontiers

Emerging themes in financial AI:

  • Robust LLMs — Domain-adapted large language models
  • Synthetic data — Augmenting scarce financial datasets
  • Human-in-the-loop — Hybrid AI-human decision systems
  • Causal ML — Moving beyond prediction to causation
  • Multimodal foundation models — Unified text-image-tabular

Reading list: See course website for recent surveys

Summary and Key Takeaways

Core messages:

  1. Text and images are increasingly valuable for financial analysis
  2. NLP pipeline — From preprocessing to embeddings to prediction
  3. Computer vision — Satellite, documents, charts offer unique signals
  4. Integration — Multimodal approaches enhance robustness
  5. Responsibility — Ethics, bias, and governance are critical

"The future belongs to analysts who can extract insights from all data modalities."

Further Reading

Recommended resources:

Surveys on AI in Finance:

Others

Katona, Z., Painter, M., Patatoukas, P., & Zeng, J. (2025). On the Capital Market Consequences of Big Data: Evidence from Outer Space. Journal of Financial and Quantitative Analysis, 58(4), 1123‑1154.
Loughran, T., & McDonald, B. (2011). When Is a Liability Not a Liability? Textual Analysis, Dictionaries, and 10‑Ks. Journal of Finance, 66(1), 35‑65.
Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient Estimation of Word Representations in Vector Space. arXiv preprint arXiv:1301.3781.
Hansen, S., McMahon, M., & Prat, A. (2018). Transparency and Deliberation within the FOMC: A Computational Linguistics Approach. Quarterly Journal of Economics, 133(2), 801‑870.
Hoberg, G., & Phillips, G. M. (2016). Text‑Based Network Industries and Endogenous Product Differentiation. Journal of Political Economy, 124(5), 1423‑1465.
Kozlowski, A. C., Taddy, M., & Evans, J. A. (2019). The Geometry of Culture: Analyzing Meaning through Word Embeddings. American Sociological Review, 84(5), 905‑949.
Ash, E., Chen, D. L., Naidu, S., & Rhode, P. W. (2025). Ideas Have Consequences: The Impact of Law and Economics on American Justice. Quarterly Journal of Economics.

Questions and Discussion

Discussion topics:

  1. How might text/image signals be arbitraged away over time?
  2. What are the fairness implications of satellite-based trading?
  3. Should regulators require disclosure of alternative data use?
  4. How do you evaluate the quality of a sentiment indicator?
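
As one input to question 4, a common first check of a sentiment indicator is its rank information coefficient (IC): the Spearman rank correlation between the indicator today and the return over the next period. The sketch below is self-contained plain Python; the function names and the toy numbers are hypothetical.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; raises if either series is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / (var_x * var_y) ** 0.5


def rank_ic(signal, next_returns):
    """Spearman rank correlation between a signal and next-period returns."""
    return pearson(average_ranks(signal), average_ranks(next_returns))


# Hypothetical cross-section: sentiment scores and next-day returns for 5 stocks.
sentiment = [0.8, -0.2, 0.5, -0.6, 0.1]
next_day_returns = [0.02, -0.01, 0.015, -0.03, 0.005]
ic = rank_ic(sentiment, next_day_returns)
print(ic)  # close to 1.0, because the two rankings are identical
```

A single IC value says little on its own. A fuller evaluation would look at its average and variability over many periods, its decay with horizon, and whether it survives transaction costs.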

Thank You

Big Data in Finance: Text and Image Analytics


Questions welcome!

<small>Total duration: 4 hours (240 minutes)</small>

**Duration: 30 minutes**

### Financial Data Ecosystem

**Duration: 90 minutes**

^[Source: Mikolov et al., 2013]

^[Sources: Du et al., 2025 *NLP in Finance* [ref 3]; Kong et al., 2024 *Investment Management* [ref 6]; Jadhav et al., 2025 *Frontiers AI* [ref 2].]

**Duration: 90 minutes**

**Duration: 30 minutes**