Term Frequency-Inverse Document Frequency:
Where:
Effect: Upweights rare, discriminative words; downweights common words
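A minimal sketch of the weighting effect, assuming one common TF-IDF variant (term frequency normalized by document length, and idf = log(N / df)). The function name `tf_idf` and the toy documents are illustrative, not from the lecture.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumed variant: tf = count / document length, idf = log(N / df).
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

docs = [
    "the firm reported a loss".split(),
    "the firm reported a gain".split(),
    "the litigation risk increased".split(),
]
w = tf_idf(docs)
print(w[0]["the"])                   # 0.0: "the" appears in every document
print(w[0]["loss"] > w[0]["firm"])   # True: the rarer word gets more weight
```

Under this variant a word that occurs in every document gets weight zero, which is the "downweights common words" effect in its most extreme form.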
Measuring document similarity:
Properties:
Why preferred in NLP? Focuses on direction (topic), not magnitude (length)
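The direction-versus-magnitude point can be checked numerically. The sketch below assumes the measure meant here is cosine similarity; the vectors are made-up word counts.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)

short_doc = [1, 2, 0]     # toy word counts
long_doc = [10, 20, 0]    # same word mix, ten times longer
other_doc = [0, 0, 5]     # no words in common

print(cosine_similarity(short_doc, long_doc))   # approximately 1.0
print(cosine_similarity(short_doc, other_doc))  # 0.0
```

Scaling a document's counts leaves the score essentially unchanged, so a long filing and a short note on the same topic still look similar.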
Domain-specific lexicons for finance:
| Dictionary | Description | Example Words |
|---|---|---|
| Loughran-McDonald | Finance-specific sentiment | "litigation" (−); "liability" is not negative |
| Harvard GI | General sentiment | "good" (+), "bad" (−) |
| VADER | Social media optimized | Handles emojis, slang |
Key insight: General dictionaries misclassify financial terms
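A toy illustration of the misclassification problem. Both word sets below are invented for the example and are NOT the real Harvard GI or Loughran-McDonald lists; `negative_share` is a hypothetical helper.

```python
# Toy lexicons, illustrative only (not the published word lists).
GENERAL_NEGATIVE = {"bad", "loss", "tax", "cost", "capital"}
FINANCE_NEGATIVE = {"loss", "litigation", "impairment"}

def negative_share(tokens, lexicon):
    """Fraction of tokens that appear in the given negative-word lexicon."""
    if not tokens:
        return 0.0
    return sum(token in lexicon for token in tokens) / len(tokens)

sentence = "the company reported higher tax and capital cost".split()

print(negative_share(sentence, GENERAL_NEGATIVE))  # 0.375
print(negative_share(sentence, FINANCE_NEGATIVE))  # 0.0
```

The sentence is routine financial reporting, yet the general-purpose list flags three of its eight words as negative.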
Using text features for prediction:
Where:
Solutions to dimensionality:
Research finding: Real-time sentiment indices from news and social media predict:
Effect strongest during market stress periods.
Trading application: Sentiment-based strategies outperform around earnings announcements.
Problems with sparse word vectors:
Solution: Learn dense, low-dimensional word embeddings
Core idea:
Word embeddings:
Famous example:
Predict context from target word (Skip-gram):
Training objective: Maximize probability of observed context words
Computational trick: Negative sampling avoids the sum over the full vocabulary in the softmax
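A sketch of the negative-sampling loss for a single (target, context) pair, assuming the usual form: -log σ(u_context · v_target) - Σ log σ(-u_negative · v_target). The vectors are toy values and the function names are illustrative.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def negative_sampling_loss(target_vec, context_vec, negative_vecs):
    """Loss for one observed pair plus a handful of sampled 'noise' words.

    Only 1 + len(negative_vecs) dot products are needed, instead of one
    per vocabulary word as in a full softmax.
    """
    loss = -math.log(sigmoid(dot(context_vec, target_vec)))
    for neg in negative_vecs:
        loss -= math.log(sigmoid(-dot(neg, target_vec)))
    return loss

target = [1.0, 0.0]
good = negative_sampling_loss(target, [2.0, 0.0], [[-2.0, 0.0], [-1.0, 0.5]])
bad = negative_sampling_loss(target, [-2.0, 0.0], [[2.0, 0.0], [1.0, 0.5]])
print(good < bad)  # True: aligned context and distant negatives give lower loss
```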
Predict target from context (CBOW, the inverse of Skip-gram):
Where:
Comparison:
Discovering latent themes in document collections:
Generative process:
Estimation: Gibbs sampling or variational EM
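A sketch of a topic-model generative process, assuming the model meant here is latent Dirichlet allocation (LDA): draw topic proportions for the document, then for each word draw a topic and then a word from that topic. The vocabulary and topic-word probabilities are toy values.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topic_word, alpha, n_words, vocab, rng):
    """Generate one document under the assumed LDA process."""
    theta = sample_dirichlet(alpha, rng)           # document-topic proportions
    words = []
    for _ in range(n_words):
        topic = rng.choices(range(len(topic_word)), weights=theta)[0]
        word = rng.choices(vocab, weights=topic_word[topic])[0]
        words.append(word)
    return theta, words

rng = random.Random(0)
vocab = ["rate", "inflation", "earnings", "revenue"]
topic_word = [
    [0.5, 0.5, 0.0, 0.0],   # toy "monetary policy" topic
    [0.0, 0.0, 0.5, 0.5],   # toy "company results" topic
]
theta, words = generate_document(topic_word, [0.5, 0.5], 8, vocab, rng)
print(theta, words)
```

Estimation runs this process in reverse: given only the words, it infers the topic proportions and topic-word distributions.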
Hybrid Topic Models
Financial Applications
BERTopic Code Example
Research question: Does transparency change Fed deliberations?
Method:
Finding:
Implication: Transparency may reduce deliberation quality
Financial research using word embeddings:
| Study | Method | Finding |
|---|---|---|
| Hoberg & Phillips (2016) | 10-K cosine similarity | Data-driven industry definitions |
| Kozlowski et al. (2019) | Cultural embeddings | Gender/class associations in text |
| Ash et al. (2025) | Judge embeddings | Judicial sexism measurement |
Key insight: Embeddings capture latent concepts not explicit in text
Asset Pricing:
Risk Management:
Building uncertainty indices from transcripts:
Results:
BERT fine-tuned on financial texts:
Variants:
| Model | Specialization |
|---|---|
| FinBERT-tone | Sentiment analysis |
| FinBERT-SEC | Regulatory filings |
| FinBERT-ESG | ESG disclosure analysis |
Note: Transformer details in subsequent lectures
Key implementation issues:
Best practices:
From task‑specific models → general financial intelligence
Core paradigm shift:
NLP 1.0 vs NLP 2.0
Financial LLM Use Cases:
Examples:
Image as numerical array:
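A quick check of how many numbers an image holds, assuming the usual height × width × channels layout with one value per channel per pixel. `image_value_count` is an illustrative helper.

```python
def image_value_count(width, height, channels):
    """Number of numeric values in a width x height image."""
    return width * height * channels

n_values = image_value_count(1024, 768, 3)   # RGB has 3 channels
print(n_values)                              # 2359296
print(round(n_values / 1e6, 2))              # 2.36

# A tiny 2x4 RGB image as nested lists: rows -> pixels -> [R, G, B].
tiny = [[[0, 0, 0] for _ in range(4)] for _ in range(2)]
print(len(tiny), len(tiny[0]), len(tiny[0][0]))  # 2 4 3
```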
Example: 1024×768 RGB image = 2.36 million values
Financial image types:
| Task | Description | Financial Application |
|---|---|---|
| Classification | Assign image to category | Document type identification |
| Detection | Locate objects in image | Car counting in parking lots |
| Segmentation | Pixel-level labeling | Chart region extraction |
| Recognition | Identify specific instances | Face verification for KYC |
Key components:
Advantages:
Input Image → [Conv → ReLU → Pool] × N → Flatten → FC → Output
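The pipeline above can be traced by hand on a tiny image. This is a plain-Python sketch of one Conv → ReLU → Pool block followed by Flatten → FC, with a hand-picked kernel and weights; a real CNN learns these values and would use a framework such as PyTorch.

```python
def conv2d(image, kernel):
    """Valid cross-correlation (what deep learning libraries call convolution)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[i + di][j + dj] * kernel[di][dj]
                 for di in range(kh) for dj in range(kw))
             for j in range(out_w)]
            for i in range(out_h)]

def relu(fmap):
    return [[max(0.0, x) for x in row] for row in fmap]

def max_pool(fmap, size=2):
    """Non-overlapping max pooling with a size x size window."""
    return [[max(fmap[i + di][j + dj]
                 for di in range(size) for dj in range(size))
             for j in range(0, len(fmap[0]) - size + 1, size)]
            for i in range(0, len(fmap) - size + 1, size)]

def flatten(fmap):
    return [x for row in fmap for x in row]

def fully_connected(features, weights, bias):
    return sum(f * w for f, w in zip(features, weights)) + bias

# 5x5 image with a vertical edge between columns 1 and 2.
image = [[0, 0, 1, 1, 1] for _ in range(5)]
kernel = [[-1, 1], [-1, 1]]                   # hand-picked vertical-edge detector

pooled = max_pool(relu(conv2d(image, kernel)))
score = fully_connected(flatten(pooled), [0.5, 0.5, 0.5, 0.5], 0.0)
print(pooled)  # [[2, 0.0], [2, 0.0]]: the edge survives pooling
print(score)   # 2.0
```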
Layer progression:
| Layer | Learns |
|---|---|
| Early conv | Edges, textures |
| Middle conv | Shapes, patterns |
| Late conv | Objects, scenes |
| FC layers | Task-specific decisions |
Popular architectures: VGG, ResNet, EfficientNet, Inception
Leveraging pretrained models:
Why transfer learning?
Challenge: Financial images differ from natural images
Market & Trading:
Documents:
Remote Sensing:
Biometric & Security:
Capturing real economic activity from space:
| Data Source | Economic Indicator |
|---|---|
| Night lights (VIIRS, DMSP) | GDP, urbanization |
| Parking lots | Retail sales, foot traffic |
| Oil tank shadows | Crude inventory levels |
| Shipping traffic | Trade flows, supply chain |
| Agricultural land | Crop yields, commodity prices |
Advantage: Real-time, unbiased, global coverage
Method:
Results:
Trading strategy: Hedge funds analyze 4.8M images across 67,000 retail stores. Accurate sales prediction enables profitable positioning.
End-to-end workflow:
Image Acquisition → Tiling → Preprocessing → Feature Extraction → Aggregation
Steps:
Challenges: Weather effects, acquisition frequency, spatial alignment
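A toy version of the Tiling → Preprocessing → Feature Extraction → Aggregation stages, using a 4×4 grayscale grid in place of a satellite image. The brightness threshold and the "share of bright pixels" feature are assumptions made for illustration (for example, as a crude proxy for occupied parking spaces).

```python
def tile_image(image, tile_size):
    """Split a 2D grid into non-overlapping tile_size x tile_size tiles."""
    tiles = []
    for top in range(0, len(image) - tile_size + 1, tile_size):
        for left in range(0, len(image[0]) - tile_size + 1, tile_size):
            tiles.append([row[left:left + tile_size]
                          for row in image[top:top + tile_size]])
    return tiles

def preprocess(tile, threshold=128):
    """Binarize: 1 for bright pixels, 0 otherwise (threshold is assumed)."""
    return [[1 if px >= threshold else 0 for px in row] for row in tile]

def extract_feature(binary_tile):
    """Share of bright pixels in the tile."""
    flat = [px for row in binary_tile for px in row]
    return sum(flat) / len(flat)

def aggregate(features):
    """Average tile-level features into one image-level signal."""
    return sum(features) / len(features)

image = [
    [200, 200, 10, 10],
    [200, 200, 10, 10],
    [10, 10, 10, 10],
    [10, 10, 200, 10],
]
features = [extract_feature(preprocess(t)) for t in tile_image(image, 2)]
print(features)             # [1.0, 0.0, 0.0, 0.25]
print(aggregate(features))  # 0.3125
```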
Predicting crude oil inventories:
Application:
Accuracy: 2-3 day lead time over official data
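One commonly described way to turn tank shadows into an inventory estimate applies to floating-roof tanks: the shadow the wall casts outside the tank scales with the wall height, while the shadow it casts inside onto the roof scales with how far the roof has sunk. The sketch below assumes that geometry and that both shadows are measured in the same image; the function names and the tank capacity are hypothetical.

```python
def fill_fraction(interior_shadow, exterior_shadow):
    """Estimated fill level of a floating-roof tank from two shadow lengths.

    Assumes same sun angle for both shadows, so
    fill = 1 - interior_shadow / exterior_shadow.
    """
    if exterior_shadow <= 0:
        raise ValueError("exterior shadow length must be positive")
    ratio = interior_shadow / exterior_shadow
    return min(1.0, max(0.0, 1.0 - ratio))   # clamp measurement noise

def estimated_volume(fill, capacity_barrels):
    return fill * capacity_barrels

fill = fill_fraction(interior_shadow=5.0, exterior_shadow=20.0)
print(fill)                              # 0.75
print(estimated_volume(fill, 500_000))   # 375000.0 (capacity is a made-up figure)
```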
Processing scanned financial documents:
| Stage | Task | Methods |
|---|---|---|
| Acquisition | Scanning, photography | Mobile capture, bulk scanners |
| Preprocessing | Deskew, denoise, binarize | Image processing techniques |
| OCR | Text extraction | Tesseract, cloud APIs |
| Layout analysis | Structure understanding | Deep learning models |
| Field extraction | Key-value pairs | NER, template matching |
Optical Character Recognition applications:
Pipeline:
Scan → Deskew → OCR → Field Extraction → Validation → Integration
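The Field Extraction and Validation steps can be sketched on text that has already come out of OCR. The invoice layout, field names, and regular expressions below are hypothetical; real documents need per-template or learned extraction.

```python
import re
from decimal import Decimal

# Hypothetical OCR output for one invoice.
OCR_TEXT = """ACME Supplies Ltd
Invoice No: INV-2041
Date: 2024-03-15
Subtotal: 1,200.00
Tax: 96.00
Total: 1,296.00
"""

# Each pattern is anchored to the start of a line so "Total" cannot
# match inside "Subtotal".
FIELD_PATTERNS = {
    "invoice_number": r"^Invoice No:\s*(\S+)",
    "date": r"^Date:\s*(\d{4}-\d{2}-\d{2})",
    "subtotal": r"^Subtotal:\s*([\d,]+\.\d{2})",
    "tax": r"^Tax:\s*([\d,]+\.\d{2})",
    "total": r"^Total:\s*([\d,]+\.\d{2})",
}

def extract_fields(text):
    """Return {field: raw string or None} for every known pattern."""
    fields = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = re.search(pattern, text, flags=re.MULTILINE)
        fields[name] = match.group(1) if match else None
    return fields

def to_amount(raw):
    return Decimal(raw.replace(",", ""))

def validate(fields):
    """Arithmetic check: subtotal + tax must equal total."""
    required = ("subtotal", "tax", "total")
    if any(fields[name] is None for name in required):
        return False
    return (to_amount(fields["subtotal"]) + to_amount(fields["tax"])
            == to_amount(fields["total"]))

fields = extract_fields(OCR_TEXT)
print(fields["invoice_number"], fields["total"])  # INV-2041 1,296.00
print(validate(fields))                           # True
```

A failed validation is a cheap signal that OCR misread a digit, which is why the check sits before integration.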
Benefits: Cost reduction, speed, error minimization
SME lending automation:
Results:
Automated extraction from chart images:
Tasks:
Technical patterns detected:
ML on stock price charts:
Method:
Key findings:
Visual anomaly detection in finance:
| Application | Method | Target |
|---|---|---|
| Check fraud | Signature verification | Forged endorsements |
| ID verification | Face matching + liveness | Synthetic identities |
| Document tampering | Pixel analysis | Altered invoices |
| Counterfeit detection | Texture analysis | Fake documents |
Models: CNN-transformer hybrids for anomaly detection
Identity verification workflow:
Considerations:
Insurance and real estate applications:
Example:
Insurer uses drone imagery for post-hurricane claims.
Pre-event images enable accurate loss estimation.
Key ethical considerations:
Best practices:
Combining multiple data modalities:
Text Features ─┐
├─→ Fusion Layer → Prediction
Image Features ┘
Fusion strategies:
| Strategy | Description |
|---|---|
| Early fusion | Concatenate raw features |
| Late fusion | Combine model predictions |
| Attention fusion | Learn modality importance |
Example: Combine news sentiment + satellite signals + fundamentals
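The three strategies in the table differ mainly in where the combination happens. A minimal sketch with toy numbers; the attention scores are fixed here for illustration, whereas a real model would learn them.

```python
import math

def early_fusion(text_features, image_features):
    """Concatenate features so a single model sees both modalities."""
    return list(text_features) + list(image_features)

def late_fusion(predictions, weights):
    """Weighted average of predictions from separate per-modality models."""
    return sum(p * w for p, w in zip(predictions, weights)) / sum(weights)

def attention_fusion(modality_vectors, scores):
    """Softmax the scores into weights, then take a weighted sum of vectors."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(modality_vectors[0])
    return [sum(w * vec[i] for w, vec in zip(weights, modality_vectors))
            for i in range(dim)]

text_features = [0.2, 0.7]    # e.g. sentiment scores (toy values)
image_features = [1.0]        # e.g. a satellite activity index (toy value)

print(early_fusion(text_features, image_features))          # [0.2, 0.7, 1.0]
print(late_fusion([0.6, 0.8], weights=[1, 1]))
print(attention_fusion([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]))  # [0.5, 0.5]
```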
Integrated alternative data approach:
Inputs:
Model:
Benefit: Diversified signal sources reduce model risk
From idea to deployment:
| Phase | Key Activities |
|---|---|
| 1. Problem framing | Define business question, success metrics |
| 2. Data collection | Source, clean, validate datasets |
| 3. Labeling | Expert annotation or weak supervision |
| 4. Modeling | Feature engineering, model selection |
| 5. Evaluation | Backtest, out-of-sample validation |
| 6. Deployment | Integration, monitoring, maintenance |
Documentation and reproducibility:
Team collaboration:
Feasible term paper / capstone projects:
| Project | Data | Methods |
|---|---|---|
| News sentiment analysis | Financial news API | TF-IDF, VADER, FinBERT |
| Earnings call tone | SEC EDGAR transcripts | Sentiment, topic modeling |
| Invoice OCR system | Synthetic invoices | Tesseract + field extraction |
| Chart pattern detector | Yahoo Finance charts | CNN classification |
Tools: Python, scikit-learn, PyTorch, spaCy, Tesseract
Key challenges:
Mitigation:
Responsible AI in finance:
| Issue | Consideration |
|---|---|
| Privacy | Data minimization, consent management |
| Fairness | Demographic parity, equal opportunity |
| Transparency | Model explainability, audit trails |
| Accountability | Clear ownership, human oversight |
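Demographic parity from the table can be measured directly: compare approval rates across groups. A sketch on made-up loan decisions; the group labels and data are illustrative only.

```python
def demographic_parity_difference(decisions, groups):
    """Gap between the highest and lowest positive-decision rate by group."""
    by_group = {}
    for decision, group in zip(decisions, groups):
        by_group.setdefault(group, []).append(decision)
    rates = {g: sum(ds) / len(ds) for g, ds in by_group.items()}
    return max(rates.values()) - min(rates.values()), rates

decisions = [1, 1, 0, 1, 1, 0, 0, 0]            # 1 = approved (toy data)
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]

gap, rates = demographic_parity_difference(decisions, groups)
print(rates)  # {'A': 0.75, 'B': 0.25}
print(gap)    # 0.5
```

A gap of zero means equal approval rates; it says nothing about whether error rates are equal, which is what equal opportunity examines.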
Regulatory trends:
Emerging themes in financial AI:
Reading list: See course website for recent surveys
Core messages:
"The future belongs to analysts who can extract insights from all data modalities."
Recommended resources:
Surveys on AI in Finance:
Others
Katona, Z., Painter, M., Patatoukas, P., & Zeng, J. (2025). On the Capital Market Consequences of Big Data: Evidence from Outer Space. Journal of Financial and Quantitative Analysis, 58(4), 1123‑1154.
Loughran, T., & McDonald, B. (2011). When Is a Liability Not a Liability? Textual Analysis, Dictionaries, and 10‑Ks. Journal of Finance, 66(1), 35‑65.
Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient Estimation of Word Representations in Vector Space. arXiv preprint arXiv:1301.3781.
Hansen, S., McMahon, M., & Prat, A. (2018). Transparency and Deliberation within the FOMC: A Computational Linguistics Approach. Quarterly Journal of Economics, 133(2), 801‑870.
Hoberg, G., & Phillips, G. M. (2016). Text‑Based Network Industries and Endogenous Product Differentiation. Journal of Political Economy, 124(5), 1423‑1465.
Kozlowski, A. C., Taddy, M., & Evans, J. A. (2019). The Geometry of Culture: Analyzing Meaning through Word Embeddings. American Sociological Review, 84(5), 905‑949.
Ash, E., Chen, D. L., Naidu, S., & Rhode, P. W. (2025). Ideas Have Consequences: The Impact of Law and Economics on American Justice. Quarterly Journal of Economics.
Discussion topics:
Questions welcome!
<small>Total duration: 4 hours (240 minutes)</small>
**Duration: 30 minutes**
### Financial Data Ecosystem
**Duration: 90 minutes**
^[Source: Mikolov et al., 2013]
^[Sources: Du et al., 2025 *NLP in Finance* [ref 3]; Kong et al., 2024 *Investment Management* [ref 6]; Jadhav et al., 2025 *Frontiers AI* [ref 2].]
**Duration: 90 minutes**
**Duration: 30 minutes**