Lecture 04

Reinforcement Learning

“Understanding how agents learn to optimize decisions through interactions with their environment.”

Financial Machine Learning · Lecture 04

Outline

Financial Machine Learning · Lecture 04

Part 1 · Introduction to Reinforcement Learning

Motivation

  • Reinforcement Learning (RL) focuses on learning through interactions with an environment.
  • RL is well-suited for complex decision-making scenarios with delayed feedback.
  • Its applications in finance include adaptive trading, portfolio management, and risk assessment.

This introduction lays the groundwork for understanding RL's relevance in finance.

Financial Machine Learning · Lecture 04

What Is Reinforcement Learning?

Reinforcement Learning (RL) is a branch of machine learning that enables agents to learn optimal actions based on rewards from the environment.

  1. RL learns from interaction with the environment
  2. Goal: learn policy to maximize expected rewards
  3. Core elements: states, actions, rewards, transitions, discount factor
  • Unlike supervised learning, RL is driven by the trade-off between exploration and exploitation.

Reinforcement learning mimics decision-making processes in real-world scenarios, making it applicable in finance.

Financial Machine Learning · Lecture 04

Applications of Reinforcement Learning in Finance

  • Algorithmic Trading: Develop strategies to buy/sell assets based on predicted market movements.

  • Portfolio Optimization: Automatically adjust asset allocations to achieve desired return profiles.

  • Risk Management: Create adaptive systems to monitor and mitigate financial risks dynamically.

  • Mathematical formulation: the common objective is to maximize the expected (discounted) cumulative return, $\max_\pi \mathbb{E}\big[\sum_{t} \gamma^{t} r_t\big]$.

The practical use of RL showcases its capabilities in addressing financial challenges.

Financial Machine Learning · Lecture 04

RL Spectrum in Finance: From Dynamic Programming to Deep RL

  • Classical Dynamic Programming (DP) established the foundation for sequential decision-making under uncertainty.

    In finance, DP solves problems such as portfolio optimization, option pricing, or consumption–investment planning — but these models require a known transition structure and suffer from the curse of dimensionality.

  • Reinforcement Learning (RL) removes the reliance on explicit models.

    By learning from interactions or simulations, RL estimates value functions and policies directly, enabling data-driven control for trading, execution, and risk management tasks.

  • Deep Reinforcement Learning (Deep RL) merges neural networks with RL to approximate complex value or policy functions, scaling to high-dimensional features like historical returns, order-book data, or textual sentiment.

    This evolution—from theory-driven DP to data-driven Deep RL—allows automated agents to operate effectively in realistic, uncertain financial markets.

Financial Machine Learning · Lecture 04

Why This Evolution Matters

Each stage advances our capacity to represent complexity and uncertainty in financial systems:

  • DP: mathematically exact, interpretable, but computationally intractable for high-dimensional markets.
  • RL: model-free and flexible, learns directly from data but may require extensive exploration and careful reward design.
  • Deep RL: scalable and expressive, capturing nonlinear dependencies among financial variables, though prone to instability and overfitting.

Together, these methods reveal a clear trajectory — from model-based optimization to experience-based learning. This progression provides practical tools to tackle portfolio allocation, dynamic hedging, and algorithmic trading in environments where traditional models no longer suffice.

Financial Machine Learning · Lecture 04

Overview of Key Concepts in RL

  • Core Components:
    • Agent: Learner making decisions.
    • Environment: External factors affecting the agent.
    • State ($s$): Current situation of the agent.
    • Action ($a$): Decisions made by the agent.
    • Reward ($r$): Feedback received from the environment.

Understanding these elements is crucial for applying RL techniques effectively in finance.
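
To make these components concrete, the sketch below wires them together in a toy interaction loop. Everything here is hypothetical: `TradingEnv`, its `reset`/`step` interface, and the random policy are illustrative placeholders rather than a specific library.

```python
import numpy as np

class TradingEnv:
    """Toy environment: the state is a single price; the action is -1 (sell), 0 (hold), or +1 (buy)."""
    def __init__(self, n_steps=100, seed=0):
        self.n_steps = n_steps
        self.rng = np.random.default_rng(seed)

    def reset(self):
        self.t = 0
        self.price = 100.0
        return np.array([self.price])            # initial state

    def step(self, action):
        price_change = self.rng.normal(0.0, 1.0)
        reward = action * price_change           # P&L of a one-step position
        self.price += price_change
        self.t += 1
        done = self.t >= self.n_steps
        return np.array([self.price]), reward, done

env = TradingEnv()
state, done, total_reward = env.reset(), False, 0.0
while not done:
    action = np.random.choice([-1, 0, 1])        # random policy; an RL agent would learn this mapping
    state, reward, done = env.step(action)
    total_reward += reward
print(f"episode reward: {total_reward:.2f}")
```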

Financial Machine Learning · Lecture 04

Part 2 · Formulating Finance Problems as RL

Motivation for MDP in Finance

  • Many finance problems can be framed using the Markov Decision Process (MDP).
  • MDPs provide a structured way to represent states and actions under uncertainty.
  • Reinforcement learning helps find optimal strategies in complex financial situations.

This section highlights the suitability of MDPs for financial decision-making.

Financial Machine Learning · Lecture 04

Markov Decision Process (MDP)

MDPs consist of several components:

  • States ($S$): All possible situations.
  • Actions ($A$): The set of all possible choices.
  • Transition probabilities ($P(s' \mid s, a)$): Probability of moving from one state to another after an action.
  • Reward function ($R(s, a)$): The reward received after taking an action in a state.

  Component        Financial example
  States           Market state, wealth
  Actions          Allocation, trade
  Transition       Price/wealth evolution
  Reward           Profit or utility

The MDP framework is fundamental to implementing RL in financial applications.
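
As a minimal illustration of these components, the snippet below writes down a toy two-regime market MDP as plain arrays. The regimes, transition probabilities, and reward numbers are invented for illustration only.

```python
import numpy as np

# A toy two-state market MDP: states are market regimes, actions are portfolio choices.
states  = ["bull", "bear"]
actions = ["risky", "riskless"]

# P[a][s, s'] : probability of moving from regime s to s' when taking action a
# (here the action does not move the market; it only changes the reward exposure)
P = {
    "risky":    np.array([[0.8, 0.2],
                          [0.3, 0.7]]),
    "riskless": np.array([[0.8, 0.2],
                          [0.3, 0.7]]),
}

# R[a][s] : expected one-period reward (e.g. return) of action a in regime s
R = {
    "risky":    np.array([0.05, -0.03]),   # risky asset pays off in a bull market
    "riskless": np.array([0.01,  0.01]),   # cash earns a small constant rate
}

gamma = 0.95  # discount factor
```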

Financial Machine Learning · Lecture 04

Designing State and Action Spaces

  • State Space Design:

    • Incorporate essential financial metrics such as asset prices, volatility, and indicators.
  • Action Space Design:

    • Define potential actions, e.g., buy, sell, or hold.
    • Specify trading quantities or rebalancing strategies.

Careful design of these spaces is critical for the effectiveness of RL agents in finance.

Financial Machine Learning · Lecture 04

Reward Function in Financial Contexts

  • The reward function drives the learning process of the agent by evaluating the effectiveness of its actions.

    • Common formulations include return on investment and risk-adjusted returns.
  • A well-designed reward structure can guide the agents toward long-term profitability.

The formulation of the reward function is essential to align the agent's behavior with financial objectives.
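
For concreteness, here is a sketch of two common reward formulations for a trading agent: raw one-period profit net of transaction costs, and a simple risk-adjusted variant that penalizes volatility. The penalty coefficient and cost parameters are hypothetical choices.

```python
import numpy as np

def pnl_reward(position, price_change, cost_per_trade=0.0, trade_size=0.0):
    """Raw one-period profit and loss, minus proportional transaction costs."""
    return position * price_change - cost_per_trade * abs(trade_size)

def risk_adjusted_reward(recent_returns, risk_penalty=0.1):
    """Mean return penalized by volatility over a trailing window (a simple risk-adjusted formulation)."""
    r = np.asarray(recent_returns)
    return r.mean() - risk_penalty * r.std()

# example usage
print(pnl_reward(position=1, price_change=0.4, cost_per_trade=0.01, trade_size=1))
print(risk_adjusted_reward([0.01, -0.02, 0.015, 0.005]))
```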

Financial Machine Learning · Lecture 04

Mathematical Formulation

  • $S$ denotes the state space of the system. A state $s \in S$ is the information available to the controller at time $n$. Given this information, an action has to be selected.

  • $A$ denotes the action space. Given a specific state $s$ at time $n$, only a certain subclass $A_n(s) \subset A$ of actions may be admissible.

  • $Q_n(B \mid s, a)$ is a stochastic transition kernel which gives the probability that the next state at time $n+1$ lies in the set $B$ if the current state is $s$ and action $a$ is taken at time $n$.

  • $r_n(s, a)$ gives the (discounted) one-stage reward of the system at time $n$ if the current state is $s$ and action $a$ is taken.

  • $g_N(s)$ gives the (discounted) terminal reward of the system at the end of the planning horizon.

Financial Machine Learning · Lecture 04
  • A control is a sequence of decision rules $(f_0, \ldots, f_{N-1})$ with $f_n : S \to A$, where $f_n$ determines for each possible state the next action at time $n$. Such a sequence is called a policy or strategy.
  • Formally, the Markov Decision Problem is given by

    $\sup_{\pi} \; \mathbb{E}_s^{\pi} \Big[ \sum_{n=0}^{N-1} r_n(X_n, f_n(X_n)) + g_N(X_N) \Big]$

  • Types of MDP problems:
    • finite horizon ($N < \infty$) vs. infinite horizon ($N = \infty$)
    • complete state observation vs. partial state observation
    • problems with constraints vs. without constraints
    • total (discounted) cost criterion vs. average cost criterion
  • Research questions:
    • Does an optimal policy exist?
    • Does it have a particular form?
    • Can an optimal policy be computed efficiently?
    • Is it possible to derive properties of the optimal value function analytically?
Financial Machine Learning · Lecture 04

Applications

  • Consumption Problem: Suppose there is an investor with a given initial capital. At the beginning of each of $N$ periods she can decide how much of the capital she consumes and how much she invests into a risky asset. The consumed amount is evaluated by a utility function, as is the terminal wealth. The remaining capital is invested into the risky asset, where we assume that the investor is small (and thus cannot influence the asset price) and that the asset is liquid. How should she consume and invest in order to maximize the sum of her expected discounted utilities?
  • Cash Balance or Inventory Problem: Imagine a company which tries to find the optimal level of cash over a finite number of periods. We assume that there is a random change in the cash reserve each period (due to withdrawals or earnings). Since the firm does not earn interest on the cash position, there are holding costs for the cash reserve if it is positive, but also interest costs in case it is negative. The cash reserve can be increased or decreased by the management at each decision epoch, which incurs transfer costs. What is the optimal cash balance policy?
  • Mean-Variance Problem: Consider a small investor who acts on a given financial market. Her aim is to choose, among all portfolios which yield at least a certain expected return (benchmark) after $N$ periods, the one with the smallest portfolio variance. What is the optimal investment strategy?

Financial Machine Learning · Lecture 04
  • Dividend Problem in Risk Theory: Consider the risk reserve of an insurance company which earns premia on the one hand but has to pay out possible claims on the other hand. At the beginning of each period the insurer can decide upon paying a dividend. A dividend can only be paid when the risk reserve at that time point is positive. Once the risk reserve becomes negative, the company is ruined and has to stop its business. Which dividend pay-out policy maximizes the expected discounted dividends until ruin?

  • Bandit Problem: Suppose we have two slot machines with unknown success probabilities. At each stage we have to choose one of the arms. We receive one Euro if the arm wins, otherwise no cash flow appears. How should we play in order to maximize our expected total reward over $N$ trials?

  • Pricing of American Options
    In order to find the fair price of an American option and its optimal exercise time, one has to solve an optimal stopping problem. In contrast to a European option the buyer of an American option can choose to exercise any time up to and including the expiration time. Such an optimal stopping problem can be solved in the framework of Markov Decision Processes.

Financial Machine Learning · Lecture 04

Defining a Markov Decision Model

Financial Machine Learning · Lecture 04

An equivalent definition of MDP

A Markov Decision Model can equivalently be described by the data $(S, A, Z, T_n, Q_n^Z, r_n, g_N)$ with the following meaning:

  • $S$, $A$, $r_n$, $g_N$ are as in the definition on the previous slide.

  • $Z$ is the disturbance space, endowed with a $\sigma$-algebra.

  • $Q_n^Z(\cdot \mid s, a)$ is a stochastic transition kernel for $s \in S$ and $a \in A$, and $Q_n^Z(B \mid s, a)$ denotes the probability that the disturbance at time $n+1$ lies in $B$ if the current state is $s$ and action $a$ is taken.

  • $T_n : S \times A \times Z \to S$ is a measurable function and is called the transition or system function. $T_n(s, a, z)$ gives the next state of the system at time $n+1$ if at time $n$ the system is in state $s$, action $a$ is taken, and the disturbance $z$ occurs at time $n+1$.

Financial Machine Learning · Lecture 04

Example: Consumption Problem

We denote by $Z_{n+1}$ the random return of the risky asset over period $n$. Further we suppose that $Z_1, \ldots, Z_N$ are non-negative, independent random variables and that consumption is evaluated by utility functions $U_n$. The final capital is also evaluated by a utility function $U_N$. Thus we choose the following data:

  • $S = \mathbb{R}_+$, where $s$ denotes the wealth of the investor at time $n$
  • $A = \mathbb{R}_+$, where $a$ denotes the wealth which is consumed at time $n$
  • $A_n(s) = [0, s]$ for all $s$, i.e. we are not allowed to borrow money
  • $Z = \mathbb{R}_+$, where $z$ denotes the random return of the asset
  • $T_n(s, a, z) = (s - a)\, z$ is the transition function
  • $Q_n^Z$ = distribution of $Z_{n+1}$ (independent of $(s, a)$)
  • $r_n(s, a) = U_n(a)$ is the one-stage reward and $g_N(s) = U_N(s)$ is the terminal reward
Financial Machine Learning · Lecture 04
  • Decision rule & strategy
    • A measurable mapping $f_n : S \to A$ with the property $f_n(s) \in A_n(s)$ for all $s \in S$ is called a decision rule at time $n$. We denote by $F_n$ the set of all decision rules at time $n$.
    • A sequence of decision rules $\pi = (f_0, f_1, \ldots, f_{N-1})$ with $f_n \in F_n$ is called an $N$-stage policy or $N$-stage strategy.
  • Value function: for a policy $\pi = (f_0, \ldots, f_{N-1})$ define $V_{n\pi}(s) := \mathbb{E}_s^{\pi} \big[ \sum_{k=n}^{N-1} r_k(X_k, f_k(X_k)) + g_N(X_N) \big]$ and $V_n(s) := \sup_{\pi} V_{n\pi}(s)$.

  • Theorem: For it holds:

Financial Machine Learning · Lecture 04

Finite Horizon Markov Decision Models

Integrability Assumption: for every stage $n$ and every policy, the expected sum of the positive parts of the remaining rewards is finite.

This assumption is assumed to hold for the $N$-stage Markov Decision Problems considered below.

Financial Machine Learning · Lecture 04

Example: (Consumption Problem) In the consumption problem Assumption () is satisfied if we assume that the utility functions are increasing and concave and for all , because then and can be bounded by an affine-linear function with and since a.s. under every policy, the function satisfies

  • $V_{n\pi}(s)$ is the expected total reward at time $n$ over the remaining stages $n$ to $N$ if we use policy $\pi$ and start in state $s$ at time $n$.

  • The value function $V_n(s)$ is the maximal expected total reward at time $n$ over the remaining stages $n$ to $N$ if we start in state $s$ at time $n$.

  • The functions and are well-defined since:
Financial Machine Learning · Lecture 04

The Bellman Equation

Let us denote by $\mathbb{M}(S)$ the set of measurable functions $v : S \to \mathbb{R}$; we define the following operators:

  • $L_n$: for $v \in \mathbb{M}(S)$ define $(L_n v)(s, a) := r_n(s, a) + \int v(s')\, Q_n(\mathrm{d}s' \mid s, a)$ (whenever the integral exists)

  • $T_{n f}$: for $v \in \mathbb{M}(S)$ and $f \in F_n$ define $(T_{n f}\, v)(s) := (L_n v)(s, f(s))$

  • $T_n$: for $v \in \mathbb{M}(S)$ define $(T_n v)(s) := \sup_{a \in A_n(s)} (L_n v)(s, a)$ (the maximal reward operator at time $n$)

Theorem (Reward Iteration): Let $\pi = (f_0, \ldots, f_{N-1})$ be an $N$-stage policy. For $n = 0, 1, \ldots, N-1$ it holds:

  • $V_{N\pi} = g_N$ and $V_{n\pi} = T_{n f_n} V_{n+1, \pi}$,
  • $V_{0\pi} = T_{0 f_0} T_{1 f_1} \cdots T_{N-1\, f_{N-1}}\, g_N$.
Financial Machine Learning · Lecture 04

Example: (Consumption Problem) Note that for the operator in this example reads

Now let us assume that for all and . Moreover, we assume that the return distribution is independent of and has finite expectation . Then () is satisfied as we have shown before. If we choose the -stage policy with and , i.e. we always consume a constant fraction of the wealth, then the Reward Iteration implies by induction on that

Hence, with and maximizes the expected log-utility (among all linear consumption policies).

Financial Machine Learning · Lecture 04

Maximizer, the Bellman Equation & Verification Theorem

  • Definition of a maximizer: Let $v \in \mathbb{M}(S)$. A decision rule $f_n \in F_n$ is called a maximizer of $v$ at time $n$ if $T_{n f_n} v = T_n v$, i.e. for all $s \in S$, $f_n(s)$ is a maximum point of the mapping $a \mapsto (L_n v)(s, a)$.

  • The Bellman Equation: $V_N = g_N$ and $V_n = T_n V_{n+1}$ for $n = N-1, \ldots, 0$.

  • Verification Theorem: Let $(v_n)$ be a solution of the Bellman equation. Then it holds:

    • $v_n \ge V_n$ for $n = 0, \ldots, N$
    • If $f_n^*$ is a maximizer of $v_{n+1}$ for $n = 0, \ldots, N-1$, then $v_n = V_n$ and the policy $\pi^* = (f_0^*, \ldots, f_{N-1}^*)$ is optimal for the $N$-stage Markov Decision Problem.
Financial Machine Learning · Lecture 04

The Structure Assumption & Structure Theorem

  • Structure Assumption: There exist sets $\mathbb{M}_n \subset \mathbb{M}(S)$ of value functions and sets $\Delta_n \subset F_n$ of decision rules such that for all $n$:
    • If then is well-defined and
    • For all there exists a maximizer of with
  • Structure Theorem: Let the Structure Assumption be satisfied. Then it holds that $V_n \in \mathbb{M}_n$, and the sequence $(V_n)$ satisfies the Bellman equation, i.e. for $n = N-1, \ldots, 0$:

    • For $n = 0, \ldots, N-1$ there exist maximizers $f_n$ of $V_{n+1}$ with $f_n \in \Delta_n$, and every sequence of such maximizers defines an optimal policy for the $N$-stage Markov Decision Problem
  • A corollary: Let () be satisfied. If then it holds:

Financial Machine Learning · Lecture 04
  • Principle of Dynamic Programming: Let the Structure Assumption be satisfied. Then it holds:

    i.e. if is optimal for the time period then is optimal for .
Financial Machine Learning · Lecture 04

Value iteration vs policy iteration
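
The comparison can be made concrete with a short sketch. The code below implements both procedures for a generic finite MDP specified by a transition array `P[s, a, s']` and a reward array `R[s, a]`; the toy MDP at the bottom uses invented numbers.

```python
import numpy as np

def value_iteration(P, R, gamma=0.95, tol=1e-8):
    """Iterate the Bellman optimality operator until the value function converges."""
    n_states, n_actions = R.shape
    V = np.zeros(n_states)
    while True:
        Q = R + gamma * P @ V          # Q[s, a] = R[s, a] + gamma * sum_s' P[s, a, s'] V[s']
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)
        V = V_new

def policy_iteration(P, R, gamma=0.95):
    """Alternate exact policy evaluation (a linear solve) with greedy policy improvement."""
    n_states, n_actions = R.shape
    policy = np.zeros(n_states, dtype=int)
    while True:
        # policy evaluation: solve (I - gamma * P_pi) V = R_pi
        P_pi = P[np.arange(n_states), policy]          # (n_states, n_states)
        R_pi = R[np.arange(n_states), policy]
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
        # policy improvement: act greedily with respect to the evaluated value function
        Q = R + gamma * P @ V
        new_policy = Q.argmax(axis=1)
        if np.array_equal(new_policy, policy):
            return V, policy
        policy = new_policy

# toy two-state, two-action MDP (illustrative numbers)
P = np.array([[[0.9, 0.1], [0.5, 0.5]],
              [[0.4, 0.6], [0.1, 0.9]]])   # P[s, a, s']
R = np.array([[1.0, 0.5],
              [0.0, 2.0]])                 # R[s, a]
print(value_iteration(P, R)[1], policy_iteration(P, R)[1])
```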

Financial Machine Learning · Lecture 04

Modeling The Financial Markets with MDP

  • Asset Dynamics and Portfolio Strategies: We assume that asset prices are monitored in discrete time

    • time is divided into periods of length and
    • multiplicative model for asset prices:
    • The binomial model (Cox-Ross-Rubinstein) and the discretized Black-Scholes-Merton model are two important special cases of the multiplicative model
  • An N-period financial market with risky assets and one riskless bond

    • A riskless bond with and

    • There are risky assets and the price process of asset is given by and

Financial Machine Learning · Lecture 04
  • A portfolio or a trading strategy is an ()-adapted stochastic process where and for . The quantity denotes the amount of money which is invested into asset during the time interval .
  • The vector is called the initial portfolio of the investor. The value of the initial portfolio is given by

  • Let be a portfolio strategy and denote by the value of the portfolio at time before trading. Then

  • The value of the portfolio at time after trading is given by

Financial Machine Learning · Lecture 04
  • Self-financing: A portfolio strategy is called self-financing if

    for all , i.e. the current wealth is just reassigned to the assets.

  • Arbitrage opportunity: An arbitrage opportunity is a self-financing portfolio strategy with the following property: and

  • A theorem: Consider an -period financial market. The following two statements are equivalent:

    • There are no arbitrage opportunities.
    • For and for all -measurable it holds:

Financial Machine Learning · Lecture 04

Summary: Modeling and Solution Approaches with MDP

  • Modeling approach: specify the elements of the MDP

    • state space , action space
    • transition function:
    • value function
    • the Bellman equation:
  • Solution approaches

    • backward induction (with the Bellman equation)
    • the existence and form of optimal policy
    • we are interested in structural properties that are preserved by the backward induction
Financial Machine Learning · Lecture 04

MDP Applications in Finance: A Cash Balance Problem

  • involves the decision about the optimal cash level of a firm over a finite number of periods, with a random change in the cash reserve each period (which can be positive or negative).
  • costs
    • holding cost or opportunity cost for the cash reserve if it is positive
    • out-of-pocket expense
  • The cash reserve can be increased or decreased by the management at the beginning of each period
    • transfer costs

    • random changes in the cash flow are given by independent and identically distributed random variables $(Z_n)$ with finite expectation
  • The following costs have to be paid at the beginning of a period for cash level $s$:

Financial Machine Learning · Lecture 04

Problem Formulation

  • Elements of MDP
    • where denotes the cash level,
    • where denotes the new cash level after transfer,
    • ,
    • where denotes the cash change,
    • ,
    • ,
    • ,
Financial Machine Learning · Lecture 04
  • The state transition function:

  • The value function

  • The Bellman equation

Financial Machine Learning · Lecture 04

Solution

  • Solution is worked through by backward induction
  • We verify the Integrability Assumption and the corresponding Structure Assumption at each stage
  • The solution to the cash balance problem (Theorem 2.6.2):
    • There exist critical levels and such that for

      with .
    • There exist critical levels and such that for

Financial Machine Learning · Lecture 04

MDP Applications in Finance: Consumption and Investment Problems

The investor has an initial wealth, and at the beginning of each of $N$ periods she can decide how much of the wealth she consumes and how much she invests into the financial market.

The amount which is consumed at time is evaluated by a utility function . The remaining wealth is invested in the risky assets and the riskless bond, and the terminal wealth yields another utility .

How should the agent consume and invest in order to maximize the sum of her expected utilities?

Financial Machine Learning · Lecture 04

Formulation

  • Assumption (FM)
    • There is no arbitrage opportunity.
    • for all .
  • The utility functions and satisfy .
  • The wealth process evolves as follows

    where the consumption and investment processes form a consumption-investment strategy, i.e. both processes are adapted to the filtration.
  • The consumption-investment problem is then given by

Financial Machine Learning · Lecture 04
  • Elements of MDP
    • where denotes the wealth,
    • where is the amount of money invested in the risky assets and the amount which is consumed,
    • is given by

    • where denotes the relative risk
    • ,
    • .
  • The value function

Financial Machine Learning · Lecture 04

Solution

  • Let and be utility functions with . Then it holds:
    • There are no arbitrage opportunities if and only if there exists a measurable function such that

    • is strictly increasing, strictly concave and continuous on .
  • For the multiperiod consumption-investment problem it holds:
    • The value functions are strictly increasing, strictly concave and continuous.
    • The value functions can be computed recursively by the Bellman equation

    • There exist maximizers of and the strategy is optimal for the -stage consumption-investment problem.
Financial Machine Learning · Lecture 04

MDP Applications in Finance: Terminal Wealth Problems

Suppose we have an investor with a utility function and initial wealth. A financial market with risky assets and one riskless bond is given. Here we assume that the random return vectors are independent but not necessarily identically distributed. Moreover, we assume that the filtration is generated by the stock prices. We make the (FM) assumption for the financial market.

Our agent has to invest all the money into this market and is allowed to rearrange her portfolio over stages. The aim is to maximize the expected utility of her terminal wealth.

Financial Machine Learning · Lecture 04

Formulation

  • The wealth process () evolves as follows

    where is a portfolio strategy. The optimization problem is then

  • Elements of the MDP
    • where denotes the wealth,
    • where is the amount of money invested in the risky assets,
    • ,
    • where denotes the relative risk,
    • ,
    • ,
    • , and .

Financial Machine Learning · Lecture 04

Solution

For the multiperiod terminal wealth problem it holds:

  • The value functions are strictly increasing, strictly concave and continuous.

  • The value functions can be computed recursively by the Bellman equation

  • There exist maximizers of and the strategy is optimal for the -stage terminal wealth problem.

Financial Machine Learning · Lecture 04

MDP Applications in Finance: Portfolio Selection with Transaction Costs

We consider now the utility maximization problem under proportional transaction costs. For the sake of simplicity we restrict to one bond and one risky asset. If an additional amount of (positive or negative) is invested in the stock, then proportional transaction costs of are incurred which are paid from the bond position. We assume that . In order to compute the transaction costs, not only is the total wealth interesting, but also the allocation between stock and bond matters. Thus, in contrast to the portfolio optimization problems so far, the state space of the Markov Decision Model is two-dimensional and consists of the amounts held in the bond and in the stock.

Financial Machine Learning · Lecture 04

Formulation

  • Elements of the MDP
    • where denotes the amount invested in bond and stock,
    • where denotes the amount invested in bond and stock after transaction,
    • ,
    • where denotes the relative price change of the stock,
    • ,
    • ,
    • ,
    • .
Financial Machine Learning · Lecture 04

MDP Applications in Finance: Dynamic Mean-Variance Problems

We use the same non-stationary financial market as for the Terminal Wealth Problems with independent relative risk variables. Our investor has initial wealth . This wealth can be invested into risky assets and one riskless bond. How should the agent invest over periods in order to find a portfolio with minimal variance which yields at least an expected return of ?

Financial Machine Learning · Lecture 04

Formulation

  • Elements of the MDP
    • where denotes the wealth,
    • where is the amount of money invested in the risky assets,
    • ,
    • where denotes the relative risk,
    • ,
    • distribution of (independent of ).
  • The original formulation (MV)

  • An equivalent formulation (MV)

Financial Machine Learning · Lecture 04
  • Assumption (FM):

    • and for all .
    • The covariance matrix of the relative risk process

      is positive definite for all .
  • Problem (MV) can be solved via the well-known Lagrange multiplier technique. Let $L$ denote the Lagrange function, i.e.

  • The Lagrange-problem for the parameter

Financial Machine Learning · Lecture 04
  • A stochastic LQ-problem

    If is optimal for ,then is optimal for with .

  • Elements of the MDP

    • ,
Financial Machine Learning · Lecture 04
  • Solution: For the mean-variance problem (MV) it holds:

    • The value of is given by

      where is given in (4.34). Note that .
    • The optimal portfolio strategy is given by

  • Dynamic Mean-Risk Problems

Financial Machine Learning · Lecture 04

MDP Applications in Finance: Index Tracking

Suppose we have a financial market with one bond and d risky assets. Besides the tradeable assets there is a non-tradable asset whose price process evolves according to

The positive random variable which is the relative price change of the non-traded asset may be correlated with . It is assumed that the random vectors are independent and the joint distribution of is given.

The aim now is to track the non-traded asset as closely as possible by investing into the financial market. The tracking error is measured in terms of the quadratic distance of the portfolio wealth to the price process , i.e. the optimization problem is then

where is -measurable.

Financial Machine Learning · Lecture 04

Formulation

  • Elements of the MDP
    • where and is the wealth and the value of the non-traded asset,
    • where is the amount of money invested in the risky assets,
    • ,
    • where and is the relative risk of the traded assets and is the relative price change of the non-traded asset.
    • The transition function is given by

    • joint distribution of (independent of ),
    • .
  • Value function (cost-to-go)

Financial Machine Learning · Lecture 04

Part 3 · Reinforcement Learning Algorithms

Value-based vs. Policy-based Methods

  • Value-based Methods:

    • Focus on learning a value function to make decisions (e.g., Q-learning).
  • Policy-based Methods:

    • Directly optimize the action policy that defines the agent’s behavior (e.g., REINFORCE).

These methods offer different strengths and suit various financial contexts.

Financial Machine Learning · Lecture 04

Q-Learning Overview

  • Q-learning is an off-policy value-based algorithm:
    • Learns the value of actions without requiring a model of the environment.
    • Updates value estimates iteratively based on experiences.

Q-learning is a versatile algorithm applicable in various trading strategy developments.
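
A minimal tabular Q-learning sketch, assuming a small environment with integer-valued states and a `reset`/`step` interface like the toy environment shown earlier; the learning rate, discount factor, and exploration rate are illustrative hyperparameters.

```python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.99, epsilon=0.1, seed=0):
    """Off-policy TD control: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            # epsilon-greedy behavior policy
            if rng.random() < epsilon:
                action = int(rng.integers(n_actions))
            else:
                action = int(Q[state].argmax())
            next_state, reward, done = env.step(action)
            # Q-learning target uses the greedy value of the next state (off-policy)
            target = reward + (0.0 if done else gamma * Q[next_state].max())
            Q[state, action] += alpha * (target - Q[state, action])
            state = next_state
    return Q
```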

Financial Machine Learning · Lecture 04

Deep Q-Networks (DQN)

  • DQN employs deep learning to approximate the Q-value function:
    • Overcomes the limitations of traditional Q-learning in high-dimensional environments.
    • Enables learning from raw market data such as price movements.

DQN has shown significant success in complex financial applications, improving trading strategies.
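
The core DQN update can be sketched in a few lines of PyTorch: an online Q-network, a slowly refreshed target network, and a mean-squared TD loss over a minibatch of transitions. Network sizes and hyperparameters are illustrative, and the replay buffer is reduced to one randomly generated batch for brevity.

```python
import torch
import torch.nn as nn

state_dim, n_actions, gamma = 8, 3, 0.99

q_net      = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net.load_state_dict(q_net.state_dict())   # target starts as a copy of the online network
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

def dqn_update(batch):
    """One gradient step on the TD error for a minibatch (s, a, r, s', done) from a replay buffer."""
    s, a, r, s_next, done = batch
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)        # Q(s, a) for the taken actions
    with torch.no_grad():
        q_next = target_net(s_next).max(dim=1).values           # max_a' Q_target(s', a')
        target = r + gamma * (1.0 - done) * q_next
    loss = nn.functional.mse_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# illustrative random minibatch of 32 transitions
batch = (torch.randn(32, state_dim), torch.randint(0, n_actions, (32,)),
         torch.randn(32), torch.randn(32, state_dim), torch.zeros(32))
dqn_update(batch)
```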

Financial Machine Learning · Lecture 04

Policy Gradient Methods

  • Policy gradient methods optimize the actual policy:
    • Use gradient ascent on expected returns to improve decisions directly (e.g., REINFORCE).
  • Particularly effective in scenarios with continuous action spaces like trading.

These methods provide more flexibility and adaptability to various financial strategies.
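
A minimal REINFORCE sketch in PyTorch, assuming an episodic environment with the `reset`/`step` interface used earlier and a softmax policy over a discrete action set; the hyperparameters and the return normalization are illustrative choices.

```python
import torch
import torch.nn as nn

def reinforce_episode(env, policy_net, optimizer, gamma=0.99):
    """Run one episode, then ascend the REINFORCE gradient: sum_t grad log pi(a_t|s_t) * G_t."""
    log_probs, rewards = [], []
    state, done = env.reset(), False
    while not done:
        logits = policy_net(torch.as_tensor(state, dtype=torch.float32))
        dist = torch.distributions.Categorical(logits=logits)
        action = dist.sample()
        log_probs.append(dist.log_prob(action))
        state, reward, done = env.step(int(action))
        rewards.append(reward)

    # discounted returns-to-go G_t, computed backwards through the episode
    returns, G = [], 0.0
    for r in reversed(rewards):
        G = r + gamma * G
        returns.append(G)
    returns = torch.as_tensor(list(reversed(returns)), dtype=torch.float32)
    returns = (returns - returns.mean()) / (returns.std() + 1e-8)  # normalization to reduce variance

    loss = -(torch.stack(log_probs) * returns).sum()   # minimizing this ascends expected return
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(sum(rewards))

# example setup (dimensions are illustrative):
# policy_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 3))
# optimizer  = torch.optim.Adam(policy_net.parameters(), lr=1e-3)
```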

Financial Machine Learning · Lecture 04

Part 4 · Deep Reinforcement Learning and Applications

Combining RL with Deep Learning

  • Deep reinforcement learning blends neural networks with RL principles:

    • Capable of processing high-dimensional inputs like market data and images.
  • This fusion significantly enhances the performance of RL in complex environments.

Deep RL methods are revolutionizing the way financial decisions are automated.

Financial Machine Learning · Lecture 04

Application: Optimal Execution

  • the problem
    • a trader who wishes to buy or sell a given amount of a single asset within a given time period
    • the trader seeks strategies that maximize their return from, or alternatively, minimize the cost of, the execution of the transaction
  • The Almgren-Chriss Model
    • a trader is required to sell an amount of an asset, with a given price at time 0, over the time period, with trading decisions made at discrete time points
    • The final inventory is required to be zero
    • the goal is to determine the liquidation strategy
    • two types of price impact
      • a temporary impact which refers to any temporary price movement due to the supply-demand imbalance caused by the selling
      • a permanent impact, which is a long-term effect on the 'equilibrium' price due to the trading
Financial Machine Learning · Lecture 04
  • The Almgren-Chriss Model
    • asset price dynamics

      • is the (constant) volatility parameter
      • are independent random variables drawn from a distribution with zero mean and unit variance
      • is a function of the trading strategy that measures the permanent impact
    • The inventory process

    • the actual price per share received considering the temporary price impact

Financial Machine Learning · Lecture 04
  • The Almgren-Chriss Model
    • The cost of this trading trajectory
      • the difference between the initial book value and the revenue, given by
      • mean and variance

    • the trader's objective function

Financial Machine Learning · Lecture 04
  • Solution
    • linear price impacts

    • the general solution for the Almgren-Chriss model

      with

    • The corresponding optimal inventory trajectory is given in closed form (see the sketch below)
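
Under linear price impact the Almgren-Chriss schedule has a well-known closed form in terms of a decay rate kappa. Since the equations above did not survive extraction, the sketch below follows the standard published solution and uses the continuous-time approximation for kappa; all parameter values are illustrative.

```python
import numpy as np

def almgren_chriss_inventory(X, T, N, sigma, eta, lam):
    """Optimal inventory path x_j = X * sinh(kappa*(T - t_j)) / sinh(kappa*T) under linear impact.

    X      : initial number of shares to liquidate
    T, N   : trading horizon and number of trading intervals
    sigma  : volatility of the asset price
    eta    : temporary impact coefficient
    lam    : risk-aversion parameter (lambda)
    Uses the continuous-time approximation kappa = sqrt(lam * sigma**2 / eta).
    """
    kappa = np.sqrt(lam * sigma**2 / eta)
    t = np.linspace(0.0, T, N + 1)
    inventory = X * np.sinh(kappa * (T - t)) / np.sinh(kappa * T)
    trades = -np.diff(inventory)          # shares sold in each interval
    return inventory, trades

inventory, trades = almgren_chriss_inventory(X=1e6, T=1.0, N=10, sigma=0.3, eta=2.5e-6, lam=1e-6)
print(np.round(inventory).astype(int))
```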

Financial Machine Learning · Lecture 04
  • Evaluation Criteria

    • the Profit and Loss (PnL): the final profit or loss induced by a given execution algorithm over the whole time period, which is made up of transactions at all time points
    • the Implementation Shortfall: the difference between the PnL of the algorithm and the PnL received by trading the entire amount of the asset instantly
    • the Sharpe ratio: the ratio of expected return to standard deviation of the return
  • Benchmark Algorithms

    • Time-Weighted Average Price (TWAP)
    • Volume-Weighted Average Price (VWAP)
    • Submit and Leave (SnL) policy
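
A sketch of the evaluation quantities and the TWAP benchmark listed above, computed for a given execution schedule and realized price path; the arrival price and the toy numbers are assumed inputs.

```python
import numpy as np

def execution_pnl(prices, shares_sold):
    """Revenue of an execution schedule: sum over intervals of shares sold times the realized price."""
    return float(np.dot(prices, shares_sold))

def implementation_shortfall(prices, shares_sold, arrival_price):
    """Difference between the value at the arrival price and the revenue actually received."""
    return arrival_price * np.sum(shares_sold) - execution_pnl(prices, shares_sold)

def sharpe_ratio(returns):
    """Ratio of mean return to standard deviation of returns (per period, not annualized)."""
    r = np.asarray(returns)
    return float(r.mean() / (r.std() + 1e-12))

def twap_schedule(total_shares, n_intervals):
    """Time-weighted average price benchmark: trade the same quantity in every interval."""
    return np.full(n_intervals, total_shares / n_intervals)

# illustrative comparison against a TWAP schedule
prices = np.array([100.0, 99.8, 99.9, 99.7, 99.6])
schedule = twap_schedule(total_shares=10_000, n_intervals=5)
print(implementation_shortfall(prices, schedule, arrival_price=100.0))
```
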
Financial Machine Learning · Lecture 04
  • RL Approach
    • RL methods
      • Q-learning algorithms
      • (double) DQN
      • Policy-based algorithms: (deep) policy gradient methods (e.g., A2C, PPO, and DDPG)
    • states: time stamp, the market attributes including (mid-) price of the asset and/or the spread, the inventory process and past returns
    • controls: the amount of asset (using market orders) to trade and/or the relative price level (using limit orders) at each time point
    • rewards
      • cash inflow or outflow (depending on whether we sell or buy)
      • implementation shortfall
      • profit, Sharpe ratio, and return
Financial Machine Learning · Lecture 04

Application: Multi-Period Mean-Variance Portfolio Optimization

  • setting
    • risky assets in the market
    • an investor enters the market at time 0 with initial wealth
    • reallocate his wealth at each time point among the assets to achieve the optimal trade off between the return and the risk of the investment
    • the random rates of return of the assets at :
      • The vectors , are assumed to be statistically independent
      • mean:
      • standard deviation: for and
      • The covariance matrix:
    • the wealth of the investor at time :
    • the amount invested in the -th asset at time :
    • the amount invested in the -th asset:
    • investment strategy: for

Financial Machine Learning · Lecture 04
  • the model
    • objective

    • constraints

  • the solution

Financial Machine Learning · Lecture 04
  • RL Approach
    • RL methods
      • value-based methods (Q-learning, SARSA, and DQN)
      • policy-based algorithms (DPG and DDPG)
    • states: time, asset prices, asset past returns, current holdings of assets, and remaining balance
    • controls: the amount/proportion of wealth invested in each component of the portfolio
    • rewards
      • portfolio return
      • (differential) Sharpe ratio
      • profit
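
As one concrete choice from the reward list above, the sketch below computes a one-period portfolio return net of proportional transaction costs on the rebalanced amount; the cost rate is a hypothetical value.

```python
import numpy as np

def portfolio_reward(weights, prev_weights, asset_returns, cost_rate=0.001):
    """One-period portfolio return minus proportional transaction costs on the turnover."""
    gross_return = float(np.dot(weights, asset_returns))
    turnover = float(np.abs(weights - prev_weights).sum())
    return gross_return - cost_rate * turnover

# rebalance from equal weights toward the first asset
print(portfolio_reward(weights=np.array([0.6, 0.2, 0.2]),
                       prev_weights=np.array([1/3, 1/3, 1/3]),
                       asset_returns=np.array([0.02, -0.01, 0.005])))
```
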
Financial Machine Learning · Lecture 04

Application: Option Pricing and Hedging

  • The Black-Scholes Model
    • The underlying stock price follows a geometric Brownian motion: $dS_t = \mu S_t\, dt + \sigma S_t\, dW_t$

    • the Black-Scholes-Merton partial differential equation: $\frac{\partial V}{\partial t} + \tfrac{1}{2}\sigma^2 S^2 \frac{\partial^2 V}{\partial S^2} + r S \frac{\partial V}{\partial S} - r V = 0$

    • solution: the Black-Scholes formula for a European option (see the sketch below)
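
The closed-form solution referenced above is the Black-Scholes price of a European call; the sketch below implements the textbook formula together with the delta used for hedging.

```python
from math import log, sqrt, exp
from statistics import NormalDist

def black_scholes_call(S, K, T, r, sigma):
    """European call price under Black-Scholes: C = S*N(d1) - K*exp(-rT)*N(d2)."""
    d1 = (log(S / K) + (r + 0.5 * sigma**2) * T) / (sigma * sqrt(T))
    d2 = d1 - sigma * sqrt(T)
    N = NormalDist().cdf
    return S * N(d1) - K * exp(-r * T) * N(d2)

def black_scholes_delta(S, K, T, r, sigma):
    """Hedge ratio (delta) of the call: N(d1)."""
    d1 = (log(S / K) + (r + 0.5 * sigma**2) * T) / (sigma * sqrt(T))
    return NormalDist().cdf(d1)

print(black_scholes_call(S=100, K=100, T=1.0, r=0.02, sigma=0.2))
print(black_scholes_delta(S=100, K=100, T=1.0, r=0.02, sigma=0.2))
```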

Financial Machine Learning · Lecture 04
  • RL Approach
    • RL methods: (deep) Q-learning, PPO, and DDPG
    • states: asset price, current positions, option strikes, and time remaining to expiry.
    • controls: the change in holdings
    • rewards
      • (risk-adjusted) expected wealth/return
      • option payoff
      • (risk-adjusted) hedging cost
Financial Machine Learning · Lecture 04

Market Making

  • The objective in market making is to profit from earning the bid-ask spread without accumulating undesirably large positions (known as inventory)
  • A market maker faces three major sources of risk
    • The inventory risk: the risk of accumulating an undesirable large net inventory, which significantly increases volatility due to market movements
    • The execution risk: the risk that limit orders may not get filled over a desired horizon
    • The adverse selection risk: the situation where there is a directional price movement that sweeps through the limit orders submitted by the market maker such that the price does not revert back by the end of the trading horizon.
Financial Machine Learning · Lecture 04
  • Stochastic Control Approach.
    • a high-frequency market maker trading on a single stock over a finite horizon
    • the mid-price of this stock follows an arithmetic Brownian motion

    • The market maker will continuously propose bid and ask prices, and respectively
    • She will buy and sell shares according to the rate of arrival of market orders at the quoted prices
    • Her inventory

    • quoted prices: and
    • the intensities and
      • depend on the difference between the quoted prices and the reference price (i.e. and
      • the functional form of the intensities is typically exponential in the quoted distance, e.g. $\Lambda(\delta) = A e^{-k \delta}$ (see the sketch below)
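
Under this model (CARA utility and exponential arrival intensities), the Avellaneda-Stoikov approximation yields an inventory-adjusted reservation price and a closed-form spread. The sketch below restates that published result with illustrative parameter values; it is not a derivation of the HJB solution on the next slide.

```python
from math import log

def avellaneda_stoikov_quotes(mid_price, inventory, gamma, sigma, k, time_left):
    """Approximate optimal bid/ask quotes for a CARA market maker (Avellaneda-Stoikov, 2008).

    gamma     : risk-aversion coefficient
    sigma     : mid-price volatility
    k         : decay rate of the exponential order-arrival intensity
    time_left : time remaining until the end of the trading horizon
    """
    # reservation price: mid price shifted against the current inventory
    reservation = mid_price - inventory * gamma * sigma**2 * time_left
    # total optimal spread around the reservation price
    spread = gamma * sigma**2 * time_left + (2.0 / gamma) * log(1.0 + gamma / k)
    bid = reservation - spread / 2.0
    ask = reservation + spread / 2.0
    return bid, ask

# a long inventory of 5 units pushes both quotes down, encouraging sells
print(avellaneda_stoikov_quotes(mid_price=100.0, inventory=5, gamma=0.1, sigma=2.0, k=1.5, time_left=0.5))
```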

Financial Machine Learning · Lecture 04
  • Stochastic Control Approach.
    • the cash process of the market maker

    • the market maker optimizes a constant absolute risk aversion (CARA) utility function:

    • the value function solves the following Hamilton-Jacobi-Bellman equation:

Financial Machine Learning · Lecture 04
  • RL Approach
    • RL methods
      • value-based methods (Q-learning algorithm and SARSA)
      • policy-based methods (deep policy gradient method)
    • states: bid and ask prices, current holdings of assets, order-flow imbalance, volatility, and some sophisticated market indices
    • controls: the spread to post a pair of limit buy and limit sell orders
    • rewards
      • PnL with inventory cost
      • Implementation Shortfall with inventory cost
Financial Machine Learning · Lecture 04

Application: Robo-advising

Stochastic Control Approach

  • the framework
    • a regime switching model of market returns
    • a mechanism of interaction between the client and the robo-advisor
    • a dynamic model (i.e., risk aversion process) for the client's risk preferences
    • an optimal investment criterion
  • the robo-advisor interacts repeatedly with the client and learns about changes in her risk profile
  • The robo-advisor adopts a multi-period mean-variance investment criterion with a finite investment horizon based on the estimate of the client's risk aversion level

Financial Machine Learning · Lecture 04

Application: Smart Order Routing

  • Dark Pools vs. Lit Pools
    • Dark pools are private exchanges for trading securities that are not accessible by the investing public
      • Dark pools were created in order to facilitate block trading by institutional investors who did not wish to impact the markets with their large orders and obtain adverse prices for their trades
      • three types of dark pools: (1) Broker-Dealer-Owned Dark Pools, (2) Agency Broker or Exchange-Owned Dark Pools, and (3) Electronic Market Makers Dark Pools.
    • Lit pools, by contrast, display bid and ask offers in different stocks
      • primary exchanges operate in such a way that available liquidity is displayed at all times and form the bulk of the lit pools available to traders.
Financial Machine Learning · Lecture 04
  • the most important characteristics of different dark pools
    • the chances of being matched with a counterparty
    • the price (dis)advantages
  • characteristics of lit pools
    • the order flows
    • queue sizes
    • cancellation rates
Financial Machine Learning · Lecture 04
  • Increasing adoption of RL for automated trading agents that integrate learning mechanisms.
  • The use of multi-agent systems for modeling competitive trading and enhancing market making.

These trends highlight the rapid evolution and growing importance of RL in the financial landscape.

Financial Machine Learning · Lecture 04

Part 5 · Summary and Discussion

Summary of Key Takeaways

  • Reinforcement Learning is a powerful tool for optimizing financial decisions.
  • The convergence of RL with deep learning is yielding innovative strategies in finance.
  • Ongoing research and advancements will continue to explore new applications and methodologies.

Discussion encourages reflection on RL's transformative potential in finance and its future trajectory.

Financial Machine Learning · Lecture 04

Final Takeaways

  • RL changes decision-making paradigms in finance.
  • Integration of modern RL techniques supports financial strategies.
  • Next lecture: Big Data & ML in Finance
    → Explore big data analytical methods in financial contexts.
Financial Machine Learning · Lecture 04

Application Case Studies in Financial DRL

  • Algorithmic Trading: Employ DRL to develop adaptive trading algorithms responding in real time to market shifts.
  • Portfolio Management: Use DRL for optimizing asset allocations based on evolving market conditions and investor preferences.

Illustrating successful implementations underscores the practical utility of DRL in finance.