Advances of RL in Finance

Hambly, B., Xu, R., & Yang, H. (2023). Recent advances in reinforcement learning in finance. arXiv preprint arXiv:2112.04553.

The Basics of Reinforcement Learning

Setup: Markov Decision Processes

  • Infinite Time Horizon and Discounted Rewards
    • state space $\mathcal{S}$
    • action space $\mathcal{A}$
    • the value function for each policy $\pi$:
      $V^{\pi}(s) = \mathbb{E}^{\pi}\Big[\sum_{t=0}^{\infty} \gamma^{t} r(s_t, a_t)\,\Big|\, s_0 = s\Big]$

subject to the state dynamics $s_{t+1} \sim P(\cdot \mid s_t, a_t)$ and actions $a_t \sim \pi(\cdot \mid s_t)$

    • the Dynamic Programming Principle (DPP):
      $V^{*}(s) = \sup_{a \in \mathcal{A}}\big\{ r(s,a) + \gamma\, \mathbb{E}\big[V^{*}(s_{t+1}) \mid s_t = s, a_t = a\big]\big\}$

We write the value function as $V^{*}(s) = \sup_{a \in \mathcal{A}} Q^{*}(s,a)$, where

  • the $Q$-function:
      $Q^{*}(s,a) = r(s,a) + \gamma\, \mathbb{E}\big[V^{*}(s_{t+1}) \mid s_t = s, a_t = a\big]$

  • the Bellman equation for the $Q$-function:
      $Q^{*}(s,a) = r(s,a) + \gamma\, \mathbb{E}\big[\sup_{a' \in \mathcal{A}} Q^{*}(s_{t+1}, a') \mid s_t = s, a_t = a\big]$

  • Finite Time Horizon

    • the value function:
      $V_{t}^{\pi}(s) = \mathbb{E}^{\pi}\Big[\sum_{u=t}^{T-1} r_u(s_u, a_u) + r_T(s_T)\,\Big|\, s_t = s\Big]$

subject to the state dynamics $s_{u+1} \sim P_u(\cdot \mid s_u, a_u)$ and actions $a_u \sim \pi_u(\cdot \mid s_u)$

  • the Dynamic Programming Principle (DPP):
      $V_{t}^{*}(s) = \sup_{a \in \mathcal{A}}\big\{ r_t(s,a) + \mathbb{E}\big[V_{t+1}^{*}(s_{t+1}) \mid s_t = s, a_t = a\big]\big\}$

  • the terminal condition $V_{T}^{*}(s) = r_T(s)$

We write the value function as $V_{t}^{*}(s) = \sup_{a \in \mathcal{A}} Q_{t}^{*}(s,a)$, where

    • the $Q$-function:
      $Q_{t}^{*}(s,a) = r_t(s,a) + \mathbb{E}\big[V_{t+1}^{*}(s_{t+1}) \mid s_t = s, a_t = a\big]$

  • the Bellman equation for the $Q$-function:
      $Q_{t}^{*}(s,a) = r_t(s,a) + \mathbb{E}\big[\sup_{a' \in \mathcal{A}} Q_{t+1}^{*}(s_{t+1}, a') \mid s_t = s, a_t = a\big]$

  • the terminal condition $Q_{T}^{*}(s,a) = r_T(s)$ for all $a \in \mathcal{A}$.

Theorem. Assume $\gamma \in (0,1)$ and that the rewards are bounded with probability one. For any infinite horizon discounted MDP, there always exists a deterministic stationary policy that is optimal.
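To make the DPP concrete, here is a minimal Python/NumPy sketch of value iteration on a made-up toy MDP (the transition tensor P, reward matrix R, and discount gamma are illustrative, not from the paper); it repeatedly applies the Bellman optimality update and reads off a deterministic stationary policy.

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """P: (S, A, S) transition probabilities, R: (S, A) expected rewards."""
    V = np.zeros(P.shape[0])
    while True:
        # Bellman optimality update: Q(s,a) = r(s,a) + gamma * E[V(s')]
        Q = R + gamma * P @ V            # shape (S, A)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            break
        V = V_new
    policy = Q.argmax(axis=1)            # deterministic stationary policy
    return V, policy

# Toy 2-state, 2-action MDP (illustrative numbers only)
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.7, 0.3], [0.05, 0.95]]])
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])
V, pi = value_iteration(P, R)
print(V, pi)
```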

  • Linear MDPs and Linear Functional Approximation
    • in the infinite horizon setting, an MDP is said to be linear with a feature map $\phi: \mathcal{S}\times\mathcal{A} \to \mathbb{R}^{d}$ if there exist $d$ unknown (signed) measures $\mu = (\mu^{(1)}, \ldots, \mu^{(d)})$ over $\mathcal{S}$ and an unknown vector $\theta \in \mathbb{R}^{d}$ such that for any $(s,a) \in \mathcal{S}\times\mathcal{A}$, we have
      $P(\cdot \mid s,a) = \langle \phi(s,a), \mu(\cdot)\rangle, \qquad r(s,a) = \langle \phi(s,a), \theta\rangle$

  • in the finite horizon setting, an MDP is said to be linear with a feature map $\phi: \mathcal{S}\times\mathcal{A} \to \mathbb{R}^{d}$ if for any $t$, there exist $d$ unknown (signed) measures $\mu_t = (\mu_t^{(1)}, \ldots, \mu_t^{(d)})$ over $\mathcal{S}$ and an unknown vector $\theta_t \in \mathbb{R}^{d}$ such that for any $(s,a) \in \mathcal{S}\times\mathcal{A}$, we have
      $P_t(\cdot \mid s,a) = \langle \phi(s,a), \mu_t(\cdot)\rangle, \qquad r_t(s,a) = \langle \phi(s,a), \theta_t\rangle$

Typically the features $\phi$ are assumed to be known to the agent and bounded, namely, $\|\phi(s,a)\| \le 1$ for all $(s,a) \in \mathcal{S}\times\mathcal{A}$.

  • Nonlinear Functional Approximation
    • nonlinear functional approximations do not require knowledge of the kernel functions a priori
    • The most popular nonlinear functional approximation approach is to use neural networks (the universal approximation theorem)
    • gradient-based algorithms for certain neural network architectures enjoy provable convergence guarantees

From MDP to Learning

  • RL Problem
    • the transition dynamics $P$ and the reward function $r$ for the infinite horizon MDP problem are unknown
    • the aim is to find an optimal stationary policy (if it exists) while simultaneously learning the unknown $P$ and $r$
    • The learning of $P$ and $r$ can be either explicit or implicit
      • model-based RL
      • model-free RL
  • Agent-environment Interface
    • The learner or the decision-maker is called the agent
    • The physical world that the agent operates in and interacts with is called the environment
    • The agent and environment interact at each of a sequence of discrete time steps, $t = 0, 1, 2, \ldots$, in the following way.
      • At the beginning of each time step $t$, the agent receives some representation of the environment's state, $s_t$, and selects an action $a_t$.
      • At the end of this time step, the agent receives a numerical reward $r_t$ (possibly stochastic) and a new state $s_{t+1}$ from the environment.
    • The tuple $(s_t, a_t, r_t, s_{t+1})$ is called a sample at time $t$
    • $\{(s_u, a_u, r_u)\}_{u=0}^{t}$ is referred to as the history or experience up to time $t$
    • An RL algorithm is a finite sequence of well-defined policies for the agent to interact with the environment.
  • Exploration vs Exploitation
    • the agent learns to select her actions based on her past experiences (exploitation) and/or by trying new choices (exploration).
      • Exploration provides opportunities to improve performance from the current sub-optimal solution towards the globally optimal one, yet it is time-consuming and computationally expensive, and over-exploration may impair convergence to the optimal solution.
      • Pure exploitation tends to yield sub-optimal global solutions.
    • An appropriate trade-off between exploration and exploitation is crucial in the design of RL algorithms in order to improve the learning and the optimization performance.
  • Simulator

    • a.k.a. generative model
    • allows the algorithm to query arbitrary state-action pairs and return the reward and the next state
    • the agent can "restart" the system at any time
    • it significantly alleviates the difficulty of exploration
  • Randomized Policy vs Deterministic Policy

    • A randomized policy is also known as a relaxed control (in the control literature) and as a mixed strategy (in game theory).
    • most of the RL algorithms adopt randomized policies to encourage exploration when agents are not certain about the environment.

Performance Evaluation

  • episodic criterion
    • (usually) a finite time horizon
    • one episode contains a sequence of states, actions and rewards, which starts at time 0 and ends at the terminal time
    • the performance is evaluated in terms of the total number of samples in all episodes
    • the episodic criterion can also be used to analyze infinite horizon problems
  • Notations
      • $|\mathcal{S}|$ and $|\mathcal{A}|$: the order of the scaling in the sizes of the state and action spaces (when these are finite)
      • $\gamma$: the discount rate (in the infinite horizon case)
      • $d$: the dimension of the features (when functional approximation is used)
      • the stated bounds only include the dependence on these quantities (and on the accuracy parameters)
      • $\mathrm{poly}(\cdot)$: a polynomial function of its arguments
Sample Complexity
  • Sample complexity: the total number of samples required to find an approximately optimal policy

    • Sample: one observed transition tuple $(s, a, r, s')$
    • can be used for any kind of RL problem
  • Sample complexity for episodic RL

    • the minimum number of samples $N$ such that for all $n \ge N$, the policy $\pi_n$ is $\epsilon$-optimal with probability at least $1-\delta$, i.e., for $n \ge N$,
      $\mathbb{P}\big(V^{\pi_n}(s) \ge V^{*}(s) - \epsilon\big) \ge 1 - \delta$

  • Sample complexity for discounted RL
    • the $V$-sample complexity:
      • the minimum number of samples $N$ such that for all $n \ge N$, $V^{\pi_n}$ is $\epsilon$-close to $V^{*}$, that is
        $\|V^{\pi_n} - V^{*}\|_{\infty} \le \epsilon$

      • it holds either with high probability (that is, with probability at least $1-\delta$) or in the expectation sense
    • the $Q$-sample complexity:
      • the minimum number of samples $N$ such that for all $n \ge N$, the estimate $Q_n$ is $\epsilon$-close to $Q^{*}$, that is
        $\|Q_n - Q^{*}\|_{\infty} \le \epsilon$

      • it holds either with high probability (in that $\mathbb{P}\big(\|Q_n - Q^{*}\|_{\infty} \le \epsilon\big) \ge 1-\delta$) or in the expectation sense
    • sample complexity of exploration
      • the number of samples (time steps) $t$ such that the non-stationary policy $\pi_t$ at time $t$ is not $\epsilon$-optimal for the current state $s_t$
      • it counts the number of mistakes along the whole trajectory
      • is the minimum number of samples $N$ such that for all $n \ge N$, $\pi_n$ is $\epsilon$-optimal when starting from the current state with probability at least $1-\delta$, i.e., for $n \ge N$,
        $\mathbb{P}\big(V^{\pi_n}(s_n) \ge V^{*}(s_n) - \epsilon\big) \ge 1 - \delta$

    • the mean-square sample complexity
      • measures the sample complexity in the average sense with respect to the steady state distribution or the initial distribution $\mu$
      • is defined as the minimum number of samples $N$ such that for all $n \ge N$,
        $\mathbb{E}_{s \sim \mu}\big[\big(V^{\pi_n}(s) - V^{*}(s)\big)^{2}\big] \le \epsilon$

Rate of Convergence
  • uses the relationship between the number of iterations/steps $k$ and the error term $e(k)$ to quantify how fast the learned policy converges to the optimal solution
  • is also called the iteration complexity
  • it calculates the number of iterations needed to reach a given accuracy while ignoring the number of samples used within each iteration.
  • the rate of convergence coincides with the sample complexity when only one sample is used in each iteration
  • the rate of convergence and the sample complexity are of the same order with respect to $\epsilon$ when a constant number of samples is used within each iteration
  • different rates of convergence
    • converge at a linear rate: $e(k) = O(\rho^{k})$ for some $\rho \in (0,1)$
    • converge at a sublinear rate (slower than linear): $e(k) = O(k^{-a})$ with $a > 0$
    • converge at a superlinear rate (faster than linear)
Regret Analysis
  • the difference between the cumulative reward of the optimal policy and that gathered by the algorithm
  • it quantifies the exploration-exploitation trade-off
  • the regret of an algorithm after $K$ episodes, each with $T$ steps, is
    $\mathrm{Regret}(K) = \sum_{k=1}^{K}\Big(V^{*}_{1}(s^{k}_{1}) - V^{\pi_k}_{1}(s^{k}_{1})\Big)$

  • currently there is no regret analysis for RL problems with infinite time horizon and discounted reward
Asymptotic Convergence
  • requires the error term $e(k)$ to decay to zero as $k$ goes to infinity (without specifying the order of convergence)
  • often a first step in the theoretical analysis of RL algorithms.

Classification of RL Algorithms

  • Components of An RL Algorithm
    • model-free RL
      • a representation of a value function that provides a prediction of how good each state or each state-action pair is
      • a direct representation of the policy $\pi(s)$ or $\pi(a \mid s)$
    • model-based RL
      • a model of the environment (the estimated transition function and the estimated reward function) in conjunction with a planning algorithm (any computational process that uses a model to create or improve a policy).
  • model-based algorithms
    • maintain an approximate MDP model by estimating the transition probabilities and the reward function
    • derive a value function from the approximate MDP
    • the policy is then derived from the value function
    • another line of model-based algorithms make structural assumptions on the model, using some prior knowledge, and utilize the structural information for algorithm design
  • model-free algorithms
    • directly learn a value (or state-value) function or the optimal policy without inferring the model
    • model-free algorithms can be further divided into two categories
      • value-based methods: store only a value function without an explicit policy during the learning process
      • policy-based methods: explicitly build a representation of a policy and keep it in memory during learning

Value-based Methods

  • setting
    • infinite time horizon with discounting
    • finite state and action spaces $\mathcal{S}$ and $\mathcal{A}$
    • stationary policies

Temporal-Difference Learning

The agent can update her estimate of the value function at the $(t+1)$-th iteration by

$V(s_t) \leftarrow V(s_t) + \beta_t\big(r_t + \gamma V(s_{t+1}) - V(s_t)\big)$

The TD update can also be written as

$V(s_t) \leftarrow (1-\beta_t)\, V(s_t) + \beta_t\big(r_t + \gamma V(s_{t+1})\big)$

Algorithm 1 TD(0) Method for estimating $V^{\pi}$
1: Input: total number of iterations $T$; the policy $\pi$ used to sample observations; rule to set the learning rate $\beta_t$
2: Initialize $V(s) = 0$ for all $s \in \mathcal{S}$
3: Initialize $s_0$
4: for $t = 0, 1, \ldots, T-1$ do
5: Sample action $a_t$ according to $\pi(\cdot \mid s_t)$
6: Observe $r_t$ and $s_{t+1}$ after taking action $a_t$
7: Update $V(s_t)$ with
8: $V(s_t) \leftarrow V(s_t) + \beta_t\big(r_t + \gamma V(s_{t+1}) - V(s_t)\big)$
9: end for
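A minimal Python/NumPy sketch of the tabular TD(0) update in Algorithm 1; the randomly generated dynamics P, rewards R, and the uniform policy are made-up stand-ins for an environment and a policy to evaluate.

```python
import numpy as np

n_states, n_actions, gamma = 5, 2, 0.9
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # toy dynamics
R = rng.normal(size=(n_states, n_actions))                         # toy rewards

def step(s, a):
    s_next = rng.choice(n_states, p=P[s, a])
    return R[s, a], s_next

def td0(policy, n_iters=50_000, beta=0.05):
    """Estimate V^pi with the TD(0) update V(s) += beta * (r + gamma*V(s') - V(s))."""
    V = np.zeros(n_states)
    s = 0
    for _ in range(n_iters):
        a = rng.choice(n_actions, p=policy[s])
        r, s_next = step(s, a)
        V[s] += beta * (r + gamma * V[s_next] - V[s])  # TD error times learning rate
        s = s_next
    return V

uniform_policy = np.full((n_states, n_actions), 1.0 / n_actions)
print(td0(uniform_policy))
```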

Q-learning Algorithm

  • $Q$-learning is a stochastic approximation to the Bellman equation for the $Q$-function with samples observed by the agent
  • At iteration $t$, the $Q$-function is updated using a sample $(s_t, a_t, r_t, s_{t+1})$ where $s_{t+1} \sim P(\cdot \mid s_t, a_t)$,

$Q(s_t, a_t) \leftarrow (1-\beta_t)\, Q(s_t, a_t) + \beta_t\big(r_t + \gamma \max_{a} Q(s_{t+1}, a)\big)$

Algorithm 2 $Q$-learning with samples from a policy $\pi$
1: Input: total number of iterations $T$; the policy $\pi$ used to generate samples; rule to set the learning rate $\beta_t$
2: Initialize $Q(s,a) = 0$ for all $s \in \mathcal{S}$ and $a \in \mathcal{A}$
3: Initialize $s_0$
4: for $t = 0, 1, \ldots, T-1$ do
5: Sample action $a_t$ according to $\pi(\cdot \mid s_t)$
6: Observe $r_t$ and $s_{t+1}$ after taking action $a_t$
7: Update $Q(s_t, a_t)$ with sample $(s_t, a_t, r_t, s_{t+1})$
8: $Q(s_t, a_t) \leftarrow (1-\beta_t)\, Q(s_t, a_t) + \beta_t\big(r_t + \gamma \max_{a} Q(s_{t+1}, a)\big)$

Theorem (Watkins and Dayan (1992)). Assume $|\mathcal{S}| < \infty$ and $|\mathcal{A}| < \infty$. Let $n^{i}(s,a)$ be the index of the $i$-th time that the action $a$ is used in state $s$. Let $R < \infty$ be a constant. Given bounded rewards $|r_t| \le R$, learning rates $0 \le \beta_t < 1$, and

$\sum_{i=1}^{\infty} \beta_{n^{i}(s,a)} = \infty, \qquad \sum_{i=1}^{\infty} \big(\beta_{n^{i}(s,a)}\big)^{2} < \infty \quad \text{for all } (s,a) \in \mathcal{S} \times \mathcal{A},$

then $Q_t(s,a) \to Q^{*}(s,a)$ for all $(s,a)$ as $t \to \infty$ with probability 1.
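A minimal tabular Q-learning sketch in the spirit of Algorithm 2, using an epsilon-greedy behaviour policy for exploration (an assumption, since any sampling policy could be used); the toy dynamics and rewards are made up.

```python
import numpy as np

n_states, n_actions, gamma = 5, 2, 0.9
rng = np.random.default_rng(1)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # toy dynamics
R = rng.normal(size=(n_states, n_actions))                         # toy rewards

def q_learning(n_iters=100_000, beta=0.1, eps=0.1):
    """Tabular Q-learning with an epsilon-greedy behaviour policy."""
    Q = np.zeros((n_states, n_actions))
    s = 0
    for _ in range(n_iters):
        # epsilon-greedy exploration
        a = rng.integers(n_actions) if rng.random() < eps else int(Q[s].argmax())
        r = R[s, a]
        s_next = rng.choice(n_states, p=P[s, a])
        # off-policy target uses the max over actions at the next state
        Q[s, a] += beta * (r + gamma * Q[s_next].max() - Q[s, a])
        s = s_next
    return Q

Q = q_learning()
print(Q.argmax(axis=1))   # greedy policy w.r.t. the learned Q-function
```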

SARSA

SARSA adopts a policy which is based on the agent's current estimate of the $Q$-function.

Algorithm 3 SARSA: On-policy TD Learning
1: Input: total number of iterations $T$, the learning rate $\beta$, and a small parameter $\epsilon > 0$.
2: Initialize $Q(s,a) = 0$ for all $s \in \mathcal{S}$ and $a \in \mathcal{A}$
3: Initialize $s_0$
4: Initialize $a_0$: Choose $a_0$ when in $s_0$ using a policy derived from $Q$ (e.g., $\epsilon$-greedy)
5: for $t = 0, 1, \ldots, T-1$ do
6: Take action $a_t$, observe reward $r_t$ and the next state $s_{t+1}$
7: Choose $a_{t+1}$ when in $s_{t+1}$ using a policy derived from $Q$ (e.g., $\epsilon$-greedy)
8: $Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \beta\big(r_t + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)\big)$
9: end for
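A matching SARSA sketch on the same kind of made-up toy environment; the only substantive change from the Q-learning sketch is the target, which uses the action actually sampled at the next state rather than the max (on-policy).

```python
import numpy as np

n_states, n_actions, gamma = 5, 2, 0.9
rng = np.random.default_rng(2)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # toy dynamics
R = rng.normal(size=(n_states, n_actions))                         # toy rewards

def eps_greedy(Q, s, eps):
    return rng.integers(n_actions) if rng.random() < eps else int(Q[s].argmax())

def sarsa(n_iters=100_000, beta=0.1, eps=0.1):
    """On-policy TD control: the target uses the action actually chosen at s'."""
    Q = np.zeros((n_states, n_actions))
    s = 0
    a = eps_greedy(Q, s, eps)
    for _ in range(n_iters):
        r = R[s, a]
        s_next = rng.choice(n_states, p=P[s, a])
        a_next = eps_greedy(Q, s_next, eps)          # sampled from the same policy
        Q[s, a] += beta * (r + gamma * Q[s_next, a_next] - Q[s, a])
        s, a = s_next, a_next
    return Q

print(sarsa().argmax(axis=1))
```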

On-policy Learning vs. Off-policy Learning

  • An off-policy agent learns the value of the optimal policy independently of the agent's actions.
  • An on-policy agent learns the value of the policy being carried out by the agent including the exploration steps.

Policy-based Methods

  • parametrization
    • the policy is parametrized by a vector $\theta$ and written $\pi_{\theta}$
    • $\pi_{\theta}(\cdot \mid s_t)$: the probability distribution parameterized by $\theta$ over the action space given the state $s_t$ at time $t$
    • the policy objective function for RL with infinite time horizon
      $J(\theta) = \mathbb{E}^{\pi_{\theta}}\Big[\sum_{t=0}^{\infty} \gamma^{t} r(s_t, a_t)\Big]$

    • the policy objective function for RL with finite time horizon
      $J(\theta) = \mathbb{E}^{\pi_{\theta}}\Big[\sum_{t=0}^{T-1} r_t(s_t, a_t) + r_T(s_T)\Big]$

  • the policy parameter $\theta$ is updated using the gradient ascent rule to maximize $J(\theta)$:
    $\theta_{k+1} = \theta_{k} + \alpha\, \nabla_{\theta} J(\theta_{k})$

Policy Gradient Theorem

Theorem (Policy Gradient Theorem). Assume that $J(\theta)$ is differentiable with respect to $\theta$ and that there exists $\mu^{\pi_{\theta}}$, the stationary distribution of the dynamics under policy $\pi_{\theta}$, which is independent of the initial state $s_0$. Then the policy gradient is
$\nabla_{\theta} J(\theta) = \mathbb{E}_{s \sim \mu^{\pi_{\theta}},\, a \sim \pi_{\theta}(\cdot \mid s)}\big[\, Q^{\pi_{\theta}}(s,a)\, \nabla_{\theta} \ln \pi_{\theta}(a \mid s)\,\big]$

REINFORCE: Monte Carlo Policy Gradient

Algorithm 4 REINFORCE: Monte Carlo Policy Gradient
1: Input: total number of iterations $K$, learning rate $\alpha$, a differentiable policy parametrization $\pi_{\theta}$, finite length of the trajectory $T$.
2: Initialize policy parameter $\theta$.
3: for $k = 0, 1, \ldots, K-1$ do
4: Sample a trajectory $(s_0, a_0, r_0, \ldots, s_{T-1}, a_{T-1}, r_{T-1})$ under $\pi_{\theta}$
5: Initialize $G = 0$
6: for $t = T-1, T-2, \ldots, 0$ do
7: Calculate the return: $G \leftarrow \gamma G + r_t$
8: Update the policy parameter $\theta \leftarrow \theta + \alpha\, \gamma^{t} G\, \nabla_{\theta} \ln \pi_{\theta}(a_t \mid s_t)$
9: end for
10: end for
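A minimal REINFORCE sketch with a tabular softmax policy on a made-up toy MDP; the return is accumulated backwards along each sampled trajectory, and the log-likelihood gradient of the softmax parameterization is computed in closed form.

```python
import numpy as np

n_states, n_actions, gamma, horizon = 4, 2, 0.99, 20
rng = np.random.default_rng(3)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # toy dynamics
R = rng.normal(size=(n_states, n_actions))                         # toy rewards

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def reinforce(n_episodes=2000, alpha=0.01):
    theta = np.zeros((n_states, n_actions))          # tabular softmax policy
    for _ in range(n_episodes):
        # sample one trajectory of fixed length `horizon`
        s, traj = 0, []
        for _ in range(horizon):
            a = rng.choice(n_actions, p=softmax(theta[s]))
            traj.append((s, a, R[s, a]))
            s = rng.choice(n_states, p=P[s, a])
        # Monte Carlo returns and log-likelihood-ratio updates
        G = 0.0
        for t, (s_t, a_t, r_t) in reversed(list(enumerate(traj))):
            G = r_t + gamma * G                      # return from time t onwards
            grad_log = -softmax(theta[s_t])          # d/dtheta[s_t,:] log pi(a_t|s_t)
            grad_log[a_t] += 1.0
            theta[s_t] += alpha * (gamma ** t) * G * grad_log
    return theta

theta = reinforce()
print(softmax(theta[0]))   # learned action distribution in state 0
```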

Actor-Critic Methods

Algorithm 5 Actor-Critic Algorithm
1: Input: A differentiable policy parametrization $\pi_{\theta}$, a differentiable $Q$-function parameterization $Q_{w}$, learning rates $\alpha^{\theta}$ and $\alpha^{w}$, number of iterations $T$
2: Initialize policy parameter $\theta$ and $Q$-function parameter $w$
3: Initialize $s_0$
4: for $t = 0, 1, \ldots, T-1$ do
5: Sample $a_t \sim \pi_{\theta}(\cdot \mid s_t)$
6: Take action $a_t$, observe state $s_{t+1}$ and reward $r_t$
7: Sample action $a_{t+1} \sim \pi_{\theta}(\cdot \mid s_{t+1})$
8: Compute the TD error: $\delta_t = r_t + \gamma Q_{w}(s_{t+1}, a_{t+1}) - Q_{w}(s_t, a_t)$
9: Update $w \leftarrow w + \alpha^{w}\, \delta_t\, \nabla_{w} Q_{w}(s_t, a_t)$
10: $\theta \leftarrow \theta + \alpha^{\theta}\, Q_{w}(s_t, a_t)\, \nabla_{\theta} \ln \pi_{\theta}(a_t \mid s_t)$
11: $s_{t} \leftarrow s_{t+1}$, $a_{t} \leftarrow a_{t+1}$
12: end for
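A minimal tabular actor-critic sketch following the structure of Algorithm 5: a softmax actor and a tabular Q critic updated with the TD error; the toy environment and step sizes are illustrative only.

```python
import numpy as np

n_states, n_actions, gamma = 4, 2, 0.95
rng = np.random.default_rng(4)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # toy dynamics
R = rng.normal(size=(n_states, n_actions))                         # toy rewards

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def actor_critic(n_iters=100_000, alpha_theta=0.01, alpha_w=0.1):
    theta = np.zeros((n_states, n_actions))   # actor: tabular softmax policy
    Qw = np.zeros((n_states, n_actions))      # critic: tabular Q-function
    s = 0
    a = rng.choice(n_actions, p=softmax(theta[s]))
    for _ in range(n_iters):
        r = R[s, a]
        s_next = rng.choice(n_states, p=P[s, a])
        a_next = rng.choice(n_actions, p=softmax(theta[s_next]))
        delta = r + gamma * Qw[s_next, a_next] - Qw[s, a]   # TD error
        Qw[s, a] += alpha_w * delta                          # critic update
        grad_log = -softmax(theta[s])                        # actor update direction
        grad_log[a] += 1.0
        theta[s] += alpha_theta * Qw[s, a] * grad_log
        s, a = s_next, a_next
    return theta, Qw

theta, Qw = actor_critic()
print(theta.argmax(axis=1))
```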

Deep Reinforcement Learning

Parametrize value functions ($V$ and $Q$) or policies ($\pi$) via (deep) neural networks.

Neural Networks

  • Fully-connected Neural Networks (FNN)
    • the simplest neural network architecture where any given neuron is connected to all neurons in the previous layer
    • the functional form of an $L$-layer FNN is (see the sketch after this list)
      $F(x) = W_{L}\,\sigma\big(W_{L-1}\,\sigma(\cdots \sigma(W_{1} x + b_{1}) \cdots) + b_{L-1}\big) + b_{L}$,
      where the $W_{l}$ are weight matrices, the $b_{l}$ are bias vectors, and $\sigma(\cdot)$ is a componentwise activation function
  • Convolutional Neural Networks
    • a kind of feed-forward neural network that is especially popular for image processing
    • in the finance setting CNNs have been successfully applied to price prediction based on inputs which are images containing visualizations of price dynamics and trading volumes
    • two main building blocks
      • convolutional layers: capture local patterns in the images
      • pooling layers: reduce the dimension of the problem and improve the computational efficiency
  • Recurrent Neural Networks
    • are widely used in processing sequential data, including speech, text and financial time series data
    • RNNs are a class of artificial neural networks where connections between units form a directed cycle
    • RNNs can use their internal memory to process arbitrary sequences of inputs and hence are applicable to tasks such as sequential data processing.
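A minimal NumPy sketch of the fully connected forward pass described above; the layer widths, ReLU activation, and random weights are illustrative choices.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def fnn_forward(x, weights, biases):
    """Compute W_L * sigma(... sigma(W_1 x + b_1) ...) + b_L."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = relu(W @ h + b)              # hidden layers with ReLU activation
    return weights[-1] @ h + biases[-1]  # linear output layer

rng = np.random.default_rng(0)
sizes = [3, 16, 16, 1]                   # input dim 3, two hidden layers, scalar output
weights = [rng.normal(size=(m, n)) * 0.1 for n, m in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(m) for m in sizes[1:]]
print(fnn_forward(rng.normal(size=3), weights, biases))
```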

Deep Value-based RL Algorithms

  • Neural Fitted $Q$-learning

    • fitted $Q$-learning: a generalization of the classical $Q$-learning algorithm with functional approximations
    • it is applied in an off-line setting with a pre-collected dataset in the form of tuples $(s, a, r, s')$ with $s' \sim P(\cdot \mid s, a)$ and $r = r(s, a)$
    • when the class of approximation functionals is constrained to neural networks, the algorithm is referred to as Neural Fitted $Q$-learning
  • Deep Q-Network (DQN)

    • the slow update of the target network
    • the use of 'experience replay'
  • Double Deep $Q$-Network (Double DQN)

    • decouples the action selection from the action evaluation (see the sketch below)
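A minimal sketch contrasting the DQN target with the Double DQN target on a random placeholder mini-batch; `q_net` and `q_target` are made-up stand-ins for the online and (slowly updated) target networks.

```python
import numpy as np

gamma = 0.99
rng = np.random.default_rng(0)

# Stand-in networks: in practice these would be neural networks with weights
# theta (online) and a slowly updated target copy.
def q_net(s_batch):      # online network Q(s, .)
    return rng.normal(size=(len(s_batch), 4))

def q_target(s_batch):   # target network Q(s, .)
    return rng.normal(size=(len(s_batch), 4))

s_next = np.arange(32)            # a mini-batch of next states (placeholders)
r = rng.normal(size=32)           # rewards
done = rng.random(32) < 0.1       # episode-termination flags

# DQN target: the max over actions is taken under the *target* network.
y_dqn = r + gamma * (1 - done) * q_target(s_next).max(axis=1)

# Double DQN target: the online network *selects* the action,
# the target network *evaluates* it (decoupling selection from evaluation).
a_star = q_net(s_next).argmax(axis=1)
y_double = r + gamma * (1 - done) * q_target(s_next)[np.arange(32), a_star]
```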

Deep Policy-based RL Algorithms

  • Parameterize the policy by a neural network with parameter $\theta$
    • $\pi_{\theta}(a \mid s)$ is expressed in terms of some function $f_{\theta}(s,a)$ given by a neural network
    • A popular choice of $\pi_{\theta}$ is the softmax parameterization
      $\pi_{\theta}(a \mid s) = \frac{\exp\big(f_{\theta}(s,a)\big)}{\sum_{a' \in \mathcal{A}} \exp\big(f_{\theta}(s,a')\big)}$

  • The policy parameter is updated using the gradient ascent rule given by
    $\theta_{k+1} = \theta_{k} + \alpha\, \nabla_{\theta} J(\theta_{k})$

Algorithm 6 The DDPG Algorithm
1: Input: an actor $\mu_{\theta}$, a critic network $Q_{w}$, learning rates $\alpha^{\theta}$ and $\alpha^{w}$, initial parameters $\theta_0$ and $w_0$
2: Initialize target network parameters $\theta' \leftarrow \theta_0$ and $w' \leftarrow w_0$
3: Initialize replay buffer $\mathcal{D}$
4: for each episode do
5: Initialize state $s_0$
6: for $t = 0, 1, \ldots, T-1$ do
7: Select action $a_t = \mu_{\theta}(s_t) + \epsilon_t$ with exploration noise $\epsilon_t$
8: Execute action $a_t$ and observe reward $r_t$ and observe new state $s_{t+1}$
9: Store transition $(s_t, a_t, r_t, s_{t+1})$ in $\mathcal{D}$
10: if it is time to update then
11: Sample a random mini-batch of $N$ transitions $\{(s_i, a_i, r_i, s_{i+1})\}_{i=1}^{N}$ from $\mathcal{D}$
12: Set the target $y_i = r_i + \gamma\, Q_{w'}\big(s_{i+1}, \mu_{\theta'}(s_{i+1})\big)$.
13: Update the critic by minimizing the loss:
$L(w) = \frac{1}{N}\sum_{i=1}^{N}\big(y_i - Q_{w}(s_i, a_i)\big)^{2}$ with $w \leftarrow w - \alpha^{w}\, \nabla_{w} L(w)$
14: Update the actor by using the sampled policy gradient:
$\nabla_{\theta} J \approx \frac{1}{N}\sum_{i=1}^{N} \nabla_{a} Q_{w}(s_i, a)\big|_{a = \mu_{\theta}(s_i)}\, \nabla_{\theta} \mu_{\theta}(s_i)$ with $\theta \leftarrow \theta + \alpha^{\theta}\, \nabla_{\theta} J$
15: Update the target networks:
$w' \leftarrow \tau w + (1-\tau) w', \qquad \theta' \leftarrow \tau \theta + (1-\tau) \theta'$
16: end if
17: end for
18: end for

Applications in Finance

Optimal Execution

  • the problem
    • a trader who wishes to buy or sell a given amount of a single asset within a given time period
    • the trader seeks strategies that maximize their return from, or alternatively, minimize the cost of, the execution of the transaction
  • The Almgren-Chriss Model
    • a trader is required to sell an amount $q_0$ of an asset, with price $S_0$ at time 0, over the time period $[0, T]$ with trading decisions made at discrete time points $t_k = k\tau$, $k = 1, \ldots, N$, where $\tau = T/N$
    • The final inventory is required to be zero
    • the goal is to determine the liquidation strategy, i.e. the amounts $n_k$ sold in each interval
    • two types of price impact
      • a temporary impact which refers to any temporary price movement due to the supply-demand imbalance caused by the selling
      • a permanent impact, which is a long-term effect on the 'equilibrium' price due to the trading
    • asset price dynamics
      $S_k = S_{k-1} + \sigma \sqrt{\tau}\, \xi_k - \tau\, g\!\big(n_k/\tau\big)$

      • $\sigma$ is the (constant) volatility parameter
      • $\xi_k$ are independent random variables drawn from a distribution with zero mean and unit variance
      • $g$ is a function of the trading rate $n_k/\tau$ that measures the permanent impact
    • The inventory process
      $q_k = q_{k-1} - n_k$

    • the actual price per share received considering the temporary price impact $h$
      $\tilde{S}_k = S_{k-1} - h\!\big(n_k/\tau\big)$

  • The cost of this trading trajectory
    • the difference between the initial book value and the revenue, given by $C = q_0 S_0 - \sum_{k=1}^{N} n_k \tilde{S}_k$
    • mean $\mathbb{E}[C]$ and variance $\mathrm{Var}[C]$

  • the trader's objective function
    $\min_{\{n_k\}} \; \mathbb{E}[C] + \lambda\, \mathrm{Var}[C]$

  • Solution
    • with linear price impacts $g(v) = \gamma v$ and $h(v) = \eta v$

    • the general solution for the Almgren-Chriss model is the trade schedule
      $n_j = \frac{2 \sinh(\kappa \tau / 2)}{\sinh(\kappa T)} \cosh\!\big(\kappa (T - t_{j-1/2})\big)\, q_0, \quad j = 1, \ldots, N,$

with
      $\cosh(\kappa \tau) = 1 + \frac{\lambda \sigma^{2} \tau^{2}}{2 \tilde{\eta}}, \qquad \tilde{\eta} = \eta\Big(1 - \frac{\gamma \tau}{2 \eta}\Big), \qquad t_{j-1/2} = \big(j - \tfrac{1}{2}\big)\tau$

    • The corresponding optimal inventory trajectory is
      $x_j = \frac{\sinh\!\big(\kappa (T - t_j)\big)}{\sinh(\kappa T)}\, q_0, \quad j = 0, 1, \ldots, N$
      (see the sketch after this list)

  • Evaluation Criteria

    • the Profit and Loss (PnL): the final profit or loss induced by a given execution algorithm over the whole time period, which is made up of transactions at all time points
    • the Implementation Shortfall: the difference between the PnL of the algorithm and the PnL received by trading the entire amount of the asset instantly
    • the Sharpe ratio: the ratio of expected return to standard deviation of the return
  • Benchmark Algorithms

    • Time-Weighted Average Price (TWAP)
    • Volume-Weighted Average Price (VWAP)
    • Submit and Leave (SnL) policy
  • RL Approach
    • RL methods
      • -learning algorithms
      • (double) DQN
      • Policy-based algorithms: (deep) policy gradient methods (II1, A2C, PPO, and DDPG)
    • states: time stamp, the market attributes including (mid-) price of the asset and/or the spread, the inventory process and past returns
    • controls: the amount of asset (using market orders) to trade and/or the relative price level (using limit orders) at each time point
    • rewards
      • cash inflow or outflow (depending on whether we sell or buy)
      • implementation shortfall
      • profit
      • Sharpe ratio
      • return
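A minimal sketch of the closed-form Almgren-Chriss schedule under linear impact, following the solution quoted above; the parameter values are illustrative and not calibrated to any market.

```python
import numpy as np

def almgren_chriss_trajectory(q0, T, N, sigma, eta, gamma_perm, lam):
    """Closed-form optimal liquidation schedule under linear price impact.

    Returns (x, n): x[j] is the inventory held after trade j (x[0] = q0,
    x[N] = 0) and n[j] is the number of shares sold in interval j.
    """
    tau = T / N
    eta_tilde = eta * (1.0 - gamma_perm * tau / (2.0 * eta))   # adjusted temporary impact
    kappa = np.arccosh(1.0 + lam * sigma**2 * tau**2 / (2.0 * eta_tilde)) / tau
    t = tau * np.arange(N + 1)
    x = q0 * np.sinh(kappa * (T - t)) / np.sinh(kappa * T)     # optimal inventory path
    n = -np.diff(x)                                             # shares sold per interval
    return x, n

# Illustrative parameters only (not calibrated to any market)
x, n = almgren_chriss_trajectory(q0=1e6, T=1.0, N=10, sigma=0.3,
                                 eta=2.5e-6, gamma_perm=2.5e-7, lam=1e-6)
print(np.round(x, 0))
```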

(multi-period mean-variance) Portfolio Optimization

  • setting
    • risky assets in the market
    • an investor enters the market at time 0 with initial wealth
    • reallocate his wealth at each time point among the assets to achieve the optimal trade-off between the return and the risk of the investment
    • the random rates of return of the assets at :
      • The vectors , are assumed to be statistically independent
      • mean:
      • standard deviation: for and
      • The covariance matrix:
    • the wealth of the investor at time :
    • the amount invested in the -th asset at time :
    • the amount invested in the -th asset:
    • investment strategy: for
  • the model
    • objective

    • constraints

  • the solution

  • RL Approach
    • RL methods
      • value-based methods (Q-learning, SARSA, and DQN)
      • policy-based algorithms (DPG and DDPG)
    • states: time, asset prices, asset past returns, current holdings of assets, and remaining balance
    • controls: the amount/proportion of wealth invested in each component of the portfolio
    • rewards
      • portfolio return
      • (differential) Sharpe ratio
      • profit

Option Pricing and Hedging

  • The Black-Scholes Model
    • The underlying stock price follows a geometric Brownian motion
      $dS_t = \mu S_t\, dt + \sigma S_t\, dW_t$

    • the Black-Scholes-Merton partial differential equation for the option value $V(t, S)$
      $\frac{\partial V}{\partial t} + \frac{1}{2}\sigma^{2} S^{2} \frac{\partial^{2} V}{\partial S^{2}} + r S \frac{\partial V}{\partial S} - r V = 0$

    • solution (European call with strike $K$ and maturity $T$; see the sketch after this list)
      $C(t, S) = S\, \Phi(d_1) - K e^{-r(T-t)}\, \Phi(d_2), \qquad d_1 = \frac{\ln(S/K) + (r + \sigma^{2}/2)(T-t)}{\sigma\sqrt{T-t}}, \qquad d_2 = d_1 - \sigma\sqrt{T-t},$
      where $\Phi$ is the standard normal cumulative distribution function

  • RL Approach
    • RL methods: (deep) Q-learning, PPO, and DDPG
    • states: asset price, current positions, option strikes, and time remaining to expiry.
    • controls: the change in holdings
    • rewards
      • (risk-adjusted) expected wealth/return
      • option payoff
      • (risk-adjusted) hedging cost
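A minimal sketch of the Black-Scholes call price and delta referenced above, useful as the classical benchmark against which RL hedging policies are compared; parameters are illustrative.

```python
from math import log, sqrt, exp, erf

def norm_cdf(x):
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def bs_call_price_delta(S, K, T, r, sigma):
    """Black-Scholes price and delta of a European call with maturity T (in years)."""
    d1 = (log(S / K) + (r + 0.5 * sigma**2) * T) / (sigma * sqrt(T))
    d2 = d1 - sigma * sqrt(T)
    price = S * norm_cdf(d1) - K * exp(-r * T) * norm_cdf(d2)
    delta = norm_cdf(d1)          # hedge ratio: shares of stock per option
    return price, delta

# Illustrative parameters only
print(bs_call_price_delta(S=100.0, K=100.0, T=0.5, r=0.01, sigma=0.2))
```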

Market Making

  • The objective in market making is to profit from earning the bid-ask spread without accumulating undesirably large positions (known as inventory)
  • A market maker faces three major sources of risk
    • The inventory risk: the risk of accumulating an undesirably large net inventory, which significantly increases volatility due to market movements
    • The execution risk: the risk that limit orders may not get filled over a desired horizon
    • the adverse selection risk: the situation where there is a directional price movement that sweeps through the limit orders submitted by the market maker such that the price does not revert back by the end of the trading horizon.
  • Stochastic Control Approach.

    • a high-frequency market maker trading on a single stock over a finite horizon $[0, T]$
    • the mid-price of this stock follows an arithmetic Brownian motion
      $dS_t = \sigma\, dW_t$

    • The market maker will continuously propose bid and ask prices, $S^{b}_t$ and $S^{a}_t$ respectively
    • She will buy and sell shares according to the rate of arrival of market orders at the quoted prices
    • Her inventory is
      $q_t = N^{b}_t - N^{a}_t,$
      where $N^{b}_t$ and $N^{a}_t$ are the numbers of shares bought and sold up to time $t$

    • quoted prices: $S^{b}_t = S_t - \delta^{b}_t$ and $S^{a}_t = S_t + \delta^{a}_t$
    • the intensities $\lambda^{b}$ and $\lambda^{a}$ of the arrival of buy and sell market orders
      • depend on the differences between the quoted prices and the reference price (i.e. $\delta^{b}_t$ and $\delta^{a}_t$)
      • the functional form
        $\lambda^{b}(\delta^{b}) = A e^{-k \delta^{b}}, \qquad \lambda^{a}(\delta^{a}) = A e^{-k \delta^{a}}$

    • the cash process of the market maker
      $dX_t = (S_t + \delta^{a}_t)\, dN^{a}_t - (S_t - \delta^{b}_t)\, dN^{b}_t$

    • the market maker optimizes a constant absolute risk aversion (CARA) utility function:
      $\max_{\{\delta^{a}_t\},\, \{\delta^{b}_t\}} \; \mathbb{E}\big[-\exp\big(-\gamma (X_T + q_T S_T)\big)\big]$

    • the value function $u(t, x, s, q)$ solves the following Hamilton-Jacobi-Bellman equation (see the sketch after this list for the well-known approximate quotes):
      $\partial_t u + \tfrac{1}{2}\sigma^{2}\, \partial_{ss} u + \max_{\delta^{b}} \lambda^{b}(\delta^{b})\big[u(t, x - s + \delta^{b}, s, q+1) - u(t,x,s,q)\big] + \max_{\delta^{a}} \lambda^{a}(\delta^{a})\big[u(t, x + s + \delta^{a}, s, q-1) - u(t,x,s,q)\big] = 0,$
      with terminal condition $u(T, x, s, q) = -\exp\big(-\gamma(x + q s)\big)$
  • RL Approach
    • RL methods
      • value-based methods (Q-learning algorithm and SARSA)
      • policy-based methods (deep policy gradient method)
    • states: bid and ask prices, current holdings of assets, order-flow imbalance, volatility, and some sophisticated market indices
    • controls: the spread to post a pair of limit buy and limit sell orders
    • rewards
      • PnL with inventory cost
      • Implementation Shortfall with inventory cost
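A minimal sketch of the well-known approximate closed-form quotes for the Avellaneda-Stoikov problem above (a reservation price shifted against inventory plus a symmetric spread); this approximation, and the parameter values, are assumptions for illustration rather than the survey's own algorithm.

```python
import numpy as np

def avellaneda_stoikov_quotes(s, q, t, T, sigma, gamma, k):
    """Approximate optimal bid/ask quotes for the CARA market-making problem.

    s: mid-price, q: current (signed) inventory, t: current time, T: horizon,
    sigma: mid-price volatility, gamma: risk aversion, k: intensity decay.
    """
    # Reservation price: shifts the quotes against the current inventory.
    r = s - q * gamma * sigma**2 * (T - t)
    # Total spread posted around the reservation price.
    spread = gamma * sigma**2 * (T - t) + (2.0 / gamma) * np.log(1.0 + gamma / k)
    bid = r - spread / 2.0
    ask = r + spread / 2.0
    return bid, ask

# Illustrative parameters only
print(avellaneda_stoikov_quotes(s=100.0, q=3, t=0.0, T=1.0,
                                sigma=2.0, gamma=0.1, k=1.5))
```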

Robo-advising

  • Stochastic Control Approach
    • the framework
      • a regime switching model of market returns
      • a mechanism of interaction between the client and the robo-advisor
      • a dynamic model (i.e., risk aversion process) for the client's risk preferences
      • an optimal investment criterion
    • the robo-advisor interacts repeatedly with the client and learns about changes in her risk profile
    • The robo-advisor adopts a multi-period mean-variance investment criterion with a finite investment horizon based on the estimate of the client's risk aversion level
  • RL Approach
    • an exploration-exploitation algorithm to learn the investor's risk appetite over time by observing her portfolio choices in different market environments

      • states: The set of various market environments of interest is formulated as the state space
      • controls: places an investor's capital into one of several pre-constructed portfolios
    • an investment robo-advising framework consisting of two agents

      • an inverse portfolio optimization agent, which infers an investor's risk preference and expected return directly from historical allocation data using online inverse optimization
      • a deep RL agent, which aggregates the inferred sequence of expected returns to formulate a new multi-period mean-variance portfolio optimization problem

Smart Order Routing

  • Dark Pools vs. Lit Pools
    • Dark pools are private exchanges for trading securities that are not accessible by the investing public
      • Dark pools were created in order to facilitate block trading by institutional investors who did not wish to impact the markets with their large orders and obtain adverse prices for their trades
      • three types of dark pools: (1) Broker-Dealer-Owned Dark Pools, (2) Agency Broker or Exchange-Owned Dark Pools, and (3) Electronic Market Makers Dark Pools.
    • Lit pools do display bid offers and ask offers in different stocks
      • primary exchanges operate in such a way that available liquidity is displayed at all times and form the bulk of the lit pools available to traders.
  • the most important characteristics of different dark pools
    • the chances of being matched with a counterparty
    • the price (dis)advantages
  • characteristics of lit pools
    • the order flows
    • queue sizes
    • cancellation rates

Further Developments for Mathematical Finance and Reinforcement Learning

  • Offline Learning and Online Exploration

  • Learning with a Limited Exploration Budget

  • Learning with Multiple Objectives

  • Learning to Allocate Across Lit Pools and Dark Pools

  • Robo-advising in a Model-free Setting

  • Sample Efficiency in Learning Trading Strategies

  • Transfer Learning and Cold Start for Learning New Assets

More on RL in Finance

Rao, A., & Jelvis, T. (2022). Foundations of Reinforcement Learning with Applications in Finance. CRC Press.
