Advances of RL in Finance

Hambly, B., Xu, R., & Yang, H. (2023). Recent advances in reinforcement learning in finance. arXiv preprint arXiv:2112.04553.

The Basics of Reinforcement Learning

Setup: Markov Decision Processes

  • Infinite Time Horizon and Discounted Rewards
    • state space $\mathcal{S}$
    • action space $\mathcal{A}$
    • the value function for each policy $\pi$:
      $V^{\pi}(s) = \mathbb{E}^{\pi}\Big[\sum_{t=0}^{\infty} \gamma^{t} r(s_t, a_t)\,\Big|\, s_0 = s\Big]$

subject to the state dynamics $s_{t+1} \sim P(\cdot \mid s_t, a_t)$ and actions $a_t \sim \pi(\cdot \mid s_t)$

    • the Dynamic Programming Principle (DPP):
      $V^{*}(s) = \sup_{a \in \mathcal{A}}\big\{ r(s,a) + \gamma\, \mathbb{E}\big[V^{*}(s_{t+1}) \mid s_t = s, a_t = a\big]\big\}$

We write the value function as $V^{*}(s) = \sup_{a \in \mathcal{A}} Q^{*}(s,a)$, where

  • the $Q$-function:
      $Q^{*}(s,a) = r(s,a) + \gamma\, \mathbb{E}\big[V^{*}(s_{t+1}) \mid s_t = s, a_t = a\big]$

  • the Bellman equation for the $Q$-function:
      $Q^{*}(s,a) = r(s,a) + \gamma\, \mathbb{E}\big[\sup_{a' \in \mathcal{A}} Q^{*}(s_{t+1}, a') \mid s_t = s, a_t = a\big]$

  • Finite Time Horizon

    • the value function:
      $V_{t}^{\pi}(s) = \mathbb{E}^{\pi}\Big[\sum_{u=t}^{T-1} r_u(s_u, a_u) + r_T(s_T)\,\Big|\, s_t = s\Big]$

subject to the state dynamics $s_{u+1} \sim P_u(\cdot \mid s_u, a_u)$ and actions $a_u \sim \pi_u(\cdot \mid s_u)$

  • the Dynamic Programming Principle (DPP):
      $V_{t}^{*}(s) = \sup_{a \in \mathcal{A}}\big\{ r_t(s,a) + \mathbb{E}\big[V_{t+1}^{*}(s_{t+1}) \mid s_t = s, a_t = a\big]\big\}$

  • the terminal condition $V_{T}^{*}(s) = r_T(s)$

We write the value function as $V_{t}^{*}(s) = \sup_{a \in \mathcal{A}} Q_{t}^{*}(s,a)$, where

    • the $Q$-function:
      $Q_{t}^{*}(s,a) = r_t(s,a) + \mathbb{E}\big[V_{t+1}^{*}(s_{t+1}) \mid s_t = s, a_t = a\big]$

  • the Bellman equation for the $Q$-function:
      $Q_{t}^{*}(s,a) = r_t(s,a) + \mathbb{E}\big[\sup_{a' \in \mathcal{A}} Q_{t+1}^{*}(s_{t+1}, a') \mid s_t = s, a_t = a\big]$

  • the terminal condition $Q_{T}^{*}(s,a) = r_T(s)$ for all $a \in \mathcal{A}$.

Theorem. Assume $\gamma \in (0,1)$ and that the rewards are bounded with probability one. For any infinite horizon discounted MDP, there always exists a deterministic stationary policy that is optimal.
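To make the DPP concrete, here is a minimal Python/NumPy sketch of value iteration on a made-up toy MDP (the transition tensor P, reward matrix R, and discount gamma are illustrative, not from the paper); it repeatedly applies the Bellman optimality update and reads off a deterministic stationary policy.

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """P: (S, A, S) transition probabilities, R: (S, A) expected rewards."""
    V = np.zeros(P.shape[0])
    while True:
        # Bellman optimality update: Q(s,a) = r(s,a) + gamma * E[V(s')]
        Q = R + gamma * P @ V            # shape (S, A)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            break
        V = V_new
    policy = Q.argmax(axis=1)            # deterministic stationary policy
    return V, policy

# Toy 2-state, 2-action MDP (illustrative numbers only)
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.7, 0.3], [0.05, 0.95]]])
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])
V, pi = value_iteration(P, R)
print(V, pi)
```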

  • Linear MDPs and Linear Functional Approximation
    • in the infinite horizon setting, an MDP is said to be linear with a feature map $\phi: \mathcal{S}\times\mathcal{A} \to \mathbb{R}^{d}$ if there exist $d$ unknown (signed) measures $\mu = (\mu^{(1)}, \ldots, \mu^{(d)})$ over $\mathcal{S}$ and an unknown vector $\theta \in \mathbb{R}^{d}$ such that for any $(s,a) \in \mathcal{S}\times\mathcal{A}$, we have
      $P(\cdot \mid s,a) = \langle \phi(s,a), \mu(\cdot)\rangle, \qquad r(s,a) = \langle \phi(s,a), \theta\rangle$

  • in the finite horizon setting, an MDP is said to be linear with a feature map $\phi: \mathcal{S}\times\mathcal{A} \to \mathbb{R}^{d}$ if for any $t$, there exist $d$ unknown (signed) measures $\mu_t = (\mu_t^{(1)}, \ldots, \mu_t^{(d)})$ over $\mathcal{S}$ and an unknown vector $\theta_t \in \mathbb{R}^{d}$ such that for any $(s,a) \in \mathcal{S}\times\mathcal{A}$, we have
      $P_t(\cdot \mid s,a) = \langle \phi(s,a), \mu_t(\cdot)\rangle, \qquad r_t(s,a) = \langle \phi(s,a), \theta_t\rangle$

Typically the features $\phi$ are assumed to be known to the agent and bounded, namely, $\|\phi(s,a)\| \le 1$ for all $(s,a) \in \mathcal{S}\times\mathcal{A}$.

  • Nonlinear Functional Approximation
    • nonlinear functional approximations do not require knowledge of the kernel functions a priori
    • The most popular nonlinear functional approximation approach is to use neural networks (the universal approximation theorem)
    • gradient-based algorithms for certain neural network architectures enjoy provable convergence guarantees

From MDP to Learning

  • RL Problem
    • the transition dynamics $P$ and the reward function $r$ for the infinite horizon MDP problem are unknown
    • the aim is to find an optimal stationary policy (if it exists) while simultaneously learning the unknown $P$ and $r$
    • The learning of $P$ and $r$ can be either explicit or implicit
      • model-based RL
      • model-free RL
  • Agent-environment Interface
    • The learner or the decision-maker is called the agent
    • The physical world that the agent operates in and interacts with is called the environment
    • The agent and environment interact at each of a sequence of discrete time steps, $t = 0, 1, 2, \ldots$, in the following way.
      • At the beginning of each time step $t$, the agent receives some representation of the environment's state, $s_t$, and selects an action $a_t$.
      • At the end of this time step, the agent receives a numerical reward $r_t$ (possibly stochastic) and a new state $s_{t+1}$ from the environment.
    • The tuple $(s_t, a_t, r_t, s_{t+1})$ is called a sample at time $t$
    • $\{(s_u, a_u, r_u)\}_{u=0}^{t}$ is referred to as the history or experience up to time $t$
    • An RL algorithm is a finite sequence of well-defined policies for the agent to interact with the environment.
  • Exploration vs Exploitation
    • the agent learns to select her actions based on her past experiences (exploitation) and/or by trying new choices (exploration).
      • Exploration provides opportunities to improve performance from the current sub-optimal solution towards the globally optimal one, yet it is time-consuming and computationally expensive, and over-exploration may impair convergence to the optimal solution.
      • Pure exploitation tends to yield sub-optimal global solutions.
    • An appropriate trade-off between exploration and exploitation is crucial in the design of RL algorithms in order to improve the learning and the optimization performance.
  • Simulator

    • a.k.a. generative model
    • allows the algorithm to query arbitrary state-action pairs and return the reward and the next state
    • the agent can "restart" the system at any time
    • it significantly alleviates the difficulty of exploration
  • Randomized Policy vs Deterministic Policy

    • A randomized policy is also known as a relaxed control (in the control literature) and as a mixed strategy (in game theory).
    • most of the RL algorithms adopt randomized policies to encourage exploration when agents are not certain about the environment.

Performance Evaluation

  • episodic criterion
    • (usually) a finite time horizon
    • one episode contains a sequence of states, actions and rewards, which starts at time 0 and ends at the terminal time
    • the performance is evaluated in terms of the total number of samples in all episodes
    • the episodic criterion can also be used to analyze infinite horizon problems
  • Notations
      • $|\mathcal{S}|$ and $|\mathcal{A}|$: the order of the scaling in the sizes of the state and action spaces (when these are finite)
      • $\gamma$: the discount rate (in the infinite horizon case)
      • $d$: the dimension of the features (when functional approximation is used)
      • the stated bounds only include the dependence on these quantities (and on the accuracy parameters)
      • $\mathrm{poly}(\cdot)$: a polynomial function of its arguments
Sample Complexity
  • Sample complexity: the total number of samples required to find an approximately optimal policy

    • Sample: one observed transition tuple $(s, a, r, s')$
    • can be used for any kind of RL problem
  • Sample complexity for episodic RL

    • the minimum number of samples $N$ such that for all $n \ge N$, the policy $\pi_n$ is $\epsilon$-optimal with probability at least $1-\delta$, i.e., for $n \ge N$,
      $\mathbb{P}\big(V^{\pi_n}(s) \ge V^{*}(s) - \epsilon\big) \ge 1 - \delta$

  • Sample complexity for discounted RL
    • the $V$-sample complexity:
      • the minimum number of samples $N$ such that for all $n \ge N$, $V^{\pi_n}$ is $\epsilon$-close to $V^{*}$, that is
        $\|V^{\pi_n} - V^{*}\|_{\infty} \le \epsilon$

      • it holds either with high probability (that is, with probability at least $1-\delta$) or in the expectation sense
    • the $Q$-sample complexity:
      • the minimum number of samples $N$ such that for all $n \ge N$, the estimate $Q_n$ is $\epsilon$-close to $Q^{*}$, that is
        $\|Q_n - Q^{*}\|_{\infty} \le \epsilon$

      • it holds either with high probability (in that $\mathbb{P}\big(\|Q_n - Q^{*}\|_{\infty} \le \epsilon\big) \ge 1-\delta$) or in the expectation sense
    • sample complexity of exploration
      • the number of samples (time steps) $t$ such that the non-stationary policy $\pi_t$ at time $t$ is not $\epsilon$-optimal for the current state $s_t$
      • it counts the number of mistakes along the whole trajectory
      • is the minimum number of samples $N$ such that for all $n \ge N$, $\pi_n$ is $\epsilon$-optimal when starting from the current state with probability at least $1-\delta$, i.e., for $n \ge N$,
        $\mathbb{P}\big(V^{\pi_n}(s_n) \ge V^{*}(s_n) - \epsilon\big) \ge 1 - \delta$

    • the mean-square sample complexity
      • measures the sample complexity in the average sense with respect to the steady state distribution or the initial distribution $\mu$
      • is defined as the minimum number of samples $N$ such that for all $n \ge N$,
        $\mathbb{E}_{s \sim \mu}\big[\big(V^{\pi_n}(s) - V^{*}(s)\big)^{2}\big] \le \epsilon$

Rate of Convergence
  • uses the relationship between the number of iterations/steps $k$ and the error term $e(k)$ to quantify how fast the learned policy converges to the optimal solution
  • is also called the iteration complexity
  • it calculates the number of iterations needed to reach a given accuracy while ignoring the number of samples used within each iteration.
  • the rate of convergence coincides with the sample complexity when only one sample is used in each iteration
  • the rate of convergence and the sample complexity are of the same order with respect to $\epsilon$ when a constant number of samples is used within each iteration
  • different rates of convergence
    • converge at a linear rate: $e(k) = O(\rho^{k})$ for some $\rho \in (0,1)$
    • converge at a sublinear rate (slower than linear): $e(k) = O(k^{-a})$ with $a > 0$
    • converge at a superlinear rate (faster than linear)
Regret Analysis
  • the difference between the cumulative reward of the optimal policy and that gathered by the algorithm
  • it quantifies the exploration-exploitation trade-off
  • the regret of an algorithm after $K$ episodes, each with $T$ steps, is
    $\mathrm{Regret}(K) = \sum_{k=1}^{K}\Big(V^{*}_{1}(s^{k}_{1}) - V^{\pi_k}_{1}(s^{k}_{1})\Big)$

  • currently there is no regret analysis for RL problems with infinite time horizon and discounted reward
Asymptotic Convergence
  • requires the error term $e(k)$ to decay to zero as $k$ goes to infinity (without specifying the order of convergence)
  • often a first step in the theoretical analysis of RL algorithms.

Classification of RL Algorithms

  • Components of An RL Algorithm
    • model-free RL
      • a representation of a value function that provides a prediction of how good each state or each state-action pair is
      • a direct representation of the policy $\pi(s)$ or $\pi(a \mid s)$
    • model-based RL
      • a model of the environment (the estimated transition function and the estimated reward function) in conjunction with a planning algorithm (any computational process that uses a model to create or improve a policy).
  • model-based algorithms
    • maintain an approximate MDP model by estimating the transition probabilities and the reward function
    • derive a value function from the approximate MDP
    • the policy is then derived from the value function
    • another line of model-based algorithms make structural assumptions on the model, using some prior knowledge, and utilize the structural information for algorithm design
  • model-free algorithms
    • directly learn a value (or state-value) function or the optimal policy without inferring the model
    • model-free algorithms can be further divided into two categories
      • value-based methods: store only a value function without an explicit policy during the learning process
      • policy-based methods: explicitly build a representation of a policy and keep it in memory during learning

Value-based Methods

  • setting
    • infinite time horizon with discounting
    • finite state and action spaces $\mathcal{S}$ and $\mathcal{A}$
    • stationary policies

Temporal-Difference Learning

The agent can update her estimate of the value function at the $(t+1)$-th iteration by

$V(s_t) \leftarrow V(s_t) + \beta_t\big(r_t + \gamma V(s_{t+1}) - V(s_t)\big)$

The TD update can also be written as

$V(s_t) \leftarrow (1-\beta_t)\, V(s_t) + \beta_t\big(r_t + \gamma V(s_{t+1})\big)$

Algorithm 1 TD(0) Method for estimating $V^{\pi}$
1: Input: total number of iterations $T$; the policy $\pi$ used to sample observations; rule to set the learning rate $\beta_t$
2: Initialize $V(s) = 0$ for all $s \in \mathcal{S}$
3: Initialize $s_0$
4: for $t = 0, 1, \ldots, T-1$ do
5: Sample action $a_t$ according to $\pi(\cdot \mid s_t)$
6: Observe $r_t$ and $s_{t+1}$ after taking action $a_t$
7: Update $V(s_t)$ with
8: $V(s_t) \leftarrow V(s_t) + \beta_t\big(r_t + \gamma V(s_{t+1}) - V(s_t)\big)$
9: end for
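A minimal Python/NumPy sketch of the tabular TD(0) update in Algorithm 1; the randomly generated dynamics P, rewards R, and the uniform policy are made-up stand-ins for an environment and a policy to evaluate.

```python
import numpy as np

n_states, n_actions, gamma = 5, 2, 0.9
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # toy dynamics
R = rng.normal(size=(n_states, n_actions))                         # toy rewards

def step(s, a):
    s_next = rng.choice(n_states, p=P[s, a])
    return R[s, a], s_next

def td0(policy, n_iters=50_000, beta=0.05):
    """Estimate V^pi with the TD(0) update V(s) += beta * (r + gamma*V(s') - V(s))."""
    V = np.zeros(n_states)
    s = 0
    for _ in range(n_iters):
        a = rng.choice(n_actions, p=policy[s])
        r, s_next = step(s, a)
        V[s] += beta * (r + gamma * V[s_next] - V[s])  # TD error times learning rate
        s = s_next
    return V

uniform_policy = np.full((n_states, n_actions), 1.0 / n_actions)
print(td0(uniform_policy))
```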

Q-learning Algorithm

  • $Q$-learning is a stochastic approximation to the Bellman equation for the $Q$-function with samples observed by the agent
  • At iteration $t$, the $Q$-function is updated using a sample $(s_t, a_t, r_t, s_{t+1})$ where $s_{t+1} \sim P(\cdot \mid s_t, a_t)$,

$Q(s_t, a_t) \leftarrow (1-\beta_t)\, Q(s_t, a_t) + \beta_t\big(r_t + \gamma \max_{a} Q(s_{t+1}, a)\big)$

Algorithm 2 $Q$-learning with samples from a policy $\pi$
1: Input: total number of iterations $T$; the policy $\pi$ used to generate samples; rule to set the learning rate $\beta_t$
2: Initialize $Q(s,a) = 0$ for all $s \in \mathcal{S}$ and $a \in \mathcal{A}$
3: Initialize $s_0$
4: for $t = 0, 1, \ldots, T-1$ do
5: Sample action $a_t$ according to $\pi(\cdot \mid s_t)$
6: Observe $r_t$ and $s_{t+1}$ after taking action $a_t$
7: Update $Q(s_t, a_t)$ with sample $(s_t, a_t, r_t, s_{t+1})$
8: $Q(s_t, a_t) \leftarrow (1-\beta_t)\, Q(s_t, a_t) + \beta_t\big(r_t + \gamma \max_{a} Q(s_{t+1}, a)\big)$

Theorem (Watkins and Dayan (1992)). Assume $|\mathcal{S}| < \infty$ and $|\mathcal{A}| < \infty$. Let $n^{i}(s,a)$ be the index of the $i$-th time that the action $a$ is used in state $s$. Let $R < \infty$ be a constant. Given bounded rewards $|r_t| \le R$, learning rates $0 \le \beta_t < 1$, and

$\sum_{i=1}^{\infty} \beta_{n^{i}(s,a)} = \infty, \qquad \sum_{i=1}^{\infty} \big(\beta_{n^{i}(s,a)}\big)^{2} < \infty \quad \text{for all } (s,a) \in \mathcal{S} \times \mathcal{A},$

then $Q_t(s,a) \to Q^{*}(s,a)$ for all $(s,a)$ as $t \to \infty$ with probability 1.
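A minimal tabular Q-learning sketch in the spirit of Algorithm 2, using an epsilon-greedy behaviour policy for exploration (an assumption, since any sampling policy could be used); the toy dynamics and rewards are made up.

```python
import numpy as np

n_states, n_actions, gamma = 5, 2, 0.9
rng = np.random.default_rng(1)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # toy dynamics
R = rng.normal(size=(n_states, n_actions))                         # toy rewards

def q_learning(n_iters=100_000, beta=0.1, eps=0.1):
    """Tabular Q-learning with an epsilon-greedy behaviour policy."""
    Q = np.zeros((n_states, n_actions))
    s = 0
    for _ in range(n_iters):
        # epsilon-greedy exploration
        a = rng.integers(n_actions) if rng.random() < eps else int(Q[s].argmax())
        r = R[s, a]
        s_next = rng.choice(n_states, p=P[s, a])
        # off-policy target uses the max over actions at the next state
        Q[s, a] += beta * (r + gamma * Q[s_next].max() - Q[s, a])
        s = s_next
    return Q

Q = q_learning()
print(Q.argmax(axis=1))   # greedy policy w.r.t. the learned Q-function
```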

SARSA

SARSA adopts a policy which is based on the agent's current estimate of the $Q$-function.

Algorithm 3 SARSA: On-policy TD Learning
1: Input: total number of iterations $T$, the learning rate $\beta$, and a small parameter $\epsilon > 0$.
2: Initialize $Q(s,a) = 0$ for all $s \in \mathcal{S}$ and $a \in \mathcal{A}$
3: Initialize $s_0$
4: Initialize $a_0$: Choose $a_0$ when in $s_0$ using a policy derived from $Q$ (e.g., $\epsilon$-greedy)
5: for $t = 0, 1, \ldots, T-1$ do
6: Take action $a_t$, observe reward $r_t$ and the next state $s_{t+1}$
7: Choose $a_{t+1}$ when in $s_{t+1}$ using a policy derived from $Q$ (e.g., $\epsilon$-greedy)
8: $Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \beta\big(r_t + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)\big)$
9: end for
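A matching SARSA sketch on the same kind of made-up toy environment; the only substantive change from the Q-learning sketch is the target, which uses the action actually sampled at the next state rather than the max (on-policy).

```python
import numpy as np

n_states, n_actions, gamma = 5, 2, 0.9
rng = np.random.default_rng(2)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # toy dynamics
R = rng.normal(size=(n_states, n_actions))                         # toy rewards

def eps_greedy(Q, s, eps):
    return rng.integers(n_actions) if rng.random() < eps else int(Q[s].argmax())

def sarsa(n_iters=100_000, beta=0.1, eps=0.1):
    """On-policy TD control: the target uses the action actually chosen at s'."""
    Q = np.zeros((n_states, n_actions))
    s = 0
    a = eps_greedy(Q, s, eps)
    for _ in range(n_iters):
        r = R[s, a]
        s_next = rng.choice(n_states, p=P[s, a])
        a_next = eps_greedy(Q, s_next, eps)          # sampled from the same policy
        Q[s, a] += beta * (r + gamma * Q[s_next, a_next] - Q[s, a])
        s, a = s_next, a_next
    return Q

print(sarsa().argmax(axis=1))
```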

On-policy Learning vs. Off-policy Learning

  • An off-policy agent learns the value of the optimal policy independently of the agent's actions.
  • An on-policy agent learns the value of the policy being carried out by the agent including the exploration steps.

Policy-based Methods

  • parametrization
    • the policy is parametrized by a vector $\theta$ and written $\pi_{\theta}$
    • $\pi_{\theta}(\cdot \mid s_t)$: the probability distribution parameterized by $\theta$ over the action space given the state $s_t$ at time $t$
    • the policy objective function for RL with infinite time horizon
      $J(\theta) = \mathbb{E}^{\pi_{\theta}}\Big[\sum_{t=0}^{\infty} \gamma^{t} r(s_t, a_t)\Big]$

    • the policy objective function for RL with finite time horizon
      $J(\theta) = \mathbb{E}^{\pi_{\theta}}\Big[\sum_{t=0}^{T-1} r_t(s_t, a_t) + r_T(s_T)\Big]$

  • the policy parameter $\theta$ is updated using the gradient ascent rule to maximize $J(\theta)$:
    $\theta_{k+1} = \theta_{k} + \alpha\, \nabla_{\theta} J(\theta_{k})$

Policy Gradient Theorem

Theorem (Policy Gradient Theorem). Assume that $J(\theta)$ is differentiable with respect to $\theta$ and that there exists $\mu^{\pi_{\theta}}$, the stationary distribution of the dynamics under policy $\pi_{\theta}$, which is independent of the initial state $s_0$. Then the policy gradient is
$\nabla_{\theta} J(\theta) = \mathbb{E}_{s \sim \mu^{\pi_{\theta}},\, a \sim \pi_{\theta}(\cdot \mid s)}\big[\, Q^{\pi_{\theta}}(s,a)\, \nabla_{\theta} \ln \pi_{\theta}(a \mid s)\,\big]$

REINFORCE: Monte Carlo Policy Gradient

Algorithm 4 REINFORCE: Monte Carlo Policy Gradient
1: Input: total number of iterations $K$, learning rate $\alpha$, a differentiable policy parametrization $\pi_{\theta}$, finite length of the trajectory $T$.
2: Initialize policy parameter $\theta$.
3: for $k = 0, 1, \ldots, K-1$ do
4: Sample a trajectory $(s_0, a_0, r_0, \ldots, s_{T-1}, a_{T-1}, r_{T-1})$ under $\pi_{\theta}$
5: Initialize $G = 0$
6: for $t = T-1, T-2, \ldots, 0$ do
7: Calculate the return: $G \leftarrow \gamma G + r_t$
8: Update the policy parameter $\theta \leftarrow \theta + \alpha\, \gamma^{t} G\, \nabla_{\theta} \ln \pi_{\theta}(a_t \mid s_t)$
9: end for
10: end for
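A minimal REINFORCE sketch with a tabular softmax policy on a made-up toy MDP; the return is accumulated backwards along each sampled trajectory, and the log-likelihood gradient of the softmax parameterization is computed in closed form.

```python
import numpy as np

n_states, n_actions, gamma, horizon = 4, 2, 0.99, 20
rng = np.random.default_rng(3)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # toy dynamics
R = rng.normal(size=(n_states, n_actions))                         # toy rewards

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def reinforce(n_episodes=2000, alpha=0.01):
    theta = np.zeros((n_states, n_actions))          # tabular softmax policy
    for _ in range(n_episodes):
        # sample one trajectory of fixed length `horizon`
        s, traj = 0, []
        for _ in range(horizon):
            a = rng.choice(n_actions, p=softmax(theta[s]))
            traj.append((s, a, R[s, a]))
            s = rng.choice(n_states, p=P[s, a])
        # Monte Carlo returns and log-likelihood-ratio updates
        G = 0.0
        for t, (s_t, a_t, r_t) in reversed(list(enumerate(traj))):
            G = r_t + gamma * G                      # return from time t onwards
            grad_log = -softmax(theta[s_t])          # d/dtheta[s_t,:] log pi(a_t|s_t)
            grad_log[a_t] += 1.0
            theta[s_t] += alpha * (gamma ** t) * G * grad_log
    return theta

theta = reinforce()
print(softmax(theta[0]))   # learned action distribution in state 0
```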

Actor-Critic Methods

Algorithm 5 Actor-Critic Algorithm
1: Input: A differentiable policy parametrization $\pi_{\theta}$, a differentiable $Q$-function parameterization $Q_{w}$, learning rates $\alpha^{\theta}$ and $\alpha^{w}$, number of iterations $T$
2: Initialize policy parameter $\theta$ and $Q$-function parameter $w$
3: Initialize $s_0$
4: for $t = 0, 1, \ldots, T-1$ do
5: Sample $a_t \sim \pi_{\theta}(\cdot \mid s_t)$
6: Take action $a_t$, observe state $s_{t+1}$ and reward $r_t$
7: Sample action $a_{t+1} \sim \pi_{\theta}(\cdot \mid s_{t+1})$
8: Compute the TD error: $\delta_t = r_t + \gamma Q_{w}(s_{t+1}, a_{t+1}) - Q_{w}(s_t, a_t)$
9: Update $w \leftarrow w + \alpha^{w}\, \delta_t\, \nabla_{w} Q_{w}(s_t, a_t)$
10: $\theta \leftarrow \theta + \alpha^{\theta}\, Q_{w}(s_t, a_t)\, \nabla_{\theta} \ln \pi_{\theta}(a_t \mid s_t)$
11: $s_{t} \leftarrow s_{t+1}$, $a_{t} \leftarrow a_{t+1}$
12: end for
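A minimal tabular actor-critic sketch following the structure of Algorithm 5: a softmax actor and a tabular Q critic updated with the TD error; the toy environment and step sizes are illustrative only.

```python
import numpy as np

n_states, n_actions, gamma = 4, 2, 0.95
rng = np.random.default_rng(4)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # toy dynamics
R = rng.normal(size=(n_states, n_actions))                         # toy rewards

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def actor_critic(n_iters=100_000, alpha_theta=0.01, alpha_w=0.1):
    theta = np.zeros((n_states, n_actions))   # actor: tabular softmax policy
    Qw = np.zeros((n_states, n_actions))      # critic: tabular Q-function
    s = 0
    a = rng.choice(n_actions, p=softmax(theta[s]))
    for _ in range(n_iters):
        r = R[s, a]
        s_next = rng.choice(n_states, p=P[s, a])
        a_next = rng.choice(n_actions, p=softmax(theta[s_next]))
        delta = r + gamma * Qw[s_next, a_next] - Qw[s, a]   # TD error
        Qw[s, a] += alpha_w * delta                          # critic update
        grad_log = -softmax(theta[s])                        # actor update direction
        grad_log[a] += 1.0
        theta[s] += alpha_theta * Qw[s, a] * grad_log
        s, a = s_next, a_next
    return theta, Qw

theta, Qw = actor_critic()
print(theta.argmax(axis=1))
```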

Deep Reinforcement Learning

Parametrize value functions ($V$ and $Q$) or policies ($\pi$) via (deep) neural networks.

Neural Networks

  • Fully-connected Neural Networks (FNN)
    • the simplest neural network architecture where any given neuron is connected to all neurons in the previous layer
    • the functional form of an $L$-layer FNN is (see the sketch after this list)
      $F(x) = W_{L}\,\sigma\big(W_{L-1}\,\sigma(\cdots \sigma(W_{1} x + b_{1}) \cdots) + b_{L-1}\big) + b_{L}$,
      where the $W_{l}$ are weight matrices, the $b_{l}$ are bias vectors, and $\sigma(\cdot)$ is a componentwise activation function
  • Convolutional Neural Networks
    • a kind of feed-forward neural network that is especially popular for image processing
    • in the finance setting CNNs have been successfully applied to price prediction based on inputs which are images containing visualizations of price dynamics and trading volumes
    • two main building blocks
      • convolutional layers: capture local patterns in the images
      • pooling layers: reduce the dimension of the problem and improve the computational efficiency
  • Recurrent Neural Networks
    • are widely used in processing sequential data, including speech, text and financial time series data
    • RNNs are a class of artificial neural networks where connections between units form a directed cycle
    • RNNs can use their internal memory to process arbitrary sequences of inputs and hence are applicable to tasks such as sequential data processing.
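A minimal NumPy sketch of the fully connected forward pass described above; the layer widths, ReLU activation, and random weights are illustrative choices.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def fnn_forward(x, weights, biases):
    """Compute W_L * sigma(... sigma(W_1 x + b_1) ...) + b_L."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = relu(W @ h + b)              # hidden layers with ReLU activation
    return weights[-1] @ h + biases[-1]  # linear output layer

rng = np.random.default_rng(0)
sizes = [3, 16, 16, 1]                   # input dim 3, two hidden layers, scalar output
weights = [rng.normal(size=(m, n)) * 0.1 for n, m in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(m) for m in sizes[1:]]
print(fnn_forward(rng.normal(size=3), weights, biases))
```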

Deep Value-based RL Algorithms

  • Neural Fitted $Q$-learning

    • fitted $Q$-learning: a generalization of the classical $Q$-learning algorithm with functional approximations
    • it is applied in an off-line setting with a pre-collected dataset in the form of tuples $(s, a, r, s')$ with $s' \sim P(\cdot \mid s, a)$ and $r = r(s, a)$
    • when the class of approximation functionals is constrained to neural networks, the algorithm is referred to as Neural Fitted $Q$-learning
  • Deep Q-Network (DQN)

    • the slow update of the target network
    • the use of 'experience replay'
  • Double Deep $Q$-Network (Double DQN)

    • decouples the action selection from the action evaluation (see the sketch below)
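A minimal sketch contrasting the DQN target with the Double DQN target on a random placeholder mini-batch; `q_net` and `q_target` are made-up stand-ins for the online and (slowly updated) target networks.

```python
import numpy as np

gamma = 0.99
rng = np.random.default_rng(0)

# Stand-in networks: in practice these would be neural networks with weights
# theta (online) and a slowly updated target copy.
def q_net(s_batch):      # online network Q(s, .)
    return rng.normal(size=(len(s_batch), 4))

def q_target(s_batch):   # target network Q(s, .)
    return rng.normal(size=(len(s_batch), 4))

s_next = np.arange(32)            # a mini-batch of next states (placeholders)
r = rng.normal(size=32)           # rewards
done = rng.random(32) < 0.1       # episode-termination flags

# DQN target: the max over actions is taken under the *target* network.
y_dqn = r + gamma * (1 - done) * q_target(s_next).max(axis=1)

# Double DQN target: the online network *selects* the action,
# the target network *evaluates* it (decoupling selection from evaluation).
a_star = q_net(s_next).argmax(axis=1)
y_double = r + gamma * (1 - done) * q_target(s_next)[np.arange(32), a_star]
```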

Deep Policy-based RL Algorithms

  • Parameterize the policy by a neural network with parameter $\theta$
    • $\pi_{\theta}(a \mid s)$ is expressed in terms of some function $f_{\theta}(s,a)$ given by a neural network
    • A popular choice of $\pi_{\theta}$ is the softmax parameterization
      $\pi_{\theta}(a \mid s) = \frac{\exp\big(f_{\theta}(s,a)\big)}{\sum_{a' \in \mathcal{A}} \exp\big(f_{\theta}(s,a')\big)}$

  • The policy parameter is updated using the gradient ascent rule given by
    $\theta_{k+1} = \theta_{k} + \alpha\, \nabla_{\theta} J(\theta_{k})$

Algorithm 6 The DDPG Algorithm
1: Input: an actor $\mu_{\theta}$, a critic network $Q_{w}$, learning rates $\alpha^{\theta}$ and $\alpha^{w}$, initial parameters $\theta_0$ and $w_0$
2: Initialize target network parameters $\theta' \leftarrow \theta_0$ and $w' \leftarrow w_0$
3: Initialize replay buffer $\mathcal{D}$
4: for each episode do
5: Initialize state $s_0$
6: for $t = 0, 1, \ldots, T-1$ do
7: Select action $a_t = \mu_{\theta}(s_t) + \epsilon_t$ with exploration noise $\epsilon_t$
8: Execute action $a_t$ and observe reward $r_t$ and observe new state $s_{t+1}$
9: Store transition $(s_t, a_t, r_t, s_{t+1})$ in $\mathcal{D}$
10: if it is time to update then
11: Sample a random mini-batch of $N$ transitions $\{(s_i, a_i, r_i, s_{i+1})\}_{i=1}^{N}$ from $\mathcal{D}$
12: Set the target $y_i = r_i + \gamma\, Q_{w'}\big(s_{i+1}, \mu_{\theta'}(s_{i+1})\big)$.
13: Update the critic by minimizing the loss:
$L(w) = \frac{1}{N}\sum_{i=1}^{N}\big(y_i - Q_{w}(s_i, a_i)\big)^{2}$ with $w \leftarrow w - \alpha^{w}\, \nabla_{w} L(w)$
14: Update the actor by using the sampled policy gradient:
$\nabla_{\theta} J \approx \frac{1}{N}\sum_{i=1}^{N} \nabla_{a} Q_{w}(s_i, a)\big|_{a = \mu_{\theta}(s_i)}\, \nabla_{\theta} \mu_{\theta}(s_i)$ with $\theta \leftarrow \theta + \alpha^{\theta}\, \nabla_{\theta} J$
15: Update the target networks:
$w' \leftarrow \tau w + (1-\tau) w', \qquad \theta' \leftarrow \tau \theta + (1-\tau) \theta'$
16: end if
17: end for
18: end for

Applications in Finance

Optimal Execution

  • the problem
    • a trader who wishes to buy or sell a given amount of a single asset within a given time period
    • the trader seeks strategies that maximize their return from, or alternatively, minimize the cost of, the execution of the transaction
  • The Almgren-Chriss Model
    • a trader is required to sell an amount $q_0$ of an asset, with price $S_0$ at time 0, over the time period $[0, T]$ with trading decisions made at discrete time points $t_k = k\tau$, $k = 1, \ldots, N$, where $\tau = T/N$
    • The final inventory is required to be zero
    • the goal is to determine the liquidation strategy, i.e. the amounts $n_k$ sold in each interval
    • two types of price impact
      • a temporary impact which refers to any temporary price movement due to the supply-demand imbalance caused by the selling
      • a permanent impact, which is a long-term effect on the 'equilibrium' price due to the trading
    • asset price dynamics
      $S_k = S_{k-1} + \sigma \sqrt{\tau}\, \xi_k - \tau\, g\!\big(n_k/\tau\big)$

      • $\sigma$ is the (constant) volatility parameter
      • $\xi_k$ are independent random variables drawn from a distribution with zero mean and unit variance
      • $g$ is a function of the trading rate $n_k/\tau$ that measures the permanent impact
    • The inventory process
      $q_k = q_{k-1} - n_k$

    • the actual price per share received considering the temporary price impact $h$
      $\tilde{S}_k = S_{k-1} - h\!\big(n_k/\tau\big)$

  • The cost of this trading trajectory
    • the difference between the initial book value and the revenue, given by $C = q_0 S_0 - \sum_{k=1}^{N} n_k \tilde{S}_k$
    • mean $\mathbb{E}[C]$ and variance $\mathrm{Var}[C]$

  • the trader's objective function
    $\min_{\{n_k\}} \; \mathbb{E}[C] + \lambda\, \mathrm{Var}[C]$

  • Solution
    • with linear price impacts $g(v) = \gamma v$ and $h(v) = \eta v$

    • the general solution for the Almgren-Chriss model is the trade schedule
      $n_j = \frac{2 \sinh(\kappa \tau / 2)}{\sinh(\kappa T)} \cosh\!\big(\kappa (T - t_{j-1/2})\big)\, q_0, \quad j = 1, \ldots, N,$

with
      $\cosh(\kappa \tau) = 1 + \frac{\lambda \sigma^{2} \tau^{2}}{2 \tilde{\eta}}, \qquad \tilde{\eta} = \eta\Big(1 - \frac{\gamma \tau}{2 \eta}\Big), \qquad t_{j-1/2} = \big(j - \tfrac{1}{2}\big)\tau$

    • The corresponding optimal inventory trajectory is
      $x_j = \frac{\sinh\!\big(\kappa (T - t_j)\big)}{\sinh(\kappa T)}\, q_0, \quad j = 0, 1, \ldots, N$
      (see the sketch after this list)

  • Evaluation Criteria

    • the Profit and Loss (PnL): the final profit or loss induced by a given execution algorithm over the whole time period, which is made up of transactions at all time points
    • the Implementation Shortfall: the difference between the PnL of the algorithm and the PnL received by trading the entire amount of the asset instantly
    • the Sharpe ratio: the ratio of expected return to standard deviation of the return
  • Benchmark Algorithms

    • Time-Weighted Average Price (TWAP)
    • Volume-Weighted Average Price (VWAP)
    • Submit and Leave (SnL) policy
  • RL Approach
    • RL methods
      • -learning algorithms
      • (double) DQN
      • Policy-based algorithms: (deep) policy gradient methods (II1, A2C, PPO, and DDPG)
    • states: time stamp, the market attributes including (mid-) price of the asset and/or the spread, the inventory process and past returns
    • controls: the amount of asset (using market orders) to trade and/or the relative price level (using limit orders) at each time point
    • rewards
      • cash inflow or outflow (depending on whether we sell or buy)
      • implementation shortfall
      • profit
      • Sharpe ratio
      • return
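A minimal sketch of the closed-form Almgren-Chriss schedule under linear impact, following the solution quoted above; the parameter values are illustrative and not calibrated to any market.

```python
import numpy as np

def almgren_chriss_trajectory(q0, T, N, sigma, eta, gamma_perm, lam):
    """Closed-form optimal liquidation schedule under linear price impact.

    Returns (x, n): x[j] is the inventory held after trade j (x[0] = q0,
    x[N] = 0) and n[j] is the number of shares sold in interval j.
    """
    tau = T / N
    eta_tilde = eta * (1.0 - gamma_perm * tau / (2.0 * eta))   # adjusted temporary impact
    kappa = np.arccosh(1.0 + lam * sigma**2 * tau**2 / (2.0 * eta_tilde)) / tau
    t = tau * np.arange(N + 1)
    x = q0 * np.sinh(kappa * (T - t)) / np.sinh(kappa * T)     # optimal inventory path
    n = -np.diff(x)                                             # shares sold per interval
    return x, n

# Illustrative parameters only (not calibrated to any market)
x, n = almgren_chriss_trajectory(q0=1e6, T=1.0, N=10, sigma=0.3,
                                 eta=2.5e-6, gamma_perm=2.5e-7, lam=1e-6)
print(np.round(x, 0))
```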

(multi-period mean-variance) Portfolio Optimization

  • setting
    • risky assets in the market
    • an investor enters the market at time 0 with initial wealth
    • reallocate his wealth at each time point among the assets to achieve the optimal trade-off between the return and the risk of the investment
    • the random rates of return of the assets at :
      • The vectors , are assumed to be statistically independent
      • mean:
      • standard deviation: for and
      • The covariance matrix:
    • the wealth of the investor at time :
    • the amount invested in the -th asset at time :
    • the amount invested in the -th asset:
    • investment strategy: for
  • the model
    • objective

    • constraints

  • the solution

  • RL Approach
    • RL methods
      • value-based methods (Q-learning, SARSA, and DQN)
      • policy-based algorithms (DPG and DDPG)
    • states: time, asset prices, asset past returns, current holdings of assets, and remaining balance
    • controls: the amount/proportion of wealth invested in each component of the portfolio
    • rewards
      • portfolio return
      • (differential) Sharpe ratio
      • profit

Option Pricing and Hedging

  • The Black-Scholes Model
    • The underlying stock price follows a geometric Brownian motion
      $dS_t = \mu S_t\, dt + \sigma S_t\, dW_t$

    • the Black-Scholes-Merton partial differential equation for the option value $V(t, S)$
      $\frac{\partial V}{\partial t} + \frac{1}{2}\sigma^{2} S^{2} \frac{\partial^{2} V}{\partial S^{2}} + r S \frac{\partial V}{\partial S} - r V = 0$

    • solution (European call with strike $K$ and maturity $T$; see the sketch after this list)
      $C(t, S) = S\, \Phi(d_1) - K e^{-r(T-t)}\, \Phi(d_2), \qquad d_1 = \frac{\ln(S/K) + (r + \sigma^{2}/2)(T-t)}{\sigma\sqrt{T-t}}, \qquad d_2 = d_1 - \sigma\sqrt{T-t},$
      where $\Phi$ is the standard normal cumulative distribution function

  • RL Approach
    • RL methods: (deep) Q-learning, PPO, and DDPG
    • states: asset price, current positions, option strikes, and time remaining to expiry.
    • controls: the change in holdings
    • rewards
      • (risk-adjusted) expected wealth/return
      • option payoff
      • (risk-adjusted) hedging cost
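A minimal sketch of the Black-Scholes call price and delta referenced above, useful as the classical benchmark against which RL hedging policies are compared; parameters are illustrative.

```python
from math import log, sqrt, exp, erf

def norm_cdf(x):
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def bs_call_price_delta(S, K, T, r, sigma):
    """Black-Scholes price and delta of a European call with maturity T (in years)."""
    d1 = (log(S / K) + (r + 0.5 * sigma**2) * T) / (sigma * sqrt(T))
    d2 = d1 - sigma * sqrt(T)
    price = S * norm_cdf(d1) - K * exp(-r * T) * norm_cdf(d2)
    delta = norm_cdf(d1)          # hedge ratio: shares of stock per option
    return price, delta

# Illustrative parameters only
print(bs_call_price_delta(S=100.0, K=100.0, T=0.5, r=0.01, sigma=0.2))
```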

Market Making

  • The objective in market making is to profit from earning the bid-ask spread without accumulating undesirably large positions (known as inventory)
  • A market maker faces three major sources of risk
    • The inventory risk: the risk of accumulating an undesirably large net inventory, which significantly increases volatility due to market movements
    • The execution risk: the risk that limit orders may not get filled over a desired horizon
    • the adverse selection risk: the situation where there is a directional price movement that sweeps through the limit orders submitted by the market maker such that the price does not revert back by the end of the trading horizon.
  • Stochastic Control Approach.

    • a high-frequency market maker trading on a single stock over a finite horizon $[0, T]$
    • the mid-price of this stock follows an arithmetic Brownian motion
      $dS_t = \sigma\, dW_t$

    • The market maker will continuously propose bid and ask prices, $S^{b}_t$ and $S^{a}_t$ respectively
    • She will buy and sell shares according to the rate of arrival of market orders at the quoted prices
    • Her inventory is
      $q_t = N^{b}_t - N^{a}_t,$
      where $N^{b}_t$ and $N^{a}_t$ are the numbers of shares bought and sold up to time $t$

    • quoted prices: $S^{b}_t = S_t - \delta^{b}_t$ and $S^{a}_t = S_t + \delta^{a}_t$
    • the intensities $\lambda^{b}$ and $\lambda^{a}$ of the arrival of buy and sell market orders
      • depend on the differences between the quoted prices and the reference price (i.e. $\delta^{b}_t$ and $\delta^{a}_t$)
      • the functional form
        $\lambda^{b}(\delta^{b}) = A e^{-k \delta^{b}}, \qquad \lambda^{a}(\delta^{a}) = A e^{-k \delta^{a}}$

    • the cash process of the market maker
      $dX_t = (S_t + \delta^{a}_t)\, dN^{a}_t - (S_t - \delta^{b}_t)\, dN^{b}_t$

    • the market maker optimizes a constant absolute risk aversion (CARA) utility function:
      $\max_{\{\delta^{a}_t\},\, \{\delta^{b}_t\}} \; \mathbb{E}\big[-\exp\big(-\gamma (X_T + q_T S_T)\big)\big]$

    • the value function $u(t, x, s, q)$ solves the following Hamilton-Jacobi-Bellman equation (see the sketch after this list for the well-known approximate quotes):
      $\partial_t u + \tfrac{1}{2}\sigma^{2}\, \partial_{ss} u + \max_{\delta^{b}} \lambda^{b}(\delta^{b})\big[u(t, x - s + \delta^{b}, s, q+1) - u(t,x,s,q)\big] + \max_{\delta^{a}} \lambda^{a}(\delta^{a})\big[u(t, x + s + \delta^{a}, s, q-1) - u(t,x,s,q)\big] = 0,$
      with terminal condition $u(T, x, s, q) = -\exp\big(-\gamma(x + q s)\big)$
  • RL Approach
    • RL methods
      • value-based methods (Q-learning algorithm and SARSA)
      • policy-based methods (deep policy gradient method)
    • states: bid and ask prices, current holdings of assets, order-flow imbalance, volatility, and some sophisticated market indices
    • controls: the spread to post a pair of limit buy and limit sell orders
    • rewards
      • PnL with inventory cost
      • Implementation Shortfall with inventory cost
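A minimal sketch of the well-known approximate closed-form quotes for the Avellaneda-Stoikov problem above (a reservation price shifted against inventory plus a symmetric spread); this approximation, and the parameter values, are assumptions for illustration rather than the survey's own algorithm.

```python
import numpy as np

def avellaneda_stoikov_quotes(s, q, t, T, sigma, gamma, k):
    """Approximate optimal bid/ask quotes for the CARA market-making problem.

    s: mid-price, q: current (signed) inventory, t: current time, T: horizon,
    sigma: mid-price volatility, gamma: risk aversion, k: intensity decay.
    """
    # Reservation price: shifts the quotes against the current inventory.
    r = s - q * gamma * sigma**2 * (T - t)
    # Total spread posted around the reservation price.
    spread = gamma * sigma**2 * (T - t) + (2.0 / gamma) * np.log(1.0 + gamma / k)
    bid = r - spread / 2.0
    ask = r + spread / 2.0
    return bid, ask

# Illustrative parameters only
print(avellaneda_stoikov_quotes(s=100.0, q=3, t=0.0, T=1.0,
                                sigma=2.0, gamma=0.1, k=1.5))
```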

Robo-advising

  • Stochastic Control Approach
    • the framework
      • a regime switching model of market returns
      • a mechanism of interaction between the client and the robo-advisor
      • a dynamic model (i.e., risk aversion process) for the client's risk preferences
      • an optimal investment criterion
    • the robo-advisor interacts repeatedly with the client and learns about changes in her risk profile
    • The robo-advisor adopts a multi-period mean-variance investment criterion with a finite investment horizon based on the estimate of the client's risk aversion level
  • RL Approach
    • an exploration-exploitation algorithm to learn the investor's risk appetite over time by observing her portfolio choices in different market environments

      • states: The set of various market environments of interest is formulated as the state space
      • controls: places an investor's capital into one of several pre-constructed portfolios
    • an investment robo-advising framework consisting of two agents

      • an inverse portfolio optimization agent, which infers an investor's risk preference and expected return directly from historical allocation data using online inverse optimization
      • a deep RL agent, which aggregates the inferred sequence of expected returns to formulate a new multi-period mean-variance portfolio optimization problem

Smart Order Routing

  • Dark Pools vs. Lit Pools
    • Dark pools are private exchanges for trading securities that are not accessible by the investing public
      • Dark pools were created in order to facilitate block trading by institutional investors who did not wish to impact the markets with their large orders and obtain adverse prices for their trades
      • three types of dark pools: (1) Broker-Dealer-Owned Dark Pools, (2) Agency Broker or Exchange-Owned Dark Pools, and (3) Electronic Market Makers Dark Pools.
    • Lit pools do display bid offers and ask offers in different stocks
      • primary exchanges operate in such a way that available liquidity is displayed at all times and form the bulk of the lit pools available to traders.
  • the most important characteristics of different dark pools
    • the chances of being matched with a counterparty
    • the price (dis)advantages
  • characteristics of lit pools
    • the order flows
    • queue sizes
    • cancellation rates

Further Developments for Mathematical Finance and Reinforcement Learning

  • Offline Learning and Online Exploration

  • Learning with a Limited Exploration Budget

  • Learning with Multiple Objectives

  • Learning to Allocate Across Lit Pools and Dark Pools

  • Robo-advising in a Model-free Setting

  • Sample Efficiency in Learning Trading Strategies

  • Transfer Learning and Cold Start for Learning New Assets

More on RL in Finance

Rao, A., & Jelvis, T. (2022). Foundations of Reinforcement Learning with Applications in Finance. CRC Press.
