We write the value function as
\[
V^{\pi}(s) = \mathbb{E}^{\pi}\Big[\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t) \,\Big|\, s_0 = s\Big],
\]
where the expectation is taken over trajectories generated by $a_t \sim \pi(\cdot \mid s_t)$ and $s_{t+1} \sim P(\cdot \mid s_t, a_t)$.

Finite Time Horizon

In the finite-horizon setting the agent interacts with the environment for $T$ steps, subject to
\[
s_{t+1} \sim P_t(\cdot \mid s_t, a_t), \qquad a_t \sim \pi_t(\cdot \mid s_t), \qquad t = 0, 1, \ldots, T-1 .
\]
We write the value function as
\[
V^{\pi}_{t}(s) = \mathbb{E}^{\pi}\Big[\sum_{u=t}^{T-1} r_u(s_u, a_u) + r_T(s_T) \,\Big|\, s_t = s\Big]
\]
for all $s \in \mathcal{S}$ and $t \in \{0, 1, \ldots, T\}$.
Theorem Assume
Typically the features are assumed to be known to the agent and bounded, namely, $\|\phi(s,a)\| \le 1$ for all $(s,a) \in \mathcal{S} \times \mathcal{A}$.
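For concreteness, a common way to use such features is a linear parametrization of the action-value function; the parameter symbol $\theta$ and the dimension $d$ below are illustrative notation rather than the paper's:
\[
Q_{\theta}(s,a) = \phi(s,a)^{\top} \theta, \qquad \theta \in \mathbb{R}^{d}, \qquad \|\phi(s,a)\|_2 \le 1 .
\]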
Simulator
Randomized Policy vs. Deterministic Policy
Sample complexity: the total number of samples required to find an approximately optimal policy
Sample complexity for episodic RL
The agent can update her estimate of the value function after observing each transition $(s_t, a_t, r_t, s_{t+1})$ via the temporal-difference (TD) rule
\[
V(s_t) \leftarrow V(s_t) + \alpha\big(r_t + \gamma V(s_{t+1}) - V(s_t)\big),
\]
where $\alpha \in (0,1]$ is the learning rate and $r_t + \gamma V(s_{t+1}) - V(s_t)$ is the TD error. The TD update can also be written as
\[
V(s_t) \leftarrow (1-\alpha)\, V(s_t) + \alpha\big(r_t + \gamma V(s_{t+1})\big),
\]
a weighted average of the current estimate and the one-step bootstrapped target.
Algorithm 1 (TD(0) learning): take the total number of iterations as input, initialize the value-function estimate and the initial state, and then, at each iteration, act, observe the reward and the next state, and apply the TD update above.
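A minimal tabular sketch of such a TD(0) loop; the environment interface (`env.reset()`, `env.step()`), the fixed learning rate, and the `policy` function are illustrative assumptions rather than the paper's specification.

```python
import numpy as np

def td0_evaluation(env, policy, num_states, num_iters, alpha=0.1, gamma=0.99):
    """Tabular TD(0) policy evaluation.

    Assumes env.reset() returns an integer state and env.step(action)
    returns (next_state, reward, done) -- an illustrative interface,
    not a specific library's API.  policy(state) returns the action taken.
    """
    V = np.zeros(num_states)              # value-function estimate
    s = env.reset()                       # initial state
    for _ in range(num_iters):
        a = policy(s)                     # act according to the evaluated policy
        s_next, r, done = env.step(a)
        # TD(0) update: move V(s) towards the bootstrapped target r + gamma * V(s')
        V[s] += alpha * (r + gamma * V[s_next] - V[s])
        s = env.reset() if done else s_next
    return V
```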
Algorithm 2 (Q-learning): take the total number of iterations as input, initialize the Q-function estimate and the initial state, and then, at each iteration, choose an action from a behavior (e.g. $\varepsilon$-greedy) policy, observe the reward and the next state, and update $Q(s_t, a_t)$ towards the target $r_t + \gamma \max_{a'} Q(s_{t+1}, a')$.
Theorem (Watkins and Dayan (1992)). Assume that the rewards are bounded, that every state-action pair is visited infinitely often, and that the learning rates satisfy $\sum_{t} \alpha_t(s,a) = \infty$ and $\sum_{t} \alpha_t(s,a)^2 < \infty$ for all $(s,a)$; then $Q_t(s,a) \to Q^{*}(s,a)$ almost surely for all $(s,a)$ as $t \to \infty$.
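A tabular Q-learning sketch in the spirit of Algorithm 2; the $\varepsilon$-greedy behavior policy, the constant learning rate, and the environment interface are illustrative assumptions.

```python
import numpy as np

def q_learning(env, num_states, num_actions, num_iters,
               alpha=0.1, gamma=0.99, epsilon=0.1, seed=0):
    """Tabular Q-learning with an epsilon-greedy behavior policy.

    Assumes env.reset() -> state and env.step(a) -> (next_state, reward, done);
    this interface is an illustrative assumption.
    """
    rng = np.random.default_rng(seed)
    Q = np.zeros((num_states, num_actions))
    s = env.reset()
    for _ in range(num_iters):
        # epsilon-greedy exploration over the current Q estimate
        if rng.random() < epsilon:
            a = int(rng.integers(num_actions))
        else:
            a = int(np.argmax(Q[s]))
        s_next, r, done = env.step(a)
        # off-policy target: uses the greedy value at the next state
        target = r + (0.0 if done else gamma * np.max(Q[s_next]))
        Q[s, a] += alpha * (target - Q[s, a])
        s = env.reset() if done else s_next
    return Q
```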
SARSA adopts a policy which is based on the agent's current estimate of the Q-function (for example, an $\varepsilon$-greedy policy) and updates that estimate using the action actually taken in the next state, which makes it an on-policy method.
Algorithm 3 (SARSA: on-policy TD learning): take the total number of iterations as input, initialize the Q-function estimate, the initial state, and the initial action, and then, at each iteration, observe the reward and the next state, choose the next action from the current policy, and update $Q(s_t, a_t)$ towards the target $r_t + \gamma Q(s_{t+1}, a_{t+1})$.
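A tabular SARSA sketch along these lines; again the $\varepsilon$-greedy policy, learning rate, and environment interface are illustrative assumptions.

```python
import numpy as np

def sarsa(env, num_states, num_actions, num_iters,
          alpha=0.1, gamma=0.99, epsilon=0.1, seed=0):
    """Tabular SARSA (on-policy TD learning) with an epsilon-greedy policy.

    The env.reset()/env.step() interface is an illustrative assumption.
    """
    rng = np.random.default_rng(seed)
    Q = np.zeros((num_states, num_actions))

    def eps_greedy(state):
        if rng.random() < epsilon:
            return int(rng.integers(num_actions))
        return int(np.argmax(Q[state]))

    s = env.reset()
    a = eps_greedy(s)                      # initial action from the current policy
    for _ in range(num_iters):
        s_next, r, done = env.step(a)
        a_next = eps_greedy(s_next)        # next action chosen by the same policy
        # on-policy target: uses the action actually taken at the next state
        target = r + (0.0 if done else gamma * Q[s_next, a_next])
        Q[s, a] += alpha * (target - Q[s, a])
        if done:
            s = env.reset()
            a = eps_greedy(s)
        else:
            s, a = s_next, a_next
    return Q
```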
On-policy Learning vs. Off-policy Learning
Theorem (Policy Gradient Theorem). Assume that the parametrized policy $\pi_{\theta}(a \mid s)$ is differentiable with respect to $\theta$. Then the gradient of the expected return $J(\theta)$ satisfies
\[
\nabla_{\theta} J(\theta) = \mathbb{E}_{\pi_{\theta}}\big[\nabla_{\theta} \log \pi_{\theta}(a \mid s)\, Q^{\pi_{\theta}}(s,a)\big],
\]
where the expectation is taken over the (discounted) state-action distribution induced by $\pi_{\theta}$.
Algorithm 4 (REINFORCE: Monte Carlo policy gradient): take the total number of iterations and a step size as input, initialize the policy parameter $\theta$, and then, at each iteration, generate a full episode by following $\pi_{\theta}$, compute the return $G_t$ from each time step of the episode, and update $\theta$ in the direction of $G_t \nabla_{\theta} \log \pi_{\theta}(a_t \mid s_t)$.
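A sketch of such a Monte Carlo policy-gradient loop with a tabular softmax policy; the parametrization, step size, episode cap, and environment interface are illustrative assumptions.

```python
import numpy as np

def reinforce(env, num_states, num_actions, num_episodes,
              lr=0.01, gamma=0.99, max_steps=1000, seed=0):
    """REINFORCE with a tabular softmax policy pi_theta(a|s) ~ exp(theta[s, a]).

    env.reset()/env.step() returning (next_state, reward, done) is an
    illustrative assumption.
    """
    rng = np.random.default_rng(seed)
    theta = np.zeros((num_states, num_actions))   # policy parameters (logits)

    def policy_probs(state):
        logits = theta[state] - np.max(theta[state])   # shift for numerical stability
        p = np.exp(logits)
        return p / p.sum()

    for _ in range(num_episodes):
        # 1. Generate one episode under the current policy.
        states, actions, rewards = [], [], []
        s = env.reset()
        for _ in range(max_steps):
            p = policy_probs(s)
            a = int(rng.choice(num_actions, p=p))
            s_next, r, done = env.step(a)
            states.append(s)
            actions.append(a)
            rewards.append(r)
            s = s_next
            if done:
                break
        # 2. Compute returns G_t and take a gradient step at each visited pair.
        G = 0.0
        for t in reversed(range(len(rewards))):
            G = rewards[t] + gamma * G
            s_t, a_t = states[t], actions[t]
            grad_log = -policy_probs(s_t)      # softmax score function:
            grad_log[a_t] += 1.0               # onehot(a_t) - pi(.|s_t)
            theta[s_t] += lr * (gamma ** t) * G * grad_log
    return theta
```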
Algorithm 5 (actor-critic): take a differentiable policy parametrization (the actor) and a differentiable value-function parametrization (the critic) as input, initialize the policy parameter $\theta$ and the value-function parameter, and then, at each step, act according to $\pi_{\theta}$, observe the reward and the next state, compute the TD error, update the critic using the TD error, and update the actor in the policy-gradient direction weighted by the TD error.
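A one-step actor-critic sketch with a tabular softmax actor and a tabular critic; the parametrizations, learning rates, and environment interface are illustrative assumptions.

```python
import numpy as np

def actor_critic(env, num_states, num_actions, num_iters,
                 actor_lr=0.01, critic_lr=0.1, gamma=0.99, seed=0):
    """One-step actor-critic with a tabular softmax actor and tabular critic.

    The env.reset()/env.step() interface is an illustrative assumption.
    """
    rng = np.random.default_rng(seed)
    theta = np.zeros((num_states, num_actions))   # actor (policy) parameters
    V = np.zeros(num_states)                      # critic (value) estimate

    def policy_probs(state):
        logits = theta[state] - np.max(theta[state])
        p = np.exp(logits)
        return p / p.sum()

    s = env.reset()
    for _ in range(num_iters):
        p = policy_probs(s)
        a = int(rng.choice(num_actions, p=p))
        s_next, r, done = env.step(a)
        # TD error plays the role of the advantage estimate.
        target = r + (0.0 if done else gamma * V[s_next])
        delta = target - V[s]
        # Critic update: move V(s) towards the bootstrapped target.
        V[s] += critic_lr * delta
        # Actor update: policy-gradient step weighted by the TD error.
        grad_log = -p.copy()
        grad_log[a] += 1.0
        theta[s] += actor_lr * delta * grad_log
        s = env.reset() if done else s_next
    return theta, V
```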
Parametrized Value Functions

Neural Fitted Q-learning

Deep Q-Network (DQN)

Double Deep Q-Network (Double DQN)
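To make the distinction between DQN and Double DQN concrete, the sketch below computes both one-step targets with a separate target network; the PyTorch networks, the batch layout, and the keyword `double` are illustrative assumptions, not a particular library's API.

```python
import torch

def dqn_targets(q_net, target_net, batch, gamma=0.99, double=False):
    """One-step targets for (Double) DQN.

    q_net and target_net map states to Q-values of shape (batch, num_actions);
    batch is a dict of tensors with keys 'reward', 'next_state', 'done'.
    These shapes and keys are illustrative assumptions.
    """
    rewards = batch["reward"]               # shape (B,)
    next_states = batch["next_state"]       # shape (B, state_dim)
    done = batch["done"].float()            # shape (B,), 1.0 if terminal
    with torch.no_grad():
        if double:
            # Double DQN: the online network selects the action,
            # the target network evaluates it (reduces over-estimation bias).
            next_actions = q_net(next_states).argmax(dim=1, keepdim=True)
            next_q = target_net(next_states).gather(1, next_actions).squeeze(1)
        else:
            # Vanilla DQN: the target network both selects and evaluates.
            next_q = target_net(next_states).max(dim=1).values
    return rewards + gamma * (1.0 - done) * next_q
```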
Algorithm 6 (the DDPG algorithm): take an actor $\mu(s \mid \theta^{\mu})$ and a critic $Q(s, a \mid \theta^{Q})$ as input, initialize the target-network parameters $\theta^{\mu'}$ and $\theta^{Q'}$ as copies of the actor and critic parameters, and initialize the replay buffer. At each step, select an action from the current actor with added exploration noise, execute it, observe the reward and the next state, and store the transition in the replay buffer; then sample a random minibatch of stored transitions, form the targets
\[
y_i = r_i + \gamma\, Q'\big(s_{i+1}, \mu'(s_{i+1} \mid \theta^{\mu'}) \mid \theta^{Q'}\big)
\]
with $Q'$ and $\mu'$ the target critic and target actor, update the critic by minimizing the squared error between $Q(s_i, a_i \mid \theta^{Q})$ and $y_i$, update the actor with the deterministic policy gradient, and softly update the target networks, $\theta' \leftarrow \tau \theta + (1-\tau)\, \theta'$ with $\tau \ll 1$.
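A sketch of one such DDPG update step in PyTorch, assuming the actor, critic, target networks, optimizers, and a replay-buffer minibatch (a dict of tensors) have already been constructed; all names, shapes, and hyperparameters here are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def ddpg_update(actor, critic, target_actor, target_critic,
                actor_opt, critic_opt, batch, gamma=0.99, tau=0.005):
    """One DDPG update from a sampled minibatch.

    batch is a dict of tensors ('state', 'action', 'reward', 'next_state',
    'done'); the critic is assumed to take (state, action) and return a
    (B, 1) tensor.  All of this is an illustrative assumption.
    """
    s, a = batch["state"], batch["action"]
    r, s_next, done = batch["reward"], batch["next_state"], batch["done"].float()

    # Critic target uses the *target* actor and critic networks.
    with torch.no_grad():
        y = r + gamma * (1.0 - done) * target_critic(s_next, target_actor(s_next)).squeeze(1)

    # Critic step: regress Q(s, a) onto the target y.
    critic_loss = F.mse_loss(critic(s, a).squeeze(1), y)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor step: deterministic policy gradient, i.e. ascend Q(s, actor(s)).
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # Soft update of the target networks: theta' <- tau*theta + (1-tau)*theta'.
    with torch.no_grad():
        for p, p_targ in zip(critic.parameters(), target_critic.parameters()):
            p_targ.mul_(1.0 - tau).add_(tau * p)
        for p, p_targ in zip(actor.parameters(), target_actor.parameters()):
            p_targ.mul_(1.0 - tau).add_(tau * p)
```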
Evaluation Criteria
Benchmark Algorithms
Stochastic Control Approach.
an exploration-exploitation algorithm to learn the investor's risk appetite over time by observing her portfolio choices in different market environments
an investment robo-advising framework consisting of two agents
Offline Learning and Online Exploration
Learning with a Limited Exploration Budget
Learning with Multiple Objectives
Learning to Allocate Across Lit Pools and Dark Pools
Robo-advising in a Model-free Setting
Sample Efficiency in Learning Trading Strategies
Transfer Learning and Cold Start for Learning New Assets