E denotes the state space of the system. A state x ∈ E is the information available to the controller at time n; given this information, an action has to be selected.
A denotes the action space. Given a specific state x at time n, only a certain subclass D_n(x) ⊂ A of actions may be admissible.
Q_n is a stochastic transition kernel which gives the probability Q_n(B | x, a) that the next state at time n + 1 lies in the set B if the current state is x and action a is taken at time n.
r_n(x, a) gives the (discounted) one-stage reward of the system at time n if the current state is x and action a is taken.
g_N(x) gives the (discounted) terminal reward of the system at the end of the planning horizon N.
A control is a sequence of decision rules (f_n), where f_n : E → A determines for each possible state the next action at time n. Such a sequence π = (f_0, f_1, …, f_{N−1}) is called a policy or strategy. Formally, the Markov Decision Problem is the optimization problem sketched below.
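In standard notation (a sketch; the exact formulation in the source may differ slightly), the N-stage problem is to find a policy maximizing the expected total reward:

$$ \sup_{\pi}\; \mathbb{E}_x^{\pi}\!\left[\sum_{n=0}^{N-1} r_n\big(X_n, f_n(X_n)\big) + g_N(X_N)\right], \qquad x \in E . $$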
Types of MDP problems:
finite horizon (N < ∞) vs. infinite horizon (N = ∞)
complete state observation vs. partial state observation
problems with constraints vs. without constraints
total (discounted) cost criterion vs. average cost criterion
Research questions:
Does an optimal policy exist?
Does it have a particular form?
Can an optimal policy be computed efficiently?
Is it possible to derive properties of the optimal value function analytically?
Applications
Consumption Problem
Suppose there is an investor with given initial capital. At the beginning of each of N periods she can decide how much of the capital she consumes and how much she invests into a risky asset. The amount she consumes is evaluated by a utility function, as is the terminal wealth. The remaining capital is invested into a risky asset, where we assume that the investor is small and thus not able to influence the asset price, and that the asset is liquid. How should she consume and invest in order to maximize the sum of her expected discounted utility?
Cash Balance or Inventory Problem
Imagine a company which tries to find the optimal level of cash over a finite number of periods. We assume that there is a random change in the cash reserve each period (due to withdrawals or earnings). Since the firm does not earn interest on the cash position, there are holding costs for the cash reserve if it is positive, but also interest costs in case it is negative. The cash reserve can be increased or decreased by the management at each decision epoch, which incurs transfer costs. What is the optimal cash balance policy?
Mean-Variance Problem
Consider a small investor who acts on a given financial market. Her aim is to choose, among all portfolios which yield at least a certain expected return (benchmark) after N periods, the one with the smallest portfolio variance. What is the optimal investment strategy?
Dividend Problem in Risk Theory
Imagine we consider the risk reserve of an insurance company which earns premia on the one hand but has to pay out possible claims on the other hand. At the beginning of each period the insurer can decide whether to pay a dividend. A dividend can only be paid when the risk reserve at that time point is positive. Once the risk reserve becomes negative, we say that the company is ruined and has to stop its business. Which dividend pay-out policy maximizes the expected discounted dividends until ruin?
Bandit Problem
Suppose we have two slot machines with unknown success probabilities. At each stage we have to choose one of the arms. We receive one Euro if the arm wins; otherwise no cash flow appears. How should we play in order to maximize our expected total reward over N trials?
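One way to make this precise as a finite-horizon MDP is the Bayesian formulation, where the state is the pair of success/failure counts of each arm. The sketch below follows that route; the uniform (Beta(1,1)) priors and the horizon N = 10 are my assumptions, not part of the original problem statement.

```python
from functools import lru_cache

N = 10  # horizon (number of pulls); an illustrative choice

@lru_cache(maxsize=None)
def value(n, s1, f1, s2, f2):
    """Maximal expected number of wins over the remaining N - n pulls,
    given Beta(1+s, 1+f) posteriors for each arm (uniform priors)."""
    if n == N:
        return 0.0
    p1 = (1 + s1) / (2 + s1 + f1)   # posterior mean of arm 1
    p2 = (1 + s2) / (2 + s2 + f2)   # posterior mean of arm 2
    v1 = p1 * (1 + value(n + 1, s1 + 1, f1, s2, f2)) + (1 - p1) * value(n + 1, s1, f1 + 1, s2, f2)
    v2 = p2 * (1 + value(n + 1, s1, f1, s2 + 1, f2)) + (1 - p2) * value(n + 1, s1, f1, s2, f2 + 1)
    return max(v1, v2)

print(value(0, 0, 0, 0, 0))  # optimal expected total reward over N pulls
```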
Pricing of American Options
In order to find the fair price of an American option and its optimal exercise time, one has to solve an optimal stopping problem. In contrast to a European option the buyer of an American option can choose to exercise any time up to and including the expiration time. Such an optimal stopping problem can be solved in the framework of Markov Decision Processes.
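As a concrete instance of such an optimal stopping problem, here is a short Python sketch that prices an American put by backward induction on a Cox-Ross-Rubinstein tree (the binomial model mentioned later in these notes); the strike, tree parameters and horizon are illustrative assumptions.

```python
import numpy as np

# Illustrative parameters for a Cox-Ross-Rubinstein tree; all values are assumptions.
s0, K, u, d, i, N = 100.0, 100.0, 1.1, 0.9, 0.02, 50
q = (1 + i - d) / (u - d)          # risk-neutral up-probability (requires d < 1 + i < u)

# put payoff at maturity on the recombining tree, then backward induction
V = np.maximum(K - s0 * u**np.arange(N + 1) * d**(N - np.arange(N + 1)), 0.0)
for n in reversed(range(N)):
    S = s0 * u**np.arange(n + 1) * d**(n - np.arange(n + 1))   # prices at time n
    cont = (q * V[1:] + (1 - q) * V[:-1]) / (1 + i)            # discounted continuation value
    V = np.maximum(K - S, cont)                                # optimal stopping: exercise vs. continue
print("American put price:", round(float(V[0]), 4))
```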
A Markov Decision Model with planning horizon N consists of a set of data (E, A, D_n, Q_n, r_n, g_N) with the following meaning for n = 0, 1, …, N − 1:
E is the state space, endowed with a σ-algebra 𝓔. The elements (states) are denoted by x ∈ E.
A is the action space, endowed with a σ-algebra 𝓐. The elements (actions) are denoted by a ∈ A.
D_n is a measurable subset of E × A and denotes the set of possible state-action combinations at time n. We assume that D_n contains the graph of a measurable mapping f_n : E → A, i.e. (x, f_n(x)) ∈ D_n for all x ∈ E. For x ∈ E, the set D_n(x) = {a ∈ A : (x, a) ∈ D_n} is the set of admissible actions in state x at time n.
Q_n is a stochastic transition kernel from D_n to E, i.e. for any fixed pair (x, a) ∈ D_n, the mapping B ↦ Q_n(B | x, a) is a probability measure on 𝓔 and (x, a) ↦ Q_n(B | x, a) is measurable for all B ∈ 𝓔. The quantity Q_n(B | x, a) gives the probability that the next state at time n + 1 is in B if the current state is x and action a is taken at time n. Q_n describes the transition law.
r_n : D_n → ℝ is a measurable function. r_n(x, a) gives the (discounted) one-stage reward of the system at time n if the current state is x and action a is taken.
g_N : E → ℝ is a measurable mapping. g_N(x) gives the (discounted) terminal reward of the system at time N if the state is x.
An equivalent definition of MDP
A Markov Decision Model is equivalently described by a set of data in which the transition kernel is replaced by a disturbance space, a disturbance kernel and a transition function, with the following meaning:
E, A, D_n, r_n and g_N are as in the definition on the last slide.
Z is the disturbance space, endowed with a σ-algebra 𝓩.
Q^Z_n is a stochastic transition kernel from D_n to Z; Q^Z_n(B | x, a) denotes the probability that the disturbance at time n + 1 lies in B ∈ 𝓩 if the current state is x and action a is taken.
T_n : D_n × Z → E is a measurable function, called the transition or system function. T_n(x, a, z) gives the next state of the system at time n + 1 if at time n the system is in state x, action a is taken and the disturbance z occurs at time n + 1.
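For completeness, the two descriptions are linked by the following standard identity (stated in the notation above): simulating the disturbance and applying the transition function reproduces the transition law,

$$ Q_n(B \mid x, a) \;=\; Q_n^{Z}\big(\{ z \in Z : T_n(x, a, z) \in B \} \,\big|\, x, a\big), \qquad B \in \mathcal{E},\ (x, a) \in D_n . $$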
Example: Consumption Problem
We denote by Z_{n+1} the random return of our risky asset over period n. Further, we suppose that Z_1, …, Z_N are non-negative, independent random variables, and we assume that the consumption at time n is evaluated by a utility function U_n. The final capital is also evaluated by a utility function U_N. Thus we choose the following data:
E = [0, ∞), where x ∈ E denotes the wealth of the investor at time n,
A = [0, ∞), where a ∈ A denotes the wealth which is consumed at time n,
D_n(x) = [0, x] for all x ∈ E, i.e. we are not allowed to borrow money,
Z = [0, ∞), where z ∈ Z denotes the random return of the asset,
T_n(x, a, z) = (x − a) z is the transition function,
Q^Z_n(· | x, a) := distribution of Z_{n+1} (independent of (x, a)),
r_n(x, a) = U_n(a) is the one-stage reward, and g_N(x) = U_N(x) is the terminal reward.
Decision rule & strategy
A measurable mapping f_n : E → A with the property f_n(x) ∈ D_n(x) for all x ∈ E is called a decision rule at time n. We denote by F_n the set of all decision rules at time n.
A sequence of decision rules π = (f_0, f_1, …, f_{N−1}) with f_n ∈ F_n is called an N-stage policy or N-stage strategy.
Value function:
Theorem: For it holds:
Finite Horizon Markov Decision Models
Integrability Assumption (A_N): For n = 0, 1, …, N,
δ_n(x) := sup_π 𝔼^π_{n,x}[ Σ_{k=n}^{N−1} r_k⁺(X_k, f_k(X_k)) + g_N⁺(X_N) ] < ∞ for all x ∈ E.
Assumption (A_N) is assumed to hold for the N-stage Markov Decision Problems.
Example (Consumption Problem): In the consumption problem, Assumption (A_N) is satisfied if we assume that the utility functions U_n are increasing and concave and 𝔼[Z_n] < ∞ for all n, because then each U_n can be bounded from above by an affine-linear function, and since the wealth stays below x Z_1 ⋯ Z_n a.s. under every policy, the function δ_n from Assumption (A_N) is finite.
For n = 0, 1, …, N and a policy π = (f_0, …, f_{N−1}) let V_{nπ} be defined by
V_{nπ}(x) := 𝔼^π_{n,x}[ Σ_{k=n}^{N−1} r_k(X_k, f_k(X_k)) + g_N(X_N) ], x ∈ E.
V_{nπ}(x) is the expected total reward at time n over the remaining stages n to N if we use policy π and start in state x at time n. The value function is defined by
V_n(x) := sup_π V_{nπ}(x), x ∈ E.
V_n(x) is the maximal expected total reward at time n over the remaining stages n to N if we start in state x at time n.
The functions V_{nπ} and V_n are well-defined under the Integrability Assumption (A_N), since V_{nπ}(x) ≤ V_n(x) ≤ δ_n(x) < ∞.
The Bellman Equation
Let us denote by 𝕄(E) the set of measurable functions v : E → [−∞, ∞). We define the following operators:
For v ∈ 𝕄(E) define
(L_n v)(x, a) := r_n(x, a) + ∫ v(x′) Q_n(dx′ | x, a), (x, a) ∈ D_n,
whenever the integral exists.
For v ∈ 𝕄(E) and f ∈ F_n define
(T_{n,f} v)(x) := (L_n v)(x, f(x)), x ∈ E.
For v ∈ 𝕄(E) define
(T_n v)(x) := sup_{a ∈ D_n(x)} (L_n v)(x, a), x ∈ E.
T_n is called the maximal reward operator at time n.
Reward Iteration
Theorem
Let π = (f_0, …, f_{N−1}) be an N-stage policy. For n = 0, 1, …, N − 1 it holds:
V_{Nπ} = g_N and V_{nπ} = T_{n,f_n} V_{n+1,π},
V_{nπ} = T_{n,f_n} T_{n+1,f_{n+1}} ⋯ T_{N−1,f_{N−1}} g_N.
Example (Consumption Problem): Note that for v ∈ 𝕄(E) and f ∈ F_n the operator T_{n,f} in this example reads
(T_{n,f} v)(x) = U_n(f(x)) + 𝔼[ v((x − f(x)) Z_{n+1}) ].
Now let us assume that all utility functions are logarithmic, i.e. U_n(x) = log x for n = 0, …, N. Moreover, we assume that the return distribution is independent of n and has finite expectation. Then (A_N) is satisfied as we have shown before. If we choose the N-stage policy π = (f, …, f) with f(x) = cx and c ∈ (0, 1), i.e. we always consume a constant fraction of the wealth, then the Reward Iteration implies by induction on n that V_{nπ}(x) = (N − n + 1) log x + d_n for constants d_n depending on c, the horizon and the return distribution.
Hence, maximizing the resulting expression over the consumption fraction yields the linear policy that maximizes the expected log-utility (among all linear consumption policies).
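As a quick numerical companion to this example, the following Python sketch estimates the expected log-utility of a constant-fraction consumption policy by Monte Carlo; the lognormal return distribution, the horizon N = 5, the initial wealth and the tested fractions are illustrative assumptions, not taken from the source.

```python
import numpy as np

rng = np.random.default_rng(0)

def expected_log_utility(c, x0=1.0, N=5, n_paths=100_000):
    """Monte Carlo estimate of E[sum_n log(consumption_n) + log(terminal wealth)]
    under the linear policy 'consume the fraction c of current wealth each period'."""
    Z = rng.lognormal(mean=0.05, sigma=0.2, size=(n_paths, N))  # i.i.d. returns (assumed)
    x = np.full(n_paths, x0)
    total = np.zeros(n_paths)
    for n in range(N):
        consume = c * x                      # action a = c * x, admissible since 0 <= a <= x
        total += np.log(consume)             # one-stage reward U_n(a) = log(a)
        x = (x - consume) * Z[:, n]          # transition x' = (x - a) * Z_{n+1}
    total += np.log(x)                       # terminal reward g_N(x) = log(x)
    return total.mean()

# compare a few consumption fractions
for c in (0.1, 1.0 / 6.0, 0.2, 0.3):
    print(f"c = {c:.3f}: estimated value {expected_log_utility(c):.4f}")
```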
Maximizer, the Bellman Equation & Verification Theorem
Definition of a maximizer: Let v ∈ 𝕄(E). A decision rule f ∈ F_n is called a maximizer of v at time n if T_{n,f} v = T_n v, i.e. for all x ∈ E, f(x) is a maximum point of the mapping a ↦ (L_n v)(x, a), a ∈ D_n(x).
The Bellman Equation: v_N = g_N and v_n = T_n v_{n+1} for n = N − 1, N − 2, …, 0.
Verification Theorem: Let (v_n) ⊂ 𝕄(E) be a solution of the Bellman equation. Then it holds:
v_n ≥ V_n for n = 0, 1, …, N.
If f_n is a maximizer of v_{n+1} for n = 0, 1, …, N − 1, then v_n = V_n for all n and the policy (f_0, …, f_{N−1}) is optimal for the N-stage Markov Decision Problem.
The Structure Assumption & Structure Theorem
Structure Assumption (SA_N): There exist sets 𝕄_n ⊂ 𝕄(E) and Δ_n ⊂ F_n such that for all n = 0, 1, …, N − 1:
g_N ∈ 𝕄_N.
If v ∈ 𝕄_{n+1}, then T_n v is well-defined and T_n v ∈ 𝕄_n.
For all v ∈ 𝕄_{n+1} there exists a maximizer f_n of v with f_n ∈ Δ_n.
Structure Theorem: Let (SA_N) be satisfied. Then it holds:
V_n ∈ 𝕄_n and the sequence (V_n) satisfies the Bellman equation, i.e. V_N = g_N and, for n = 0, 1, …, N − 1,
V_n(x) = sup_{a ∈ D_n(x)} { r_n(x, a) + ∫ V_{n+1}(x′) Q_n(dx′ | x, a) }, x ∈ E.
For n = 0, 1, …, N − 1 there exist maximizers f_n of V_{n+1} with f_n ∈ Δ_n, and every sequence of maximizers f_n* of V_{n+1} defines an optimal policy (f_0*, …, f_{N−1}*) for the N-stage Markov Decision Problem.
A corollary: Let (SA_N) be satisfied. If n ≤ m ≤ N, then it holds: V_n = T_n T_{n+1} ⋯ T_{m−1} V_m.
Backward Induction Algorithm
Principle of Dynamic Programming: Let (SA_N) be satisfied. Then it holds for n = 0, 1, …, N − 1:
i.e. if (f_n*, …, f_{N−1}*) is optimal for the time period [n, N], then (f_{n+1}*, …, f_{N−1}*) is optimal for [n + 1, N].
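To make the backward induction concrete, here is a minimal Python sketch for a finite state and action space; the tiny two-state, two-action model, its rewards and its transition probabilities are made-up illustrative data, not an example from the source.

```python
import numpy as np

# Illustrative finite MDP: 2 states, 2 actions, horizon N = 4.
N = 4
n_states, n_actions = 2, 2
# r[x, a]: one-stage reward (here time-independent).
r = np.array([[1.0, 0.0],
              [0.0, 2.0]])
# Q[a, x, x']: probability of moving from x to x' under action a.
Q = np.array([[[0.9, 0.1],
               [0.2, 0.8]],
              [[0.5, 0.5],
               [0.5, 0.5]]])
g = np.array([0.0, 1.0])            # terminal reward g_N

V = g.copy()
policy = []
for n in reversed(range(N)):        # backward induction: n = N-1, ..., 0
    # (L_n V)(x, a) = r(x, a) + sum_x' Q(x'|x, a) * V(x')
    LV = r + np.stack([Q[a] @ V for a in range(n_actions)], axis=1)
    policy.append(LV.argmax(axis=1))  # maximizer f_n
    V = LV.max(axis=1)                # Bellman equation: V_n = T_n V_{n+1}
policy.reverse()

print("V_0 =", V)
print("optimal decision rules f_0..f_{N-1}:", policy)
```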
The Financial Markets
Asset Dynamics and Portfolio Strategies
We assume that asset prices are monitored in discrete time
time is divided into N periods of equal length, and
multiplicative model for asset prices: the price at time n + 1 equals the price at time n multiplied by a random (gross) return factor.
The binomial model (Cox-Ross-Rubinstein model) and the discretization of the Black-Scholes-Merton model are two important special cases of the multiplicative model.
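A minimal sketch of the multiplicative dynamics in the Cox-Ross-Rubinstein case; the numerical values for the up/down factors, the interest rate and the horizon are illustrative assumptions.

```python
import numpy as np

# Cox-Ross-Rubinstein: each period the price moves up by factor u or down by factor d.
u, d, i, N, s0 = 1.1, 0.95, 0.02, 5, 100.0   # illustrative parameters

rng = np.random.default_rng(1)
q = (1 + i - d) / (u - d)        # risk-neutral up-probability (requires d < 1 + i < u)

# simulate one price path under the risk-neutral measure
factors = np.where(rng.random(N) < q, u, d)
path = s0 * np.cumprod(np.concatenate(([1.0], factors)))
print("simulated CRR path:", np.round(path, 2))

# full recombining lattice of possible prices S_n = s0 * u**k * d**(n-k)
lattice = [s0 * u**np.arange(n + 1) * d**(n - np.arange(n + 1)) for n in range(N + 1)]
print("prices at maturity:", np.round(lattice[-1], 2))
```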
An N-period financial market with d risky assets and one riskless bond
A riskless bond with S^0_0 = 1 and S^0_{n+1} = (1 + i_{n+1}) S^0_n, where i_{n+1} denotes the deterministic interest rate for period n.
There are d risky assets, and the price process (S^k_n) of asset k is given by S^k_0 > 0 and S^k_{n+1} = R̃^k_{n+1} S^k_n, where R̃^k_{n+1} denotes the random gross return factor of asset k over period n.
A portfolio or a trading strategy is an (F_n)-adapted stochastic process φ = (φ^0_n, φ_n), n = 0, 1, …, N − 1, where φ^0_n ∈ ℝ and φ_n = (φ^1_n, …, φ^d_n) ∈ ℝ^d. The quantity φ^k_n denotes the amount of money which is invested into asset k during the time interval [n, n + 1).
The vector (φ^0_0, φ_0) is called the initial portfolio of the investor. The value of the initial portfolio is given by
X_0 = φ^0_0 + φ_0 · 𝟙,
where x · y denotes the inner product of the vectors x and y and 𝟙 = (1, …, 1).
Let φ be a portfolio strategy and denote by X^−_n the value of the portfolio at time n before trading. Then
X^−_{n+1} = (1 + i_{n+1}) φ^0_n + Σ_{k=1}^d φ^k_n S^k_{n+1} / S^k_n.
The value of the portfolio at time n after trading is given by
X^+_n = φ^0_n + Σ_{k=1}^d φ^k_n.
Self-financing: A portfolio strategy φ is called self-financing if
X^−_n = φ^0_n + Σ_{k=1}^d φ^k_n
for all n = 1, …, N − 1, i.e. the current wealth is just reassigned to the assets.
Arbitrage opportunity: An arbitrage opportunity is a self-financing portfolio strategy φ with the following property: the initial value is X_0 = 0 and the terminal value X_N satisfies ℙ(X_N ≥ 0) = 1 and ℙ(X_N > 0) > 0.
A theorem: Consider an N-period financial market. The following two statements are equivalent:
There are no arbitrage opportunities.
For n = 0, 1, …, N − 1 and all F_n-measurable portfolio vectors φ, the one-period condition sketched below holds.
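One standard way to state this condition, in terms of the relative risk vector R_{n+1} (the notation and the exact discounting convention are assumptions here; the source may phrase it slightly differently):

$$ \phi \cdot R_{n+1} \;\geq\; 0 \ \ \mathbb{P}\text{-a.s.} \quad \Longrightarrow \quad \phi \cdot R_{n+1} \;=\; 0 \ \ \mathbb{P}\text{-a.s.} \qquad \text{for every } \mathcal{F}_n\text{-measurable } \phi . $$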
Modeling and Solution Approaches with MDP
Modeling approach: specify the elements of the MDP
state space E, action space A
transition function T_n
value function V_n
the Bellman equation
Solution approaches
backward induction (with the Bellman equation)
the existence and form of an optimal policy
we are interested in the structural properties that are preserved under the iteration
MDP Applications in Finance
A Cash Balance Problem
involves the decision about the optimal cash level of a firm over a finite number of periods
a random change in the cash reserve each period (which can be positive or negative).
costs
holding cost or opportunity cost for the cash reserve if it is positive
out-of-pocket expense (interest cost) if the cash reserve is negative
The cash reserve can be increased or decreased by the management at the beginning of each period
transfer costs
random changes in the cash flow are given by independent and identically distributed random variables (Z_n) with finite expectation
The cost L(x) has to be paid at the beginning of a period for cash level x, where the cost function L is non-negative, vanishes at zero, and is convex.
Problem Formulation
Elements of the MDP
E = ℝ, where x ∈ E denotes the cash level,
A = ℝ, where a ∈ A denotes the new cash level after transfer,
D_n(x) = A, i.e. every transfer is admissible,
Z = ℝ, where z ∈ Z denotes the random cash change over the period,
Q^Z_n := distribution of Z_{n+1} (independent of the current cash level and the transfer),
the one-stage reward r_n(x, a) collects, with a negative sign, the transfer cost for moving the cash level from x to a and the expected holding/shortage cost at the post-transfer level,
g_N is the terminal reward.
The state transition function: the next cash level is the post-transfer level a shifted by the random cash change z.
The value function
The Bellman equation
Solution
The solution is worked out by backward induction.
We verify the validity of the Integrability Assumption (A_N) and the corresponding Structure Assumption (SA_N) for each n.
The solution to the cash balance problem (Theorem 2.6.2):
There exist critical levels S_n⁻ and S_n⁺ with S_n⁻ ≤ S_n⁺ such that for n = 0, 1, …, N − 1 the optimal transfer rule is a band policy: increase the cash level to S_n⁻ if x < S_n⁻, do nothing if S_n⁻ ≤ x ≤ S_n⁺, and decrease the cash level to S_n⁺ if x > S_n⁺.
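A numerical sketch of the backward induction for a discretized cash balance problem; the grid, cost parameters, cash-change distribution and the zero terminal cost are all illustrative assumptions, and costs are minimized rather than (negative) rewards maximized for readability. The printed no-transfer ranges illustrate the band structure of the optimal policy.

```python
import numpy as np

# Illustrative discretized cash balance problem; all numbers are assumptions.
grid = np.arange(-20, 21)              # admissible cash levels (states and actions)
N = 6                                   # planning horizon
c_transfer = 0.8                        # proportional transfer cost per unit moved
hold, short = 0.1, 0.4                  # holding cost (x > 0) and shortage cost (x < 0)
Z_vals = np.array([-3, -1, 0, 1, 3])    # possible cash changes per period
Z_prob = np.array([0.1, 0.25, 0.3, 0.25, 0.1])

def L_cost(x):
    """Holding/shortage cost for cash level x."""
    return hold * np.maximum(x, 0) + short * np.maximum(-x, 0)

V = np.zeros(len(grid))                 # terminal cost assumed to be zero
for n in range(N):
    # expected continuation cost if the post-transfer level is a: next state a - Z
    cont = np.array([
        np.sum(Z_prob * np.interp(a - Z_vals, grid, V)) for a in grid
    ])
    # total cost of choosing post-transfer level a (columns) in state x (rows)
    total = c_transfer * np.abs(grid[None, :] - grid[:, None]) \
            + L_cost(grid)[None, :] + cont[None, :]
    best = total.argmin(axis=1)
    V = total.min(axis=1)
    # the optimal rule is (numerically) a band policy: report where no transfer is made
    no_transfer = grid[best == np.arange(len(grid))]
    if no_transfer.size:
        print(f"stage {N - 1 - n}: no-transfer band roughly [{no_transfer.min()}, {no_transfer.max()}]")
```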
Consumption and Investment Problems
The investor has an initial wealth x_0 > 0, and at the beginning of each of N periods she can decide how much of the wealth she consumes and how much she invests into the financial market.
The amount which is consumed at time n is evaluated by a utility function. The remaining wealth is invested in the risky assets and the riskless bond, and the terminal wealth is evaluated by another utility function.
How should the agent consume and invest in order to maximize the sum of her expected utilities?
Formulation
Assumption (FM)
There is no arbitrage opportunity.
for all .
The utility functions and satisfy .
The wealth process (X_n) evolves as follows
where (c_n, φ_n) is a consumption-investment strategy, i.e. (c_n) and (φ_n) are (F_n)-adapted and 0 ≤ c_n ≤ X_n.
The consumption-investment problem is then given by
Elements of the MDP
E = [0, ∞), where x ∈ E denotes the wealth,
A = ℝ^d × [0, ∞), where an action a = (α, c) consists of the amounts α invested in the risky assets and the amount c which is consumed,
D_n(x) is given by the consumption-investment pairs that are admissible for wealth x,
Z ⊂ ℝ^d, where z ∈ Z denotes the relative risk,
,
.
The value function
Solution
Let and be utility functions with . Then it holds:
There are no arbitrage opportunities if and only if there exists a measurable function such that
is strictly increasing, strictly concave and continuous on .
For the multiperiod consumption-investment problem it holds:
The value functions are strictly increasing, strictly concave and continuous.
The value functions can be computed recursively by the Bellman equation
There exist maximizers f_n* of V_{n+1}, and the strategy (f_0*, …, f_{N−1}*) is optimal for the N-stage consumption-investment problem.
Terminal Wealth Problems
Suppose we have an investor with utility function U, where dom U = [0, ∞) or (0, ∞), and initial wealth x_0 > 0. A financial market with d risky assets and one riskless bond is given. Here we assume that the random return vectors are independent but not necessarily identically distributed. Moreover, we assume that (F_n) is the filtration generated by the stock prices. We make the (FM) assumption for the financial market.
Our agent has to invest all the money into this market and is allowed to rearrange her portfolio over N stages. The aim is to maximize the expected utility of her terminal wealth.
Formulation
The wealth process (X_n) evolves as follows
where φ = (φ_n) is a portfolio strategy. The optimization problem is then
Elements of the MDP
E = dom U, where x ∈ E denotes the wealth,
A = ℝ^d, where a ∈ A is the amount of money invested in the risky assets,
,
Z ⊂ ℝ^d, where z ∈ Z denotes the relative risk,
,
,
, and .
Solution
For the multiperiod terminal wealth problem it holds:
The value functions are strictly increasing, strictly concave and continuous.
The value functions can be computed recursively by the Bellman equation
There exist maximizers f_n* of V_{n+1}, and the strategy (f_0*, …, f_{N−1}*) is optimal for the N-stage terminal wealth problem.
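For intuition, here is a small numeric sketch of the terminal wealth problem under additional simplifying assumptions that are mine, not the source's: one risky asset with an i.i.d. two-point return distribution, power utility U(x) = x^γ/γ, and no short selling or leverage. Under these assumptions the value function keeps the form V_n(x) = d_n · U(x), so the optimal policy invests a constant fraction of wealth in the stock, which the code finds by a grid search.

```python
import numpy as np

# Illustrative parameters (assumptions): power utility, one risky asset, two-point returns.
gamma = 0.5
i_rate = 0.02                     # riskless one-period interest rate
R_vals = np.array([0.25, -0.15])  # possible relative price changes of the stock
R_prob = np.array([0.5, 0.5])
N = 5

fractions = np.linspace(0.0, 1.0, 1001)   # search grid (no short selling / leverage assumed)

def one_period_factor(alpha):
    """E[(1 + i + alpha * (R - i))**gamma] for the wealth fraction alpha in the stock."""
    growth = 1 + i_rate + alpha * (R_vals - i_rate)
    return np.sum(R_prob * growth**gamma)

best = max(fractions, key=one_period_factor)
d = one_period_factor(best) ** N          # V_0(x) = d * U(x) since the returns are i.i.d.
print(f"optimal stock fraction per period: {best:.3f}")
print(f"V_0(x) = {d:.4f} * x**{gamma} / {gamma}")
```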
Portfolio Selection with Transaction Costs
We consider now the utility maximization problem under proportional transaction costs. For the sake of simplicity we restrict ourselves to one bond and one risky asset. If an additional amount (positive or negative) is invested in the stock, then proportional transaction costs are incurred which are paid from the bond position; the (fixed) cost rates for buying and selling are assumed to be given. In order to compute the transaction costs, not only the total wealth is relevant, but also the allocation between stock and bond. Thus, in contrast to the portfolio optimization problems so far, the state space of the Markov Decision Model is two-dimensional and consists of the amounts held in the bond and in the stock.
Formulation
Elements of the MDP
E ⊂ ℝ², where the state (x, y) ∈ E denotes the amounts invested in the bond and in the stock,
A ⊂ ℝ², where the action denotes the amounts invested in the bond and in the stock after the transaction,
,
Z ⊂ ℝ, where z denotes the relative price change of the stock,
,
,
,
.
Dynamic Mean-Variance Problems
We use the same non-stationary financial market as for the Terminal Wealth Problems, with independent relative risk variables. Our investor has initial wealth x_0 > 0. This wealth can be invested into d risky assets and one riskless bond. How should the agent invest over N periods in order to find a portfolio with minimal variance which yields at least a given expected return (benchmark)?
Formulation
Elements of the MDP
E = ℝ, where x ∈ E denotes the wealth,
A = ℝ^d, where a ∈ A is the amount of money invested in the risky assets,
,
Z ⊂ ℝ^d, where z ∈ Z denotes the relative risk,
,
Q^Z_n := distribution of the relative risk (independent of the state and the action).
The original formulation (MV)
An equivalent formulation (MV)
Assumption (FM):
and for all .
The covariance matrix of the relative risk process is positive definite for all n.
.
Problem (MV) can be solved via the well-known Lagrange multiplier technique, using the Lagrange function that combines the objective with the expected-return constraint, i.e.
The Lagrange problem for the parameter λ ≥ 0 is the corresponding unconstrained problem; a common form is sketched below.
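The following is one standard way to write the Lagrange function (my notation; the exact constants and sign conventions in the source may differ). Minimizing it over all portfolio strategies for a fixed λ is a stochastic linear-quadratic problem, which is the point of the next step.

$$ L_{\lambda}(\pi) \;=\; \mathbb{E}\big[(X_N^{\pi})^2\big] \;-\; \lambda\,\big(\mathbb{E}[X_N^{\pi}] - \mu\big), \qquad \lambda \ge 0 . $$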
A stochastic LQ-problem
If π* is optimal for the Lagrange problem with parameter λ*, then π* is optimal for (MV) with benchmark μ = 𝔼[X_N^{π*}].
Elements of the MDP
,
Solution: For the mean-variance problem (MV) it holds:
The value of (MV) is given by
where the relevant constant is given in (4.34).
The optimal portfolio strategy is given by
Dynamic Mean-Risk Problems
Index Tracking
Suppose we have a financial market with one bond and d risky assets. Besides the tradeable assets there is a non-tradable asset whose price process evolves according to
The positive random variable which gives the relative price change of the non-traded asset may be correlated with the relative price changes of the traded assets. It is assumed that the random vectors are independent over time and that the joint distribution within each period is given.
The aim now is to track the non-traded asset as closely as possible by investing into the financial market. The tracking error is measured in terms of the quadratic distance of the portfolio wealth to the price process of the non-traded asset, i.e. the optimization problem is then
where the portfolio φ_n is F_n-measurable.
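One common way to write the quadratic tracking criterion, with X_n^π the portfolio wealth and Y_n the price of the non-traded asset (this particular form is an assumption on my part; the source may penalize only the terminal deviation or weight the periods differently):

$$ \min_{\pi}\; \mathbb{E}\left[\sum_{n=1}^{N} \big(X_n^{\pi} - Y_n\big)^2\right]. $$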
Formulation
Elements of the MDP
E ⊂ ℝ², where the state (x, y) consists of the wealth x and the value y of the non-traded asset,
A = ℝ^d, where a ∈ A is the amount of money invested in the risky assets,
,
Z ⊂ ℝ^d × (0, ∞), where the disturbance (z, z̃) consists of the relative risk z of the traded assets and the relative price change z̃ of the non-traded asset.