each (state-action) couple, but not each feasible (state-action) trajectory; otherwise, the size of the data would grow exponentially with time.
One may also choose to randomize the elementary cost of each transition. This generalization is easy to handle when the criterion is the expected cost, because the random cost of a transition can simply be replaced by its expectation.
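To make that replacement concrete, here is a minimal Python sketch (the function name and figures are ours, not from the text): a random transition cost, given as a discrete law, is pre-averaged once, after which one can work with a deterministic cost.

```python
def expected_cost(cost_distribution):
    """Expectation of a random transition cost given as {cost: probability}."""
    return sum(c * p for c, p in cost_distribution.items())

# Random cost of one transition: 10 with probability 0.3, 2 with probability 0.7.
# Under the expected-cost criterion, it may be replaced by its mean, 4.4.
c_bar = expected_cost({10.0: 0.3, 2.0: 0.7})
```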
5.3.3 Definition of a Markov Decision Problem
5.3.3.1 Controlled Markov Chain
The previous example is formalized by the following definition, which is limited, for the sake of simplicity, to the case of a finite state space and a finite action set.
A Markov decision problem (MDP) consists of the following ingredients: a controlled Markov chain, an elementary cost function, a horizon length and, possibly, either a terminal cost (if the problem is a finite-horizon problem) or a discount rate (if the problem is an infinite-horizon problem).
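As an illustration only (the field names are ours, not the book's), these ingredients could be collected in a Python structure as follows:

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Set, Tuple

State = str    # illustrative type aliases
Action = str

@dataclass
class MarkovDecisionProblem:
    states: Set[State]                             # state space E
    actions: Set[Action]                           # action set A
    feasible: Set[Tuple[State, Action]]            # feasible (state-action) couples
    kernel: Dict[Tuple[State, Action], Dict[State, float]]  # P_u(x, .) for each feasible (x, u)
    cost: Callable[[State, Action, State], float]  # elementary cost of a transition
    horizon: Optional[int] = None                  # horizon length N
    terminal_cost: Optional[Callable[[State], float]] = None  # finite-horizon case
    discount: Optional[float] = None               # infinite-horizon case
```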
We previously encountered the concept of a controlled Markov process, which is the stochastic analog of a controlled dynamical system. Let us give a precise definition.
A controlled Markov chain consists of the following ingredients: a state space $E$, an action set $A$, a subset $\mathcal{A} \subset E \times A$ of feasible (state-action) couples, and a mapping $p$ from $\mathcal{A}$ into the set of probability laws on $E$. That mapping takes as input any feasible (state-action) couple $(x, u)$ and returns the probability, denoted $P_u(x, y)$, of going to state $y$ when action $u$ is performed from state $x$.
Remark. $P_u$ is a probability law and not a probability density; it is a transition probability kernel.
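A concrete sketch in Python, with invented states and actions: the kernel can be stored as a mapping from feasible couples $(x, u)$ to the laws $P_u(x, \cdot)$ over next states.

```python
import random

# Toy controlled Markov chain: E = {"s0", "s1"}, A = {"stay", "move"}.
# kernel[(x, u)] is the probability law P_u(x, .) over the next state y.
kernel = {
    ("s0", "stay"): {"s0": 0.9, "s1": 0.1},
    ("s0", "move"): {"s0": 0.2, "s1": 0.8},
    ("s1", "stay"): {"s1": 1.0},
    ("s1", "move"): {"s0": 0.7, "s1": 0.3},
}

def step(x, u):
    """Draw the next state y with probability P_u(x, y)."""
    law = kernel[(x, u)]
    return random.choices(list(law), weights=list(law.values()))[0]
```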
Thus, from the initial (state-action) couple, the probability of the (state-action) $N$-trajectory
$$\omega = \big((x_0, a_0), (x_1, a_1), \ldots, (x_{N-1}, a_{N-1}), (x_N)\big)$$
is equal to
$$P(\omega) = P_{a_0}(x_0, x_1)\, P_{a_1}(x_1, x_2) \cdots P_{a_{N-1}}(x_{N-1}, x_N).$$
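In code, this product of transition probabilities is immediate; the helper below (our naming) reuses the toy kernel sketched above.

```python
from math import prod

def trajectory_probability(kernel, states, actions):
    """P(omega) = P_{a_0}(x_0, x_1) ... P_{a_{N-1}}(x_{N-1}, x_N)
    for states x_0, ..., x_N and actions a_0, ..., a_{N-1}."""
    return prod(
        kernel[(x, a)].get(y, 0.0)
        for x, a, y in zip(states, actions, states[1:])
    )

# omega = ((s0, move), (s1, move), (s0)):
# P(omega) = P_move(s0, s1) * P_move(s1, s0) = 0.8 * 0.7 = 0.56
p = trajectory_probability(kernel, ["s0", "s1", "s0"], ["move", "move"])
```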
A (feasible) policy of the controlled Markov chain is a mapping $\pi$ from $E \times \mathbb{N}$ into $A$ such that, for any state $x$ and any time $k$, the (state-action) couple $(x, \pi(x, k))$ is feasible.
If a policy $\pi$ does not depend on time, it is called a stationary policy. In order to simplify notations, a stationary policy will also be denoted by $\pi$, viewed as a function of the state alone. Any stationary policy $\pi$ defines a Markov chain, whose transition probability $P_\pi$ on $E \times E$ is defined by
$$P_\pi(x, y) = P_{\pi(x)}(x, y).$$
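A short sketch of this construction, continuing the toy example (names ours): fixing a stationary policy collapses the controlled chain into an ordinary Markov chain.

```python
def induced_chain(kernel, policy):
    """Transition probability of the Markov chain defined by a stationary
    policy: P_pi(x, y) = P_{pi(x)}(x, y)."""
    return {x: law for (x, u), law in kernel.items() if policy(x) == u}

# Stationary policy: always play "move".
P_pi = induced_chain(kernel, lambda x: "move")
# P_pi["s0"] == {"s0": 0.2, "s1": 0.8}; P_pi["s1"] == {"s0": 0.7, "s1": 0.3}
```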