each (state-action) couple, but not each feasible (state-action) trajectory; otherwise, the size of the data would grow exponentially with time.
One may also choose to randomize the elementary cost of each transition. This generalization is easy to handle when the criterion is the expected cost, because the random cost of a transition can simply be replaced by its expectation.
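To make that replacement concrete, here is a minimal Python sketch (the function name and figures are ours, not from the text): a random transition cost, given as a discrete law, is pre-averaged once, after which one can work with a deterministic cost.

```python
def expected_cost(cost_distribution):
    """Expectation of a random transition cost given as {cost: probability}."""
    return sum(c * p for c, p in cost_distribution.items())

# Random cost of one transition: 10 with probability 0.3, 2 with probability 0.7.
# Under the expected-cost criterion, it may be replaced by its mean, 4.4.
c_bar = expected_cost({10.0: 0.3, 2.0: 0.7})
```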
5.3.3 Definition of a Markov Decision Problem
5.3.3.1 Controlled Markov Chain
The previous example is formalized by the following definition, which is limited, for the sake of simplicity, to the case of a finite state space and a finite action set.
A Markov decision problem (MDP) consists of the following ingredients: a controlled Markov chain, an elementary cost function, a horizon length and, possibly, either a terminal cost (if the problem is a finite-horizon problem) or a discount rate (if the problem is an infinite-horizon problem).
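As an illustration only (the field names are ours, not the book's), these ingredients could be collected in a Python structure as follows:

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Set, Tuple

State = str    # illustrative type aliases
Action = str

@dataclass
class MarkovDecisionProblem:
    states: Set[State]                             # state space E
    actions: Set[Action]                           # action set A
    feasible: Set[Tuple[State, Action]]            # feasible (state-action) couples
    kernel: Dict[Tuple[State, Action], Dict[State, float]]  # P_u(x, .) for each feasible (x, u)
    cost: Callable[[State, Action, State], float]  # elementary cost of a transition
    horizon: Optional[int] = None                  # horizon length N
    terminal_cost: Optional[Callable[[State], float]] = None  # finite-horizon case
    discount: Optional[float] = None               # infinite-horizon case
```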
We previously encountered the concept of a controlled Markov process, which is the stochastic analog of a controlled dynamical system. Let us give a precise definition.
A controlled Markov chain consists of the following ingredients: a state space $E$, an action set $A$, a subset $\mathcal{A} \subset E \times A$ of feasible (state-action) couples, and a mapping $p$ from $\mathcal{A}$ into the set of probability laws on $E$. That mapping takes as input any feasible (state-action) couple $(x, u)$ and returns the probability, denoted $P_u(x, y)$, of going to state $y$ when action $u$ is performed from state $x$.
Remark. $P_u$ is a probability law and not a probability density; it is a transition probability kernel.
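A concrete sketch in Python, with invented states and actions: the kernel can be stored as a mapping from feasible couples $(x, u)$ to the laws $P_u(x, \cdot)$ over next states.

```python
import random

# Toy controlled Markov chain: E = {"s0", "s1"}, A = {"stay", "move"}.
# kernel[(x, u)] is the probability law P_u(x, .) over the next state y.
kernel = {
    ("s0", "stay"): {"s0": 0.9, "s1": 0.1},
    ("s0", "move"): {"s0": 0.2, "s1": 0.8},
    ("s1", "stay"): {"s1": 1.0},
    ("s1", "move"): {"s0": 0.7, "s1": 0.3},
}

def step(x, u):
    """Draw the next state y with probability P_u(x, y)."""
    law = kernel[(x, u)]
    return random.choices(list(law), weights=list(law.values()))[0]
```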
Thus, from the initial (state-action) couple, the probability of the (state-action) $N$-trajectory
$$\omega = \big((x_0, a_0), (x_1, a_1), \ldots, (x_{N-1}, a_{N-1}), (x_N)\big)$$
is equal to
$$P(\omega) = P_{a_0}(x_0, x_1)\, P_{a_1}(x_1, x_2) \cdots P_{a_{N-1}}(x_{N-1}, x_N).$$
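In code, this product of transition probabilities is immediate; the helper below (our naming) reuses the toy kernel sketched above.

```python
from math import prod

def trajectory_probability(kernel, states, actions):
    """P(omega) = P_{a_0}(x_0, x_1) ... P_{a_{N-1}}(x_{N-1}, x_N)
    for states x_0, ..., x_N and actions a_0, ..., a_{N-1}."""
    return prod(
        kernel[(x, a)].get(y, 0.0)
        for x, a, y in zip(states, actions, states[1:])
    )

# omega = ((s0, move), (s1, move), (s0)):
# P(omega) = P_move(s0, s1) * P_move(s1, s0) = 0.8 * 0.7 = 0.56
p = trajectory_probability(kernel, ["s0", "s1", "s0"], ["move", "move"])
```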
A (feasible) policy of the controlled Markov chain is a mapping $\pi$ from $E \times \mathbb{N}$ into $A$ such that, for any state $x$ and any time $k$, the (state-action) couple $(x, \pi(x, k))$ is feasible.
If a policy $\pi$ does not depend on time, it is called a stationary policy. In order to simplify notations, a stationary policy will also be denoted by $\pi$, viewed as a function of the state alone. Any stationary policy $\pi$ defines a Markov chain, whose transition probability $P_\pi$ on $E \times E$ is defined by
$$P_\pi(x, y) = P_{\pi(x)}(x, y).$$
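A short sketch of this construction, continuing the toy example (names ours): fixing a stationary policy collapses the controlled chain into an ordinary Markov chain.

```python
def induced_chain(kernel, policy):
    """Transition probability of the Markov chain defined by a stationary
    policy: P_pi(x, y) = P_{pi(x)}(x, y)."""
    return {x: law for (x, u), law in kernel.items() if policy(x) == u}

# Stationary policy: always play "move".
P_pi = induced_chain(kernel, lambda x: "move")
# P_pi["s0"] == {"s0": 0.2, "s1": 0.8}; P_pi["s1"] == {"s0": 0.7, "s1": 0.3}
```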