Closed-Loop Control Learning - Neural Networks: Methodology and Applications

Information Technology Reference

In-Depth Information

the terminal state from any state. Such stationary policies are called proper

stationary policies. Therefore, the terminal state is the equilibrium state (ac-

tually deterministic) of the controlled Markov chain that is associated to any

proper stationary policy.

For infinite horizon problems, because elementary costs are stationary and

without any terminal cost, there is no point in looking for an optimal non-

stationary policy. For a given state, the optimal action does not depend on

current time.

We set that elementary cost of the identical transition from terminal state

to zero, and the elementary cost of any other transition to strictly positive

values; therefore, the latter is lower bounded by a strictly positive constant

since state space is finite.

The expected total cost of a stationary policy π is defined by

J π ( x ) =

lim

N→∞

P π ( x,x 1 ) P π ( x 1 ,x 2 ) ...P π ( x N− 1 ,x N )

( x 1 ,...x N ) ∈E N

c ( x,π ( x ) ,x 1 )+ N− 1

c ( x k ,π ( x k ) ,x k +1 ) .

×

k =1

This may be written more formally using probability theory notation

c ( x,π ( x ) ,X 1 )+ ∞

c ( X k ,π ( X k ) ,X k +1 ) ,

J π ( x )= E P π,x

k =1

where P π,x is the probability distribution of the Markov chain that is associ-

ated to the stationary policy π and the initial state x .

One infers that if a stationary policy is not proper, there exists at least

one initial state such that the expected total cost from that state is infinite.

The shortest stochastic path problem consists in finding optimal policy π ∗

that minimizes the cost function J π .

5.3.3.4 Infinite Horizon Problem with Discounted Cost

Arealnumber α strictly between 0 and 1 is given. It is called the discount

rate.

To any stationary policy π and any discount rate α we associate a cost

function J π from E into R . It takes the state x as input and returns the

expected cost from initial state x for the Markov chain with transition prob-

ability kernel P π .

J π ( x ) = lim

N→∞

P π ( x,x 1 ) P π ( x 1 ,x 2 ) ...P π ( x N− 1 ,x N )

( x 1 ,...x N ) ∈E N

c ( x,π ( x ) ,x 1 )+ N− 1

α k c ( x k ,π ( x k ) ,x k +1 ) .

×

k =1

Neural Networks: Methodology and Applications

Search WWH ::

Custom Search

Home