Fig. 5.13. Tree of the 2-step trajectories obtained from initial state 24 in the
labyrinth of Fig. 4.1 with constant policy GO EASTWARDS
For instance, in our example, we can consider that there is some control
noise: the evolution no longer associates a single next state to each
(state-action) couple, but a random variable that takes its values in the
state space. Specifically, the nominal next state is selected with probability
0.8 and each of the two neighboring states with probability 0.1. The random
variables associated with the successive transitions are independent.
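To make that noise model concrete, here is a minimal Python sketch of the noisy transition table, restricted to the (state-action) couples spelled out in this example; the two 0.1 branches of (34, E) are not given in the text and are assumed here by symmetry with (14, E).

```python
import random

# Noisy transition model of the labyrinth example, limited to the
# (state-action) couples described in the text. States are cell numbers,
# "E" stands for the action GO EASTWARDS.
P = {
    (24, "E"): {14: 0.1, 24: 0.8, 34: 0.1},
    (14, "E"): {24: 0.1, 15: 0.8, 14: 0.1},
    # Only the nominal branch (35, probability 0.8) of (34, E) is stated in
    # the text; the two 0.1 branches below are assumed by symmetry with (14, E).
    (34, "E"): {24: 0.1, 35: 0.8, 34: 0.1},
}

def sample_next_state(state, action, rng=random):
    """Draw the random successor state of a (state-action) couple according to P."""
    successors, probabilities = zip(*P[(state, action)].items())
    return rng.choices(successors, weights=probabilities, k=1)[0]
```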
Therefore, when the constant policy GO EASTWARDS is applied, a
Markov chain is generated. Figure 5.13 shows, as a ternary tree, the
horizon-2 trajectories of that Markov chain starting from the initial state 24.
That tree consists of 3² = 9 trajectory sections. The result of the first
transition, associated with the (state-action) couple (24, E), is a random
variable that takes the three values 14, 24 and 34 in the state space, with
respective probabilities 0.1, 0.8 and 0.1. At the next step, according to the
current state, a transition is performed from one of the (state-action)
couples (14, E), (24, E) or (34, E). For instance, the result of the transition
associated with (14, E) is a random variable, independent of the previous one,
that takes the values 24, 15 and 14 with probabilities 0.1, 0.8 and 0.1 respectively.
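As an illustration (not the book's own code), the 3² = 9 trajectory sections of Fig. 5.13 can be enumerated from the table P of the previous sketch by applying the chain rule twice:

```python
def two_step_sections(start=24, action="E"):
    """Enumerate the horizon-2 trajectory sections ((s0, a), (s1, a), s2)
    rooted at `start`, together with their probabilities."""
    sections = []
    for s1, p1 in P[(start, action)].items():
        for s2, p2 in P[(s1, action)].items():
            sections.append((((start, action), (s1, action), s2), p1 * p2))
    return sections

sections = two_step_sections()
print(len(sections))                 # 9 = 3**2 trajectory sections
print(sum(p for _, p in sections))   # the 9 probabilities sum to 1 (up to rounding)
```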
The probability of an N-horizon trajectory (i.e. with N transitions) is
computed by multiplying the elementary probabilities of the transitions at
each step. For instance, the probability of the 2-trajectory ((24, E), (14, E),
(24)) is equal to 0.1 × 0.1 = 0.01. The probability of the interesting
2-trajectory ((24, E), (34, E), (35)) is equal to 0.1 × 0.8 = 0.08. The
valuation of a policy involves computing the probability of every trajectory.
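Under the same assumptions, the product rule can be written as a small helper that reproduces the two probabilities just computed, again reusing the table P of the first sketch:

```python
def trajectory_probability(trajectory):
    """Probability of a trajectory ((s0, a0), ..., (s_{N-1}, a_{N-1}), sN):
    the product of the elementary transition probabilities."""
    prob = 1.0
    steps = trajectory[:-1]      # the (state, action) couples
    arrivals = trajectory[1:]    # the point reached after each step
    for (state, action), arrival in zip(steps, arrivals):
        next_state = arrival[0] if isinstance(arrival, tuple) else arrival
        prob *= P[(state, action)][next_state]
    return prob

print(trajectory_probability([(24, "E"), (14, "E"), 24]))  # 0.1 * 0.1 = 0.01
print(trajectory_probability([(24, "E"), (34, "E"), 35]))  # 0.1 * 0.8 = 0.08
```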
Therefore, even more clearly than in the deterministic problem, it is
impossible to enumerate all the trajectories in order to compute the cost of
a policy: combinatorial explosion quickly makes that computation intractable.
Appropriate methods are needed to value a policy. Those methods will be
developed in a later section, dedicated to reinforcement learning and neuro-
dynamic programming. We are looking for an algorithm that makes it possible
to value a policy without explicitly enumerating all its trajectories.