Fig. 5.13. Tree of the 2-step trajectories obtained from initial state 24 in the
labyrinth of Fig. 4.1 with constant policy GO EASTWARDS
For instance, in our example, we can consider that there is some control
noise: the evolution no longer associates a single next state to each
(state-action) couple, but a random variable that takes its values in the
state space. Specifically, the nominal next state is selected with probability
0.8 and each of the two neighboring states with probability 0.1. The random
variables associated with the successive transitions are independent.
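To make that noise model concrete, here is a minimal Python sketch of the noisy transition table, restricted to the (state-action) couples spelled out in this example; the two 0.1 branches of (34, E) are not given in the text and are assumed here by symmetry with (14, E).

```python
import random

# Noisy transition model of the labyrinth example, limited to the
# (state-action) couples described in the text. States are cell numbers,
# "E" stands for the action GO EASTWARDS.
P = {
    (24, "E"): {14: 0.1, 24: 0.8, 34: 0.1},
    (14, "E"): {24: 0.1, 15: 0.8, 14: 0.1},
    # Only the nominal branch (35, probability 0.8) of (34, E) is stated in
    # the text; the two 0.1 branches below are assumed by symmetry with (14, E).
    (34, "E"): {24: 0.1, 35: 0.8, 34: 0.1},
}

def sample_next_state(state, action, rng=random):
    """Draw the random successor state of a (state-action) couple according to P."""
    successors, probabilities = zip(*P[(state, action)].items())
    return rng.choices(successors, weights=probabilities, k=1)[0]
```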
Therefore, when the constant policy GO EASTWARDS is applied, a
Markov chain is generated. Figure 5.13 shows, as a ternary tree, the
horizon-2 trajectories of that Markov chain starting from the initial state 24.
That tree consists of 3² = 9 trajectory sections. The result of the first
transition, associated with the (state-action) couple (24, E), is a random
variable that takes the three values 14, 24 and 34 in the state space, with
respective probabilities 0.1, 0.8 and 0.1. At the next step, according to the
current state, a transition is performed from one of the (state-action)
couples (14, E), (24, E) or (34, E). For instance, the result of the transition
associated with (14, E) is a random variable, independent of the previous one,
that takes the values 24, 15 and 14 with probabilities 0.1, 0.8 and 0.1 respectively.
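As an illustration (not the book's own code), the 3² = 9 trajectory sections of Fig. 5.13 can be enumerated from the table P of the previous sketch by applying the chain rule twice:

```python
def two_step_sections(start=24, action="E"):
    """Enumerate the horizon-2 trajectory sections ((s0, a), (s1, a), s2)
    rooted at `start`, together with their probabilities."""
    sections = []
    for s1, p1 in P[(start, action)].items():
        for s2, p2 in P[(s1, action)].items():
            sections.append((((start, action), (s1, action), s2), p1 * p2))
    return sections

sections = two_step_sections()
print(len(sections))                 # 9 = 3**2 trajectory sections
print(sum(p for _, p in sections))   # the 9 probabilities sum to 1 (up to rounding)
```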
The probability of an N-horizon trajectory (i.e. with N transitions) is
computed by multiplying the elementary probabilities of the transitions at
each step. For instance, the probability of the 2-trajectory ((24, E), (14, E),
(24)) is equal to 0.1 × 0.1 = 0.01. The probability of the interesting
2-trajectory ((24, E), (34, E), (35)) is equal to 0.1 × 0.8 = 0.08. The
valuation of a policy involves computing the probability of every trajectory.
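Under the same assumptions, the product rule can be written as a small helper that reproduces the two probabilities just computed, again reusing the table P of the first sketch:

```python
def trajectory_probability(trajectory):
    """Probability of a trajectory ((s0, a0), ..., (s_{N-1}, a_{N-1}), sN):
    the product of the elementary transition probabilities."""
    prob = 1.0
    steps = trajectory[:-1]      # the (state, action) couples
    arrivals = trajectory[1:]    # the point reached after each step
    for (state, action), arrival in zip(steps, arrivals):
        next_state = arrival[0] if isinstance(arrival, tuple) else arrival
        prob *= P[(state, action)][next_state]
    return prob

print(trajectory_probability([(24, "E"), (14, "E"), 24]))  # 0.1 * 0.1 = 0.01
print(trajectory_probability([(24, "E"), (34, "E"), 35]))  # 0.1 * 0.8 = 0.08
```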
Therefore, even more clearly than in the deterministic problem, it is
impossible to enumerate all the trajectories in order to compute the cost of
a policy: combinatorial explosion quickly makes that computation intractable.
Appropriate methods are needed to value a policy. Those methods will be
developed in a later section, dedicated to reinforcement learning and neuro-
dynamic programming. We are looking for an algorithm that makes it possible
to value a policy without explicitly enumerating all its trajectories.