When modeling that example, it is more natural to take a unit cost for
each (state, action) pair and to choose a terminal cost equal to −A for the
target 35 and to +A for any other terminal state. Thus, one gets the following
values for the previous trajectories:

J_N(ω₁) = 10 + A,  J_N(ω₂) = 10 + A,  J_N(ω₃) = 10 + A,  J_N(ω₄) = 10 − A.
Unfortunately, the horizon is not precisely known in most applications. There-
fore, infinite-horizon problems are often considered. In such problems, it is
not always possible to define the total cost as the sum of the elementary costs
of all steps: that sum may diverge when the number of steps is infinite. There
are several possible ways of defining the total cost of an infinite-length
trajectory.
One can define, when it exists, the limit of the average cost over the first
N steps of the trajectory as N goes to infinity. In our simple problem, that
criterion does not make sense: it amounts to assigning the cost −A to every
trajectory that ends on the desired target, whatever the number of steps needed
to reach it, and the cost 1 to any other trajectory. Trajectories that reach
the desired target quickly are therefore not favored.
When the problem is to reach a specified target within a finite number of
steps, it is possible to take the effective sum of the elementary costs as the
total cost, just as in finite-horizon problems. That is the case in our example.
In the general case, we suggest taking the discounted cost as the global
criterion to minimize. That cost function is inspired by economics, where
future costs are discounted at a discount rate α, with 0 < α < 1: the
elementary cost incurred at step n is weighted by αⁿ.
Thus, in our problem, for an infinite-horizon model, we have

J_α(ω₁) = J_α(ω₂) = J_α(ω₃) = 1 + α + α² + ··· = 1/(1 − α),

J_α(ω₄) = 1 + α + α² + ··· + α⁹ − Aα¹⁰ = (1 − α¹⁰)/(1 − α) − Aα¹⁰.
Thus, trajectories that reach the target quickly are favored.
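To make the comparison concrete, here is a minimal Python sketch of the discounted-cost computation above; the numerical values α = 0.9 and A = 100 are arbitrary choices for illustration and are not taken from the example.

```python
# A minimal sketch of the discounted-cost computation above; the values of
# alpha and A are arbitrary choices for illustration, not from the text.

def discounted_cost(step_costs, alpha, terminal_cost=0.0):
    """Discounted cost of a finite trajectory: sum_n alpha^n * c_n,
    with the terminal cost discounted at the step where the trajectory ends."""
    total = sum(c * alpha**n for n, c in enumerate(step_costs))
    return total + terminal_cost * alpha**len(step_costs)

alpha, A = 0.9, 100.0

# Trajectory omega_4: ten unit-cost steps, then the target (terminal cost -A).
j_omega4 = discounted_cost([1.0] * 10, alpha, terminal_cost=-A)

# Trajectories omega_1..omega_3 wander forever; their cost is the limit
# 1 + alpha + alpha^2 + ... = 1/(1 - alpha) of the geometric series.
j_wander = 1.0 / (1.0 - alpha)

print(j_omega4)   # (1 - alpha^10)/(1 - alpha) - A*alpha^10, about -28.35
print(j_wander)   # 10.0
```

With these illustrative values, the trajectory that reaches the target gets a strongly negative cost, while the wandering trajectories pay the full 1/(1 − α).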
The problem consists in finding an optimal policy π such that the total
cost of the state-action trajectory associated with that policy is minimal for
any initial state.
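To illustrate what "minimal for any initial state" means, here is a small sketch comparing two policies on a hypothetical deterministic chain; the chain, the action names, and the constants below are invented for the illustration and do not come from the example.

```python
# A toy illustration (not from the text) of comparing policies by the total
# discounted cost they induce from every initial state. States 0..5 form a
# deterministic chain; state 5 is the target; actions move right or stay.

ALPHA, A, TARGET = 0.9, 100.0, 5

def step(state, action):
    """Hypothetical deterministic dynamics: 'right' moves one cell toward
    the target, 'stay' leaves the state unchanged."""
    return min(state + 1, TARGET) if action == "right" else state

def policy_cost(policy, state, max_steps=200):
    """Roll out the policy and accumulate the discounted cost: unit cost
    per step, terminal cost -A on reaching the target."""
    total, discount = 0.0, 1.0
    for _ in range(max_steps):
        if state == TARGET:
            return total - A * discount
        state = step(state, policy[state])
        total += discount          # unit elementary cost for this step
        discount *= ALPHA
    return total                   # target never reached within max_steps

go = {s: "right" for s in range(TARGET)}    # heads straight for the target
lazy = {s: "stay" for s in range(TARGET)}   # never reaches it

# The "go" policy has a lower discounted cost from every initial state.
for s in range(TARGET):
    assert policy_cost(go, s) < policy_cost(lazy, s)
```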
5.3.2 Example of a Markov Decision Problem
A Markov decision problem (MDP) is the generalization of the previous problem
to a stochastic framework. Randomness is introduced in the state evolution
model and in the costs. The total cost J is then a random variable. We have to
minimize a functional, which is just the mathematical expectation¹ of that
random variable, J̄ = E(J).
¹ Let us recall that the mathematical expectation of a random variable is the
average of that random variable over its probability distribution. Since
dynamics are considered here, the probability law is defined on the set of
trajectories.
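Since the probability law is defined on the set of trajectories, the expectation J̄ can be approximated by averaging the discounted costs of sampled trajectories. The following sketch does this under invented stochastic dynamics (a chain where the chosen move succeeds only with probability 0.8); all names and constants are assumptions made for the illustration.

```python
# A minimal sketch of what "minimizing the expectation of the total cost"
# means in the stochastic case: the model below (a chain where the chosen
# move succeeds with probability 0.8) is invented for illustration.
import random

ALPHA, A, TARGET, N_RUNS = 0.9, 100.0, 5, 10_000

def noisy_step(state, action):
    """Hypothetical stochastic dynamics: the action succeeds with
    probability 0.8; otherwise the state does not change."""
    if action == "right" and random.random() < 0.8:
        return min(state + 1, TARGET)
    return state

def rollout_cost(start, max_steps=500):
    """Discounted cost of one random trajectory under the 'always move
    right' policy; this total cost J is a random variable."""
    state, total, discount = start, 0.0, 1.0
    for _ in range(max_steps):
        if state == TARGET:
            return total - A * discount
        state = noisy_step(state, "right")
        total += discount
        discount *= ALPHA
    return total

# Monte Carlo estimate of the expectation over the trajectory distribution.
j_bar = sum(rollout_cost(0) for _ in range(N_RUNS)) / N_RUNS
print(f"estimated expected cost from state 0: {j_bar:.2f}")
```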
 