actions. Nowadays, in complex practical applications, the knowledge of the process is very often embodied in a computer simulator. The simulator program models real events, so that the transition probabilities are not directly available. Sometimes, one has to estimate them by performing a preliminary, extensive set of simulations.
These considerations lead researchers to use simulation directly to determine the optimal policy, without identifying the model by estimating the transition probabilities.
The simplest use of a Monte Carlo method to evaluate a policy consists in simulating a large number of trajectories from each initial state, and then computing the average cost over the trajectory set. In the same way, one can estimate the (state, action) value function by averaging the trajectory cost over a trajectory set generated for every feasible initial (state, action) couple, applying the current policy after the first transition.
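As an illustration, here is a minimal sketch of this plain Monte Carlo evaluation, written in Python. The simulator interface simulate_step(state, action), the actions(state) enumeration, the discount factor and the truncation horizon are assumptions made for the example, not elements of the text.

def mc_policy_evaluation(states, actions, policy, simulate_step,
                         gamma=0.95, n_trajectories=1000, horizon=200):
    """Plain Monte Carlo evaluation of a fixed policy.

    simulate_step(state, action) is an assumed simulator interface returning
    (next_state, cost); actions(state) lists the feasible actions in a state."""

    def rollout(state, first_action=None):
        # One simulated trajectory; returns its discounted cumulative cost,
        # with the infinite-horizon sum truncated after `horizon` steps.
        total, discount = 0.0, 1.0
        for t in range(horizon):
            action = first_action if (t == 0 and first_action is not None) else policy(state)
            state, cost = simulate_step(state, action)
            total += discount * cost
            discount *= gamma
        return total

    V, Q = {}, {}
    for s in states:
        # Value of the policy: average cost of trajectories started in s.
        V[s] = sum(rollout(s) for _ in range(n_trajectories)) / n_trajectories
        # (state, action) values: force the first action, then follow the policy.
        for a in actions(s):
            Q[(s, a)] = sum(rollout(s, a) for _ in range(n_trajectories)) / n_trajectories
    return V, Q

The (state, action) estimate simply forces the first action and lets the current policy act afterwards, as described above.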
The advantage of the Monte Carlo method over exact methods is that
it may be implemented even when the mathematical model is not known,
provided it is possible to perform intensive experiments or simulations. The
optimal policy is no longer determined from the model, but from experiments and from the response of the environment. That response is called the reinforcement signal. When this signal is positive, the policy modifications currently under test are validated; when it is negative, they are discarded. This type of learning process is called reinforcement learning. The terminology follows the same conceptual line as the actor-critic methodology that was examined previously. Reinforcement learning has always attracted researchers, especially in Artificial Intelligence, because it is generally considered that the adaptation mechanisms of living systems are governed by such principles (in particular, Pavlov's work on reflexes and the work of other psychologists of the last century may be cited). Reinforcement learning was first developed independently of neural learning [Barto et al. 1983]. In this section, we describe standard methods that widely outperform the simple Monte Carlo simulations that were just presented as an introduction.
Actually, the complexity of the straightforward Monte Carlo method that has just been described is generally too high for usual applications. When the model is known, its complexity may be larger than that of direct linear-system inversion and, even when the model is not known, it may be impossible to obtain a result within a reasonable time. Moreover, this method throws away useful information. Indeed, the transition costs of a given trajectory give us information not only about the value of the cost function at the initial state of the trajectory, but also about its values at all the intermediate states of the trajectory. In the following section, we describe a method that takes advantage of the information provided by experiments or simulations in a more efficient way.
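To make the wasted information concrete, a single simulated trajectory already yields an observed cost-to-go for every state it visits. The following sketch (with an assumed discount factor) computes these returns backward from the list of transition costs; the plain Monte Carlo scheme above keeps only the first of them.

def returns_along_trajectory(costs, gamma=0.95):
    """Given the sequence of transition costs of one simulated trajectory,
    compute the discounted cost-to-go observed from every visited state.
    returns[t] is a sample of the cost function at the t-th state of the
    trajectory; the plain Monte Carlo scheme uses only returns[0]."""
    returns = [0.0] * len(costs)
    acc = 0.0
    for t in reversed(range(len(costs))):
        acc = costs[t] + gamma * acc
        returns[t] = acc
    return returns

For instance, returns_along_trajectory([1.0, 0.5, 2.0]) provides three cost samples, one for each state visited along that trajectory, instead of a single one.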
We will describe it in the framework of Markov decision problems with infinite horizon and discounted cost. As previously, the discounted cost is the expected sum of the transition costs weighted by successive powers of the discount factor.
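As a reminder, and with notation chosen here for illustration rather than taken verbatim from the earlier chapters, the infinite-horizon discounted cost of a policy \pi started in state x_0 can be written

J^{\pi}(x_0) \;=\; \mathbb{E}\!\left[\sum_{k=0}^{\infty} \gamma^{k}\, c\bigl(x_k, \pi(x_k)\bigr)\right], \qquad 0 < \gamma < 1,

where c denotes the transition cost and \gamma the discount factor.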