approach can be extended and modified to have multiple outputs for multiple actions, or to have multiple neural networks that represent either different actions or different regions of the state space (Wiering 1995; Kaelbling et al. 1996; Lazaric 2008; Kalyanakrishnan and Stone 2009). None of these rather rudimentary approaches has been found to be superior for all applications, though many implementations share the common quality of using networks with a single hidden layer (Ghory 2004).
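To make the multiple-output variant concrete, the sketch below is a minimal NumPy example and not an implementation from any of the cited works: a single-hidden-layer network (the layer sizes and the tanh activation are arbitrary choices for illustration) maps a state vector to one estimated value per action. The alternative of one network per action amounts to keeping a list of such networks with a single output each.

import numpy as np

class ActionValueNetwork:
    # Single hidden layer, one output per action (illustrative sketch only).
    def __init__(self, n_state_features, n_hidden, n_actions, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(scale=0.1, size=(n_state_features, n_hidden))
        self.b1 = np.zeros(n_hidden)
        self.W2 = rng.normal(scale=0.1, size=(n_hidden, n_actions))
        self.b2 = np.zeros(n_actions)

    def values(self, state):
        # The hidden layer plays the role of the implicitly derived features.
        hidden = np.tanh(state @ self.W1 + self.b1)
        return hidden @ self.W2 + self.b2

# A 6-feature state evaluated for 4 actions; the greedy action maximises the output.
net = ActionValueNetwork(n_state_features=6, n_hidden=20, n_actions=4)
action_values = net.values(np.zeros(6))
greedy_action = int(np.argmax(action_values))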
The use of neural networks for reinforcement learning is attractive for a number of reasons. The first is that, unlike look-up tables but similar to linear methods, they can generalize state values to states that have not been explicitly visited. In domains with smooth and well-behaved value functions, this can be extremely useful and can reduce the amount of training required to learn the domain. The second reason is that neural networks are parameterized by a relatively small number of parameters, especially when compared to the size of the state space of some domains. Finally, neural networks do not necessarily require carefully hand-crafted state features as inputs. Rather, the hidden layer(s) of the neural network can derive implicit features that are deemed to be useful (Konen and Beielstein 2009), though these derived features often cannot be interpreted (Günther 2008).
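The difference in parameter count can be illustrated with a back-of-the-envelope comparison; the domain sizes below are hypothetical, chosen only to show the scale of the gap between a look-up table and a single-hidden-layer network.

# Hypothetical domain: 10 state variables, each discretised into 20 levels.
n_table_entries = 20 ** 10                       # one stored value per state

# Single-hidden-layer network over the same 10 raw state variables.
n_inputs, n_hidden, n_outputs = 10, 50, 1
n_weights = (n_inputs * n_hidden + n_hidden) + (n_hidden * n_outputs + n_outputs)

print(n_table_entries)   # 10_240_000_000_000 table entries
print(n_weights)         # 601 network parameters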
Despite these attractive properties, neural networks are not the dominant representation for most of the reinforcement learning community, for a number of reasons. Replacing a look-up table with a complex function approximator, such as a neural network, is not trivial and is viewed by some as not being robust (Boyan and Moore 1995). The convergence proofs of reinforcement learning algorithms for linear methods have not been extended to non-linear function approximators (Tsitsiklis and Roy 1996, 1997), and these learning algorithms may find suboptimal solutions or may diverge (Boyan and Moore 1995; Bertsekas and Tsitsiklis 1996). The primary reason for this is that non-linear methods tend to exaggerate small changes in the target function, and this exaggeration causes the value iteration algorithm to become unstable (Thrun and Schwartz 1993; Gordon 1995). It has also been suggested that the use of discounting in reinforcement learning can potentially lead to instability with value iteration algorithms (Thrun and Schwartz 1993).
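The update at the centre of this instability can be written out explicitly. The sketch below is a generic semi-gradient TD(0) step with a non-linear value network, not the specific algorithms analysed by the cited authors; the layer sizes, tanh activation, learning rate, and sample transition are all arbitrary illustrative choices. The point to note is that the bootstrapped target reuses the network's own estimate at the next state, so an exaggerated change to the function at one state feeds directly into the targets computed for others.

import numpy as np

rng = np.random.default_rng(0)
n_inputs, n_hidden = 4, 16
gamma, alpha = 0.99, 1e-3

# Single-hidden-layer value network v(s); all sizes are arbitrary for the example.
W1 = rng.normal(scale=0.1, size=(n_inputs, n_hidden))
b1 = np.zeros(n_hidden)
w2 = rng.normal(scale=0.1, size=n_hidden)
b2 = 0.0

def value(s):
    h = np.tanh(s @ W1 + b1)
    return h, float(h @ w2 + b2)

# One observed transition (s, r, s').
s, r, s_next = np.ones(n_inputs), 1.0, np.zeros(n_inputs)

# Semi-gradient TD(0): the bootstrapped target reuses the network's own
# (possibly already wrong) estimate at s'.
h, v_s = value(s)
_, v_next = value(s_next)
delta = (r + gamma * v_next) - v_s     # TD error

# Gradient of v(s) with respect to each parameter group (back-prop through tanh).
dh = w2 * (1.0 - h ** 2)
W1 += alpha * delta * np.outer(s, dh)
b1 += alpha * delta * dh
w2 += alpha * delta * h
b2 += alpha * delta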
The successful use of a neural network depends on the properties of the underlying value function as well as on the ability of the network to approximate it (Thrun and Schwartz 1993; Dietterich 2000). Though the ability of a neural network to generalize to unvisited states is attractive, this property can be unfavorable when the value function is not smooth or not well-behaved, which may require longer training or could result in the learning algorithm diverging (Riedmiller 2005). Such situations may arise when the numerical representations of the state vectors are close together, yet their state values are quite different (Dayan 1993; Mahadevan and Maggioni 2007; Osentoski 2009). It has been suggested that the dynamic capability of a neural network depends on its architecture (Loone and Irwin 2001; Igel 2003), and thus a poorly chosen network architecture may either perform poorly or diverge during training (Hans and Udluft 2010); others claim that the neural network architecture must be tailored to the application (Günther 2008).