These problems are simpler than their real-world, end-goal counterparts, but the
insights gained from them can be valuable in achieving that end goal.
For example, in the multi-armed bandit problem, an agent attempts to maximize
its reward from multiple single-armed bandits (i.e., slot machines), where each
bandit has a different expected reward. Learning how to maximize performance in
this domain requires both exploitation of existing knowledge and exploration of
potentially better actions. This problem scenario may be used to gain an
understanding of the dynamics of a problem, and such knowledge has been used to
aid in the application of similar methods to problems in energy management or
robotics (Galichet et al. 2013; Karnin et al. 2013).
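To make the exploitation-exploration trade-off concrete, the following minimal sketch runs an epsilon-greedy agent on a set of Gaussian bandits. The arm means, the value of epsilon, and the function name are illustrative assumptions rather than details from the cited works.

```python
import random

def epsilon_greedy_bandit(true_means, steps=10_000, epsilon=0.1, seed=0):
    """Minimal epsilon-greedy agent on a Gaussian multi-armed bandit (illustrative)."""
    rng = random.Random(seed)
    n = len(true_means)
    counts = [0] * n          # number of pulls per arm
    estimates = [0.0] * n     # sample-mean reward estimate per arm
    total_reward = 0.0
    for _ in range(steps):
        # Explore with probability epsilon; otherwise exploit the best estimate.
        if rng.random() < epsilon:
            arm = rng.randrange(n)
        else:
            arm = max(range(n), key=lambda a: estimates[a])
        reward = rng.gauss(true_means[arm], 1.0)  # noisy payout from the chosen bandit
        counts[arm] += 1
        # Incremental update of the sample mean for the pulled arm.
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
        total_reward += reward
    return estimates, total_reward

# Example: three arms with assumed expected rewards of 0.2, 0.5, and 0.9.
estimates, total = epsilon_greedy_bandit([0.2, 0.5, 0.9])
```

With a small epsilon, the agent spends most pulls on the arm it currently believes is best, so its total reward hinges on how much it explored early on; this is the exploitation-exploration tension described above.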
2.1.4 Generalized Domains
There is another class of domains to which reinforcement learning has been applied
that deserves attention. These domains are not based on any physical, real-world,
or game-like setting; rather, they can be considered generalized domains that are
either completely abstract environments or environments with parameterized
characteristics. The term abstract in this sense refers to an environment that
lacks a physical representation and is merely a numerical representation with
defined properties, dynamics, and constraints. While not all of the environments
mentioned in this section have been used with reinforcement learning specifically,
all have been used with some form of sequential decision-making algorithm.
One of the earliest purely abstract generalized domains was developed for
infinite-horizon problems by Archibald et al. (1995), out of a need for a more
flexible and widely available platform for assessing the characteristics of
Markov decision processes. This domain is highly parameterized, using 14 variables
to determine its dynamics; one of the novelties of the work was that it allowed
control of the mixing speed of the problem (Aldous 1983), which is related to the
relative importance of the starting state of the process. However, this type of
domain is based on enumerated states used with look-up table representations,
rather than the state vectors commonly used in present-day problems.
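To illustrate the distinction, the sketch below shows the enumerated-state, look-up table style of representation: values are stored in a table indexed directly by integer state and action identifiers, with no feature vector involved. This is a generic tabular Q-learning backup, not Archibald et al.'s generator, and the names and constants here are hypothetical.

```python
import random

n_states, n_actions = 50, 4
# Look-up table representation: one stored value per enumerated (state, action) pair.
Q = [[0.0] * n_actions for _ in range(n_states)]

def td_update(s, a, r, s_next, alpha=0.1, gamma=0.95):
    """One tabular Q-learning backup, indexing the table by integer state ids."""
    best_next = max(Q[s_next])
    Q[s][a] += alpha * (r + gamma * best_next - Q[s][a])

# Example backup on a randomly chosen transition.
rng = random.Random(0)
s, a = rng.randrange(n_states), rng.randrange(n_actions)
td_update(s, a, r=rng.random(), s_next=rng.randrange(n_states))
```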
Garnet (Generalized Average Reward Non-stationary Environment Testbed) problems
are a similar type of generalized abstract domain that has been developed more
recently, though these domains are for episodic learning problems (Bhatnagar et al.
2009; Castro and Mannor 2010; Awate 2009), as is considered in the present work.
The domains generated by Bhatnagar et al. (2009) are based on a simpler,
five-parameter construction defined by: (1) the number of states, (2) the number
of actions, (3) the branching factor, (4) the standard deviation of the reward(s),
and (5) the stationarity. Two additional parameters are the dimensionality and the
number of non-zero components of the state vectors. The domains created by Awate
(2009) and Castro and Mannor (2010) omit the stationarity parameter but otherwise
use the same domain construction method.
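This five-parameter construction lends itself to a compact generator. The sketch below builds a stationary Garnet-style MDP from the number of states, number of actions, branching factor, and reward standard deviation, and attaches sparse state vectors with a given dimensionality and number of non-zero components. The function name, default values, and the random-cut scheme for splitting transition probabilities are assumptions made for illustration, not the exact method of the cited papers.

```python
import numpy as np

def make_garnet(n_states, n_actions, branching, reward_std,
                dim=8, n_nonzero=3, seed=0):
    """Generate a stationary Garnet-style MDP (an illustrative sketch)."""
    rng = np.random.default_rng(seed)
    # For each (state, action) pair, probability mass is spread over
    # `branching` randomly chosen successor states.
    P = np.zeros((n_states, n_actions, n_states))
    for s in range(n_states):
        for a in range(n_actions):
            successors = rng.choice(n_states, size=branching, replace=False)
            # Random cut points on [0, 1] give a random split of the mass.
            cuts = np.sort(rng.uniform(size=branching - 1))
            P[s, a, successors] = np.diff(np.concatenate(([0.0], cuts, [1.0])))
    # Mean reward per (state, action); sampled rewards add Gaussian noise
    # with the given standard deviation.
    R_mean = rng.uniform(size=(n_states, n_actions))
    def sample_reward(s, a):
        return R_mean[s, a] + reward_std * rng.standard_normal()
    # Sparse state vectors: `n_nonzero` of `dim` components are non-zero.
    features = np.zeros((n_states, dim))
    for s in range(n_states):
        idx = rng.choice(dim, size=n_nonzero, replace=False)
        features[s, idx] = rng.standard_normal(n_nonzero)
    return P, sample_reward, features

# Example: 30 states, 4 actions, branching factor 3, reward noise 0.1.
P, sample_reward, X = make_garnet(30, 4, branching=3, reward_std=0.1)
```

Omitting the stationarity parameter, as in the episodic variants above, simply means the transition tensor P is generated once and held fixed for the life of the problem.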