Reinforcement learning is more formally described as a method to determine op-
timal action selection policies in sequential decision making processes. The general
framework is based on sequential interactions between an agent and its environ-
ment, where the environment is characterized by a set of states X , and the agent
can pursue actions from the set of admissible actions A . The agent interacts with
the environment and transitions from state $x^{(t)} = x$ to state $x^{(t+1)} = x'$ by selecting an action $a^{(t)} = a$. Transitions between states are often based on some defined probability $P^{a}_{xx'}$. In this particular section, the time index of variables is denoted using a parenthetic superscript or using prime notation (i.e., an apostrophe), whichever is clearest or notationally cleanest, and subscripts are reserved for denoting other entities. The time index denotes the particular time step at which states are visited, actions are pursued, or rewards are received, and thus the sequences of states and actions can then be thought of as a progression through time.
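As a brief illustration of this framework, the following sketch samples a single agent-environment transition from a hypothetical finite state set X and action set A; the names STATES, ACTIONS, transition_probs, and step, along with the probability values, are illustrative assumptions and not part of the original formulation.

```python
import random

# Hypothetical finite sets of states X and admissible actions A.
STATES = ["x0", "x1", "x2"]   # the set X
ACTIONS = ["a0", "a1"]        # the set A

# Transition probabilities P^a_{xx'}: probability of moving from state x
# to state x' when action a is selected (values are made up for illustration).
transition_probs = {
    ("x0", "a0"): {"x0": 0.1, "x1": 0.9, "x2": 0.0},
    ("x0", "a1"): {"x0": 0.0, "x1": 0.2, "x2": 0.8},
    ("x1", "a0"): {"x0": 0.5, "x1": 0.0, "x2": 0.5},
    ("x1", "a1"): {"x0": 0.0, "x1": 0.0, "x2": 1.0},
    ("x2", "a0"): {"x0": 0.0, "x1": 0.0, "x2": 1.0},
    ("x2", "a1"): {"x0": 0.0, "x1": 0.0, "x2": 1.0},
}

def step(x, a):
    """Sample the next state x' given the current state x and selected action a."""
    dist = transition_probs[(x, a)]
    next_states = list(dist.keys())
    weights = list(dist.values())
    return random.choices(next_states, weights=weights, k=1)[0]

# One interaction: from x^(t) = x, choosing a^(t) = a, the agent moves to x^(t+1) = x'.
x = "x0"
a = random.choice(ACTIONS)
x_next = step(x, a)
print(f"x={x}, a={a} -> x'={x_next}")
```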
The sequential decision making process therefore consists of a sequence of states $x = \{x^{(0)}, x^{(1)}, \ldots, x^{(T)}\}$ and a sequence of actions $a = \{a^{(0)}, a^{(1)}, \ldots, a^{(T)}\}$ for time steps $t = 0, 1, \ldots, T$, where state $x^{(0)}$ is the initial state and where $x^{(T)}$ is the
terminal state. Feedback may be provided to the agent at each time step t in the
form of a reward r ( t ) based on the particular action pursued. This feedback may be
either rewarding (positive) for pursuing actions that lead to beneficial outcomes, or
they may be aversive (negative) for pursuing actions that lead to worse outcomes.
The terms feedback and reward will be used interchangeably, and it is important
to note that a reward may be positive or negative, despite its conventional positive
connotation. Processes that provide feedback at every time step t have an associated
reward sequence $r = \{r^{(1)}, r^{(2)}, \ldots, r^{(T)}\}$. A complete process of agent-environment interaction from $t = 0, 1, \ldots, T$ is referred to as an episode consisting of the set of states visited, actions pursued, and rewards received. The total reward received for a single episode is the sum of the rewards received during the entire decision making process,

$$R = \sum_{t=1}^{T} r^{(t)}.$$
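As a small numerical illustration of this total reward, the sketch below sums a hypothetical reward sequence for one episode; the reward values are arbitrary and chosen only for demonstration.

```python
# Hypothetical reward sequence r = {r^(1), ..., r^(T)} for one episode (T = 4 here);
# the values are invented purely for illustration.
rewards = [1.0, -0.5, 0.0, 2.0]   # r^(1), r^(2), r^(3), r^(4)

# Total reward for the episode: R = sum over t = 1..T of r^(t).
R = sum(rewards)
print(R)   # 2.5
```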
For the general sequential decision making process defined above, the transition probabilities between states $x^{(t)} = x$ and $x^{(t+1)} = x'$ may be dependent on the complete sequences of previous states, actions, and rewards:

$$\Pr\left\{ x^{(t+1)} = x',\; r^{(t+1)} = r \;\middle|\; x^{(t)}, a^{(t)}, r^{(t)}, x^{(t-1)}, a^{(t-1)}, r^{(t-1)}, \ldots, x^{(1)}, a^{(1)}, r^{(1)}, x^{(0)}, a^{(0)} \right\} \qquad (2.1)$$
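To make the generality of Eq. 2.1 concrete, the following sketch samples a next state from a rule that depends on the entire episode history; the function, its arguments, and the rule itself are invented for illustration and are not part of the formulation above.

```python
import random

# In the general (non-Markovian) case of Eq. 2.1, the next state may depend on the
# entire history of states, actions, and rewards.  This toy rule is purely hypothetical:
# the chance of reaching a terminal state grows with the number of steps already taken.
def history_dependent_step(history):
    """`history` is a list of (x, a, r) tuples from t = 0 up to the present step."""
    p_terminal = min(1.0, 0.1 * len(history))   # depends on the whole history
    if random.random() < p_terminal:
        return "terminal"
    return "x{}".format(len(history) % 3)       # some non-terminal state

print(history_dependent_step([("x0", "a0", 0.0), ("x1", "a1", 1.0)]))
```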
Processes in which the state transition probabilities depend only on the most recent state information $(x^{(t)}, a^{(t)})$ satisfy the Markovian assumption. Under this assumption, the state transition probabilities can be expressed as:

$$\Pr\left\{ x^{(t+1)} = x',\; r^{(t+1)} = r \;\middle|\; x^{(t)}, a^{(t)} \right\}$$
which is equivalent to the expression in Eq. 2.1 because all information from time steps $t = 0, 1, \ldots, t-1$ has no effect on the state transition at time $t$.
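The contrast with the history-dependent case can be sketched as follows, where the sampling distribution over $(x^{(t+1)}, r^{(t+1)})$ is determined entirely by the most recent pair $(x^{(t)}, a^{(t)})$ and any supplied history is ignored; the dynamics table and function names are hypothetical.

```python
import random

# Under the Markovian assumption, the distribution over (x^(t+1), r^(t+1)) is fully
# determined by the most recent pair (x^(t), a^(t)); the earlier history of states,
# actions, and rewards can be discarded.  The dynamics table below is illustrative only.
dynamics = {
    # (x, a): list of (next_state, reward, probability)
    ("x0", "a0"): [("x1", 1.0, 0.9), ("x0", 0.0, 0.1)],
    ("x0", "a1"): [("x2", -1.0, 1.0)],
    ("x1", "a0"): [("x2", 2.0, 1.0)],
}

def markov_step(x, a, history=None):
    """Sample (x', r'); `history` is accepted but deliberately ignored, mirroring
    the equivalence of the two expressions above."""
    outcomes = dynamics[(x, a)]
    probs = [p for (_, _, p) in outcomes]
    x_next, r_next, _ = random.choices(outcomes, weights=probs, k=1)[0]
    return x_next, r_next

print(markov_step("x0", "a0", history=[("x0", "a1", 0.0)]))
```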
If a process can be assumed to be Markovian and can be posed as a problem
following the general formulation defined above, this process is amenable to both