Reinforcement learning is more formally described as a method to determine op-
timal action selection policies in sequential decision making processes. The general
framework is based on sequential interactions between an agent and its environ-
ment, where the environment is characterized by a set of states X , and the agent
can pursue actions from the set of admissible actions A . The agent interacts with
the environment and transitions from state $x^{(t)} = x$ to state $x^{(t+1)} = x'$ by selecting an action $a^{(t)} = a$. Transitions between states are often based on some defined probability $P^{a}_{xx'}$. In this particular section, the time index of variables is denoted using a parenthetic superscript or using prime notation (i.e., an apostrophe), whichever is clearest or notationally cleanest, and subscripts are reserved for denoting other entities. The time index denotes the particular time step at which states are visited, actions are pursued, or rewards are received, and thus the sequences of states and actions can then be thought of as a progression through time.
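As a brief illustration of this framework, the following sketch samples a single agent-environment transition from a hypothetical finite state set X and action set A; the names STATES, ACTIONS, transition_probs, and step, along with the probability values, are illustrative assumptions and not part of the original formulation.

```python
import random

# Hypothetical finite sets of states X and admissible actions A.
STATES = ["x0", "x1", "x2"]   # the set X
ACTIONS = ["a0", "a1"]        # the set A

# Transition probabilities P^a_{xx'}: probability of moving from state x
# to state x' when action a is selected (values are made up for illustration).
transition_probs = {
    ("x0", "a0"): {"x0": 0.1, "x1": 0.9, "x2": 0.0},
    ("x0", "a1"): {"x0": 0.0, "x1": 0.2, "x2": 0.8},
    ("x1", "a0"): {"x0": 0.5, "x1": 0.0, "x2": 0.5},
    ("x1", "a1"): {"x0": 0.0, "x1": 0.0, "x2": 1.0},
    ("x2", "a0"): {"x0": 0.0, "x1": 0.0, "x2": 1.0},
    ("x2", "a1"): {"x0": 0.0, "x1": 0.0, "x2": 1.0},
}

def step(x, a):
    """Sample the next state x' given the current state x and selected action a."""
    dist = transition_probs[(x, a)]
    next_states = list(dist.keys())
    weights = list(dist.values())
    return random.choices(next_states, weights=weights, k=1)[0]

# One interaction: from x^(t) = x, choosing a^(t) = a, the agent moves to x^(t+1) = x'.
x = "x0"
a = random.choice(ACTIONS)
x_next = step(x, a)
print(f"x={x}, a={a} -> x'={x_next}")
```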
The sequential decision making process therefore consists of a sequence of states $x = \{x^{(0)}, x^{(1)}, \ldots, x^{(T)}\}$ and a sequence of actions $a = \{a^{(0)}, a^{(1)}, \ldots, a^{(T)}\}$ for time steps $t = 0, 1, \ldots, T$, where state $x^{(0)}$ is the initial state and where $x^{(T)}$ is the
terminal state. Feedback may be provided to the agent at each time step t in the
form of a reward r ( t ) based on the particular action pursued. This feedback may be
either rewarding (positive) for pursuing actions that lead to beneficial outcomes, or
they may be aversive (negative) for pursuing actions that lead to worse outcomes.
The terms feedback and reward will be used interchangeably, and it is important
to note that a reward may be positive or negative, despite its conventional positive
connotation. Processes that provide feedback at every time step t have an associated
reward sequence $r = \{r^{(1)}, r^{(2)}, \ldots, r^{(T)}\}$. A complete process of agent-environment interaction from $t = 0, 1, \ldots, T$ is referred to as an episode consisting of the set of states visited, actions pursued, and rewards received. The total reward received for a single episode is the sum of the rewards received during the entire decision making process,

$$R = \sum_{t=1}^{T} r^{(t)}.$$
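As a small numerical illustration of this total reward, the sketch below sums a hypothetical reward sequence for one episode; the reward values are arbitrary and chosen only for demonstration.

```python
# Hypothetical reward sequence r = {r^(1), ..., r^(T)} for one episode (T = 4 here);
# the values are invented purely for illustration.
rewards = [1.0, -0.5, 0.0, 2.0]   # r^(1), r^(2), r^(3), r^(4)

# Total reward for the episode: R = sum over t = 1..T of r^(t).
R = sum(rewards)
print(R)   # 2.5
```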
For the general sequential decision making process defined above, the transition probabilities between states $x^{(t)} = x$ and $x^{(t+1)} = x'$ may be dependent on the complete sequences of previous states, actions, and rewards:

$$\Pr\left\{ x^{(t+1)} = x',\; r^{(t+1)} = r \;\middle|\; x^{(t)}, a^{(t)}, r^{(t)}, x^{(t-1)}, a^{(t-1)}, r^{(t-1)}, \ldots, x^{(1)}, a^{(1)}, r^{(1)}, x^{(0)}, a^{(0)} \right\} \qquad (2.1)$$
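To make the generality of Eq. 2.1 concrete, the following sketch samples a next state from a rule that depends on the entire episode history; the function, its arguments, and the rule itself are invented for illustration and are not part of the formulation above.

```python
import random

# In the general (non-Markovian) case of Eq. 2.1, the next state may depend on the
# entire history of states, actions, and rewards.  This toy rule is purely hypothetical:
# the chance of reaching a terminal state grows with the number of steps already taken.
def history_dependent_step(history):
    """`history` is a list of (x, a, r) tuples from t = 0 up to the present step."""
    p_terminal = min(1.0, 0.1 * len(history))   # depends on the whole history
    if random.random() < p_terminal:
        return "terminal"
    return "x{}".format(len(history) % 3)       # some non-terminal state

print(history_dependent_step([("x0", "a0", 0.0), ("x1", "a1", 1.0)]))
```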
Processes in which the state transition probabilities depend only on the most recent state information $(x^{(t)}, a^{(t)})$ satisfy the Markovian assumption. Under this assumption, the state transition probabilities can be expressed as:

$$\Pr\left\{ x^{(t+1)} = x',\; r^{(t+1)} = r \;\middle|\; x^{(t)}, a^{(t)} \right\}$$
which is equivalent to the expression in Eq. 2.1 because all information from time steps $t = 0, 1, \ldots, t-1$ has no effect on the state transition at time $t$.
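The contrast with the history-dependent case can be sketched as follows, where the sampling distribution over $(x^{(t+1)}, r^{(t+1)})$ is determined entirely by the most recent pair $(x^{(t)}, a^{(t)})$ and any supplied history is ignored; the dynamics table and function names are hypothetical.

```python
import random

# Under the Markovian assumption, the distribution over (x^(t+1), r^(t+1)) is fully
# determined by the most recent pair (x^(t), a^(t)); the earlier history of states,
# actions, and rewards can be discarded.  The dynamics table below is illustrative only.
dynamics = {
    # (x, a): list of (next_state, reward, probability)
    ("x0", "a0"): [("x1", 1.0, 0.9), ("x0", 0.0, 0.1)],
    ("x0", "a1"): [("x2", -1.0, 1.0)],
    ("x1", "a0"): [("x2", 2.0, 1.0)],
}

def markov_step(x, a, history=None):
    """Sample (x', r'); `history` is accepted but deliberately ignored, mirroring
    the equivalence of the two expressions above."""
    outcomes = dynamics[(x, a)]
    probs = [p for (_, _, p) in outcomes]
    x_next, r_next, _ = random.choices(outcomes, weights=probs, k=1)[0]
    return x_next, r_next

print(markov_step("x0", "a0", history=[("x0", "a1", 0.0)]))
```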
If a process can be assumed to be Markovian and can be posed as a problem
following the general formulation defined above, this process is amenable to both