modeling and analysis. The Markov decision process (MDP) is a specific process which entails sequential decisions and consists of a set of states X, a set of actions A, and a reward function R(x). Each state x ∈ X has an associated true state value V(x). The decision process consists of repeated agent-environment interactions, during which the agent learns to associate states visited during the process with outcomes of the entire process, thus attempting to learn the true value of each state. However, the true state values are unobservable to the agent, and thus the agent's estimates of the state values V(x) are merely approximations of the true state values.
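As a purely illustrative aside, the idea of associating visited states with the outcome of the entire process can be sketched as a tabular Monte Carlo estimate of the state values. The function name, the (state, reward) episode format, and the discount parameter below are assumptions introduced here for concreteness, not part of the source formulation.

```python
# Illustrative sketch (not from the source): a tabular, every-visit Monte Carlo
# estimate of state values, formed by averaging the returns observed after each
# visit to a state over many complete episodes.
from collections import defaultdict

def estimate_state_values(episodes, gamma=1.0):
    """episodes: list of trajectories, each a list of (state, reward) pairs,
    where the reward is the one received after leaving that state.
    Returns a dict mapping each visited state to its average observed return."""
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    for episode in episodes:
        G = 0.0
        # Walk the episode backwards, accumulating the discounted return G
        # that followed each visit to a state.
        for state, reward in reversed(episode):
            G = reward + gamma * G
            returns_sum[state] += G
            returns_count[state] += 1
    return {s: returns_sum[s] / returns_count[s] for s in returns_sum}

# Example usage with two short, hypothetical episodes:
episodes = [[("s0", 0.0), ("s1", 1.0)], [("s0", 0.0), ("s2", 0.0)]]
print(estimate_state_values(episodes))  # e.g. {'s1': 1.0, 's0': 0.5, 's2': 0.0}
```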
There are many cases where the Markovian assumption may not be valid. Such cases include those where the agent cannot perfectly observe the state information, in which case the problem is referred to as a partially observable Markov decision process (POMDP) (Singh et al. 1994), or where there is a long temporal dependence between states and the feedback provided. Approaches to solving these types of problems often include retaining some form of the state history (Wierstra et al. 2007), such as by using recurrent neural networks or a more complex variant that relies on long short-term memory (LSTM) (Hochreiter and Schmidhuber 1997). These approaches, however, are not considered in this work.
The transition between states x(t) = x and x(t+1) = x′ when pursuing action a may be represented by a transition probability

P^a_{xx′} = Pr{ x(t+1) = x′ | x(t) = x, a(t) = a }.

For problems in which the transition probabilities between all states are known, the transition probabilities can be used to formulate an explicit model of the agent-environment interaction process. A policy π is defined to be the particular sequence of actions a = { a(0), a(1), ..., a(T) } that is selected throughout the decision-making process. Similarly, the expected reward for transitioning from state x to state x′ when pursuing action a may be represented by

R^a_{xx′} = E{ r(t+1) | x(t) = x, a(t) = a, x(t+1) = x′ }.

An optimal policy π* is considered to be the set of actions pursued during the process that maximizes the total cumulative reward.
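For illustration only, the quantities P^a_{xx′} and R^a_{xx′} can be stored as arrays and used to evaluate the expected cumulative reward of a fixed action sequence. The array layout, the random example numbers, and the helper function below are assumptions made for this sketch rather than the formulation used in the text.

```python
# Illustrative sketch (assumed array layout, not from the source): transition
# probabilities P[a, x, x'] and expected rewards R[a, x, x'] of a small MDP,
# plus the expected cumulative reward of a fixed action sequence a(0..T).
import numpy as np

n_states, n_actions = 3, 2
rng = np.random.default_rng(0)

# P[a, x, x'] = Pr{x(t+1) = x' | x(t) = x, a(t) = a}; each row sums to 1.
P = rng.random((n_actions, n_states, n_states))
P /= P.sum(axis=2, keepdims=True)

# R[a, x, x'] = E{r(t+1) | x(t) = x, a(t) = a, x(t+1) = x'}.
R = rng.random((n_actions, n_states, n_states))

def expected_cumulative_reward(policy, x0):
    """Expected total reward of a fixed action sequence starting from state x0."""
    dist = np.zeros(n_states)
    dist[x0] = 1.0            # probability distribution over the current state
    total = 0.0
    for a in policy:          # policy = [a(0), a(1), ..., a(T)]
        # expected one-step reward under the current state distribution
        total += dist @ (P[a] * R[a]).sum(axis=1)
        dist = dist @ P[a]    # propagate the state distribution one step
    return total

print(expected_cumulative_reward([0, 1, 0], x0=0))
```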
The optimal policy can be determined if the true state values V ( x ) are known
for every state. The true state values can be expressed using the Bellman optimality
equation (Bertsekas 1987 ), which states that the value of being in state x and pursuing
action a is a function of both the expected return of the current state and the optimal
policy for all subsequent states:
V(x) = max_a E{ r(t+1) + γ V(x(t+1)) | x(t) = x, a(t) = a }            (2.2)

     = max_{a ∈ A} Σ_{x′} P^a_{xx′} [ R^a_{xx′} + γ V(x′) ]            (2.3)

where γ is the discount factor.
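One minimal sketch of how Eq. (2.3) can be applied in practice is value iteration, which repeatedly sweeps the Bellman optimality update until the value estimates stop changing. The function below reuses the assumed P[a, x, x′] and R[a, x, x′] array layout from the earlier sketch and is not taken from the source.

```python
# Sketch of value iteration (an assumed illustration, not the authors' algorithm),
# which applies Eq. (2.3) repeatedly until the state values V(x) converge.
import numpy as np

def value_iteration(P, R, gamma=0.95, tol=1e-8):
    """P[a, x, x'], R[a, x, x'] as above; returns the converged values V(x)."""
    n_actions, n_states, _ = P.shape
    V = np.zeros(n_states)
    while True:
        # Q[a, x] = sum over x' of P[a, x, x'] * (R[a, x, x'] + gamma * V[x'])
        Q = (P * (R + gamma * V)).sum(axis=2)
        V_new = Q.max(axis=0)          # maximise over actions, per Eq. (2.3)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
```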
The Bellman equation provides a conceptual solution to the sequential decision-making problem in that the value of being in a state is a function of all future state values. The solution to reinforcement learning problems is often regarded as a policy π, or an action selection procedure, that leads to an optimal outcome. When implementing reinforcement learning, however, the Bellman equation does not have to be explicitly solved. Rather, the identification of optimal policies in reinforcement learning problems is based on the notion that future information is relevant to the valuation of current states, and that accurate estimates of state values can be used to determine the optimal policy.
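To illustrate this last point, once state-value estimates are available, a policy can be recovered by acting greedily with respect to a one-step lookahead of Eq. (2.3). The helper below again assumes the P and R arrays introduced earlier and is only a sketch, not the procedure used in this work.

```python
# Sketch (assumed, for illustration): once the state values V(x) have been
# estimated, a policy can be read off by acting greedily with respect to the
# one-step lookahead of Eq. (2.3).
import numpy as np

def greedy_policy(P, R, V, gamma=0.95):
    """Return, for each state x, the action maximising the expected return."""
    Q = (P * (R + gamma * V)).sum(axis=2)   # Q[a, x]
    return Q.argmax(axis=0)                 # best action in each state
```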