modeling and analysis. The Markov decision process (MDP) is a specific process which entails sequential decisions and consists of a set of states X, a set of actions A, and a reward function R(x). Each state x ∈ X has an associated true state value V(x). The decision process consists of repeated agent-environment interactions, during which the agent learns to associate states visited during the process with outcomes of the entire process, thus attempting to learn the true value of each state. However, the true state values are unobservable to the agent, and thus the agent's estimates of the state values V(x) are merely approximations of the true state values.
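As a purely illustrative aside, the idea of associating visited states with the outcome of the entire process can be sketched as a tabular Monte Carlo estimate of the state values. The function name, the (state, reward) episode format, and the discount parameter below are assumptions introduced here for concreteness, not part of the source formulation.

```python
# Illustrative sketch (not from the source): a tabular, every-visit Monte Carlo
# estimate of state values, formed by averaging the returns observed after each
# visit to a state over many complete episodes.
from collections import defaultdict

def estimate_state_values(episodes, gamma=1.0):
    """episodes: list of trajectories, each a list of (state, reward) pairs,
    where the reward is the one received after leaving that state.
    Returns a dict mapping each visited state to its average observed return."""
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    for episode in episodes:
        G = 0.0
        # Walk the episode backwards, accumulating the discounted return G
        # that followed each visit to a state.
        for state, reward in reversed(episode):
            G = reward + gamma * G
            returns_sum[state] += G
            returns_count[state] += 1
    return {s: returns_sum[s] / returns_count[s] for s in returns_sum}

# Example usage with two short, hypothetical episodes:
episodes = [[("s0", 0.0), ("s1", 1.0)], [("s0", 0.0), ("s2", 0.0)]]
print(estimate_state_values(episodes))  # e.g. {'s1': 1.0, 's0': 0.5, 's2': 0.0}
```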
There are many cases where the Markovian assumption may not be valid. Such cases include those where the agent cannot perfectly observe the state information, in which case the problem is referred to as a partially observable Markov decision process (POMDP) (Singh et al. 1994), or where there is a long temporal dependence between states and the feedback provided. Approaches to solving these types of problems often include retaining some form of the state history (Wierstra et al. 2007), such as by using recurrent neural networks or a more complex variant that relies on long short-term memory (LSTM) (Hochreiter and Schmidhuber 1997). These approaches, however, are not considered in this work.
The transition between states x(t) = x and x(t+1) = x′ when pursuing action a may be represented by a transition probability

P^a_{xx′} = Pr{ x(t+1) = x′ | x(t) = x, a(t) = a }.

For problems in which the transition probabilities between all states are known, the transition probabilities can be used to formulate an explicit model of the agent-environment interaction process. A policy π is defined to be the particular sequence of actions a = { a(0), a(1), ..., a(T) } that is selected throughout the decision-making process. Similarly, the expected reward for transitioning from state x to state x′ when pursuing action a may be represented by

R^a_{xx′} = E{ r(t+1) | x(t) = x, a(t) = a, x(t+1) = x′ }.

An optimal policy π* is considered to be the set of actions pursued during the process that maximizes the total cumulative reward.
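For illustration only, the quantities P^a_{xx′} and R^a_{xx′} can be stored as arrays and used to evaluate the expected cumulative reward of a fixed action sequence. The array layout, the random example numbers, and the helper function below are assumptions made for this sketch rather than the formulation used in the text.

```python
# Illustrative sketch (assumed array layout, not from the source): transition
# probabilities P[a, x, x'] and expected rewards R[a, x, x'] of a small MDP,
# plus the expected cumulative reward of a fixed action sequence a(0..T).
import numpy as np

n_states, n_actions = 3, 2
rng = np.random.default_rng(0)

# P[a, x, x'] = Pr{x(t+1) = x' | x(t) = x, a(t) = a}; each row sums to 1.
P = rng.random((n_actions, n_states, n_states))
P /= P.sum(axis=2, keepdims=True)

# R[a, x, x'] = E{r(t+1) | x(t) = x, a(t) = a, x(t+1) = x'}.
R = rng.random((n_actions, n_states, n_states))

def expected_cumulative_reward(policy, x0):
    """Expected total reward of a fixed action sequence starting from state x0."""
    dist = np.zeros(n_states)
    dist[x0] = 1.0            # probability distribution over the current state
    total = 0.0
    for a in policy:          # policy = [a(0), a(1), ..., a(T)]
        # expected one-step reward under the current state distribution
        total += dist @ (P[a] * R[a]).sum(axis=1)
        dist = dist @ P[a]    # propagate the state distribution one step
    return total

print(expected_cumulative_reward([0, 1, 0], x0=0))
```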
The optimal policy can be determined if the true state values V ( x ) are known
for every state. The true state values can be expressed using the Bellman optimality
equation (Bertsekas 1987 ), which states that the value of being in state x and pursuing
action a is a function of both the expected return of the current state and the optimal
policy for all subsequent states:
V(x) = max_a E{ r(t+1) + γ V(x(t+1)) | x(t) = x, a(t) = a }            (2.2)

     = max_{a ∈ A} Σ_{x′} P^a_{xx′} [ R^a_{xx′} + γ V(x′) ]            (2.3)

where γ is the discount factor.
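One minimal sketch of how Eq. (2.3) can be applied in practice is value iteration, which repeatedly sweeps the Bellman optimality update until the value estimates stop changing. The function below reuses the assumed P[a, x, x′] and R[a, x, x′] array layout from the earlier sketch and is not taken from the source.

```python
# Sketch of value iteration (an assumed illustration, not the authors' algorithm),
# which applies Eq. (2.3) repeatedly until the state values V(x) converge.
import numpy as np

def value_iteration(P, R, gamma=0.95, tol=1e-8):
    """P[a, x, x'], R[a, x, x'] as above; returns the converged values V(x)."""
    n_actions, n_states, _ = P.shape
    V = np.zeros(n_states)
    while True:
        # Q[a, x] = sum over x' of P[a, x, x'] * (R[a, x, x'] + gamma * V[x'])
        Q = (P * (R + gamma * V)).sum(axis=2)
        V_new = Q.max(axis=0)          # maximise over actions, per Eq. (2.3)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
```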
The Bellman equation provides a conceptual solution to the sequential decision-making problem in that the value of being in a state is a function of all future state values. The solution to reinforcement learning problems is often regarded as a policy π, or an action selection procedure, that leads to an optimal outcome. When implementing reinforcement learning, however, the Bellman equation does not have to be explicitly solved. Rather, the identification of optimal policies in reinforcement learning problems is based on the notion that future information is relevant to the valuation of current states, and that accurate estimates of state values can be used to determine the optimal policy.
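To illustrate this last point, once state-value estimates are available, a policy can be recovered by acting greedily with respect to a one-step lookahead of Eq. (2.3). The helper below again assumes the P and R arrays introduced earlier and is only a sketch, not the procedure used in this work.

```python
# Sketch (assumed, for illustration): once the state values V(x) have been
# estimated, a policy can be read off by acting greedily with respect to the
# one-step lookahead of Eq. (2.3).
import numpy as np

def greedy_policy(P, R, V, gamma=0.95):
    """Return, for each state x, the action maximising the expected return."""
    Q = (P * (R + gamma * V)).sum(axis=2)   # Q[a, x]
    return Q.argmax(axis=0)                 # best action in each state
```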