Table 6.1 Domain characteristics for the truck backer-upper domain.

Initial conditions                          x = U[100, 150]
                                            y = U[−20, 20]
                                            θ_T = U[−1.0, 1.0] (radians)
                                            θ_C = U[−0.5, 0.5] (radians)
Actions                                     3: [0.0, ±1.0] (radians)
Rewards/penalties                           Achieve goal: +5
                                            Exit domain boundaries: −0.1
                                            Trailer-cab jack-knife: −0.1
                                            Large trailer angle: −0.1
                                            T_max exceeded: −0.1
Reward function per time step^a             r = 0.2 − 0.15 x^0.6 − 0.01 |y|^1.2 − 0.5 δθ_T^2
Goal tolerance^a                            d < 5, |δθ_T| < 0.5
Number of episodes                          10,000
Number of time steps per episode (T_max)    300
Performance time window (p_win)             300
conv_val                                    0.5
conv_rng                                    0.005
conv_m                                      1 × 10^4

^a δθ_T is θ_T wrapped to [−π, π]; d = √(x^2 + y^2)
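To make the table's quantities concrete, the following sketch shows one way the initial-state sampling, per-step reward, and goal test of Table 6.1 could be combined. The function names (sample_initial_state, step_reward, at_goal, wrap_angle) are illustrative only; the wrapping of δθ_T to [−π, π] follows the table footnote, and the −0.1 penalties for boundary exits, jack-knifing, large trailer angles, and exceeding T_max are assumed to be applied by the surrounding episode loop rather than here.

```python
import math
import random

# Goal tolerances and rewards from Table 6.1.
GOAL_DISTANCE = 5.0    # d < 5
GOAL_ANGLE = 0.5       # |delta_theta_T| < 0.5 rad
GOAL_REWARD = 5.0

def sample_initial_state():
    """Sample an episode's starting state from the uniform ranges in Table 6.1."""
    x = random.uniform(100.0, 150.0)
    y = random.uniform(-20.0, 20.0)
    theta_T = random.uniform(-1.0, 1.0)   # trailer orientation (radians)
    theta_C = random.uniform(-0.5, 0.5)   # cab angle (radians)
    return x, y, theta_T, theta_C

def wrap_angle(theta):
    """Wrap an angle into [-pi, pi], as assumed for delta_theta_T."""
    return (theta + math.pi) % (2.0 * math.pi) - math.pi

def at_goal(x, y, theta_T):
    """Goal test: within distance 5 of the dock at (0, 0) and within 0.5 rad of theta_T = 0."""
    d = math.hypot(x, y)                  # d = sqrt(x^2 + y^2)
    return d < GOAL_DISTANCE and abs(wrap_angle(theta_T)) < GOAL_ANGLE

def step_reward(x, y, theta_T):
    """Per-step reward: +5 at the goal, otherwise the shaping term
    (as reconstructed from Table 6.1)."""
    if at_goal(x, y, theta_T):
        return GOAL_REWARD
    d_theta = wrap_angle(theta_T)
    return 0.2 - 0.15 * abs(x) ** 0.6 - 0.01 * abs(y) ** 1.2 - 0.5 * d_theta ** 2
```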
boundaries of the domain in order to put these state values on approximately the same
scale as θ_T and θ_C. The hidden layer used a hyperbolic tangent transfer function and
the output layer used a linear transfer function. The input and hidden layers both had
bias nodes with constant values of +1. Network weights were initialized by sampling
from U[−0.1, 0.1].
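As a rough illustration of this architecture, the sketch below initialises and evaluates a single-hidden-layer network with a hyperbolic tangent hidden layer, a linear output layer, constant +1 bias nodes on the input and hidden layers, and weights drawn from U[−0.1, 0.1]. The hidden-layer size and the interpretation of the three outputs (one per discrete steering action) are placeholders, not values taken from the study.

```python
import numpy as np

def init_network(n_inputs, n_hidden, n_outputs, rng=np.random.default_rng()):
    """Initialise weights from U[-0.1, 0.1]; the extra column in each matrix
    multiplies the constant +1 bias node appended in forward()."""
    w_hidden = rng.uniform(-0.1, 0.1, size=(n_hidden, n_inputs + 1))   # +1 for input bias
    w_output = rng.uniform(-0.1, 0.1, size=(n_outputs, n_hidden + 1))  # +1 for hidden bias
    return w_hidden, w_output

def forward(state, w_hidden, w_output):
    """Forward pass: tanh hidden layer, linear output layer."""
    x = np.append(state, 1.0)            # input layer with constant +1 bias node
    h = np.tanh(w_hidden @ x)            # hyperbolic tangent transfer function
    h = np.append(h, 1.0)                # hidden layer bias node
    return w_output @ h                  # linear transfer function on the output

# Example: four scaled state inputs (x, y, theta_T, theta_C) and three outputs,
# e.g. one per discrete steering action; the hidden size of 10 is a placeholder.
w_h, w_o = init_network(n_inputs=4, n_hidden=10, n_outputs=3)
print(forward(np.array([0.5, -0.1, 0.2, 0.0]), w_h, w_o))
```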
Additional characteristics and parameter settings of the domain are shown in
Table 6.1 . Each episode began with the initial position and the orientation of the truck
sampled from relatively wide uniform distributions. The goal of this problem was to
position the truck at the loading dock (positioned at (0, 0)) in the correct orientation
such that its distance to the loading dock was less than 5, and the difference in its
orientation with respect to a neutral orientation (i.e., θ_T = 0) was less than 0.5. These
bounds are somewhat loose, though we could consider this initial learning procedure
to be a seed for subsequent training, and thus we are interested in learning general
knowledge about controlling the truck in this initial stage. When the truck reached the
goal within these tolerances, a reward of r = 5 was provided. When the truck was
outside of this region, a reward was provided based on a function of the trailer position
and orientation as specified in Table 6.1 . This reward function was conceived based
on its shape over the state variable space such that the reward is greater (i.e., positive