Fig. 6.1 The state of the truck is defined by the rear trailer position (x, y), the trailer angle θ_T, and the cab angle θ_C. The goal of the problem is to back the truck into the loading dock at (x, y) = (0, 0), where θ_T = 0.
θ_T = θ_T − arcsin( A · sin(θ_C) / L_T )
θ_C = θ_C + arcsin( v · sin(u) / (L_C + L_T) )

where A = v · cos(u), B = A · cos(θ_C), v = 3, L_T = 14 (trailer length), and L_C = 6 (cab length). The wheel angle relative to the cab angle is specified by u (radians), and three discrete actions were allowed: u = {−1, 0, 1}. The truck velocity was not taken into account, as backing the trailer is assumed to be a slow process. The truck was restricted to the domain boundaries x = [0, 200] and y = [−100, 100]. The goal of this problem was to have the trailer positioned at the loading dock with a specific orientation within a fixed number of time steps. This goal criterion can be represented as: x = 0, y = 0, and θ_T = 0.
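The state update above can be sketched as a small simulation step. The angle updates and the constants v = 3, L_T = 14, and L_C = 6 follow the equations in the text; the position update for (x, y) is not shown on this page, so the form used here (moving the trailer backwards along its heading by B per step) is an assumption.

```python
import math

# Constants from the text: velocity, trailer length, cab length.
V, L_T, L_C = 3.0, 14.0, 6.0

def step(x, y, theta_t, theta_c, u):
    """Advance the truck state one time step for wheel angle u (radians)."""
    a = V * math.cos(u)
    b = a * math.cos(theta_c)
    # Assumed position update: the rear trailer position moves backwards
    # along the trailer heading (only the angle updates appear in the text).
    x_new = x - b * math.cos(theta_t)
    y_new = y - b * math.sin(theta_t)
    theta_t_new = theta_t - math.asin(a * math.sin(theta_c) / L_T)
    theta_c_new = theta_c + math.asin(V * math.sin(u) / (L_C + L_T))
    return x_new, y_new, theta_t_new, theta_c_new
```

With the wheels straight (u = 0) and both angles at zero, the truck simply backs up along the x-axis by v = 3 per step, which is a quick sanity check on the equations.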
6.1 Reinforcement Learning Implementation
The truck backer-upper problem can be viewed as a reinforcement learning problem
where each learning run attempts to back up the truck from some initial state to a
goal state, and where many learning runs are used to learn how to control the truck
at different locations and orientations throughout the domain. More specifically, the
truck begins at a random location and orientation, and the wheels of the truck are
controlled to back up the truck to a specific location and orientation. When the truck
reaches the goal, positive feedback is provided, indicating that the control strategy
in that learning run was good, whereas if the truck does not reach the goal for some
reason, negative feedback is provided.
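The terminal feedback described above can be sketched as a simple reward function. The position and angle tolerances below are illustrative assumptions, not values given in the text, which only states that success yields positive feedback and failure negative feedback.

```python
import math

# Assumed tolerances for "reaching the goal"; the text does not specify them.
GOAL_TOL_POS, GOAL_TOL_ANGLE = 1.0, 0.1

def terminal_reward(x, y, theta_t, out_of_bounds, out_of_time):
    """Feedback at the end of a learning run, as described in the text."""
    if math.hypot(x, y) < GOAL_TOL_POS and abs(theta_t) < GOAL_TOL_ANGLE:
        return 1.0   # goal reached: positive feedback
    if out_of_bounds or out_of_time:
        return -1.0  # failed run: negative feedback
    return 0.0       # episode still in progress
```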
The temporal difference algorithm TD(λ) (Sect. 2.2.3) was used to train a three-layer neural network to learn the value function V(x, a) that approximates the value of being in state x and taking action a at the current time step. The neural network had four inputs, corresponding to the four state variables. This is a control-type problem that naturally lends itself to using a neural network with an output node for each of the possible actions, and thus the neural network had three output nodes. The number of nodes in the hidden layer was a variable in the experimental design.
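The network shape described above (four state inputs, a hidden layer of configurable size, one output per discrete action) can be sketched as follows. The tanh activation, the weight initialization range, and the plain forward pass are assumptions; the text only fixes the layer sizes.

```python
import math
import random

N_IN, N_OUT = 4, 3  # four state variables in, one value estimate per action out

def init_net(n_hidden, seed=0):
    """Randomly initialize a three-layer network with n_hidden hidden nodes."""
    rng = random.Random(seed)
    w1 = [[rng.uniform(-0.5, 0.5) for _ in range(N_IN)] for _ in range(n_hidden)]
    w2 = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden)] for _ in range(N_OUT)]
    return w1, w2

def forward(net, state):
    """Return one value estimate per action for a 4-element state vector."""
    w1, w2 = net
    hidden = [math.tanh(sum(w * s for w, s in zip(row, state))) for row in w1]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w2]
```

Because the hidden-layer size was an experimental variable, `n_hidden` is left as a parameter rather than fixed.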
The x and y components of the state vector were scaled over [−3, 3] based on the