7.1 Reinforcement Learning Implementation
The tandem trailer problem can be formulated as a reinforcement learning problem
just as with the single trailer problem. The trailer truck must be backed up from
a random initial position and orientation by controlling the orientation of the front
wheels of the cab. When the truck reaches the goal (within some tolerance), a positive
reward is provided, indicating that the actions in that episode were useful. Alternatively,
when the truck does not reach the goal, negative feedback is provided, indicating that
the actions in that episode were not beneficial. The truck may fail to reach the goal for a
number of reasons: it may jack-knife at either hinge, the maximum number of time
steps may be reached, or state-space constraints may be violated (i.e., the truck is
driven far outside of an acceptable space).
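In outline, this episodic trial-and-error scheme follows the standard reinforcement learning interaction loop: act, observe, and receive terminal feedback that labels the episode's actions as useful or not. The sketch below is purely illustrative; the stub environment is a toy stand-in (not the truck simulator), and all names and dynamics are assumptions, not the authors' implementation.

```python
import random

# Hypothetical illustration of the episodic feedback scheme described above.
# StubEnv is a trivial stand-in environment, not the tandem trailer simulator.

class StubEnv:
    """Minimal episodic environment: reach state 0 for a positive reward."""
    def reset(self):
        self.pos = random.randint(1, 5)   # random initial state
        return self.pos

    def step(self, action):
        self.pos = max(0, self.pos - action)   # action moves toward the goal
        if self.pos == 0:
            return self.pos, 1.0, True         # goal reached: positive reward
        return self.pos, 0.0, False            # episode continues

def run_episode(env, policy, max_steps=20):
    """Run one episode; terminal feedback indicates whether actions were useful."""
    state = env.reset()
    for _ in range(max_steps):
        state, reward, done = env.step(policy(state))
        if done:
            return reward       # positive: this episode's actions were useful
    return -0.3                 # time-step limit exceeded: negative feedback
```

A policy that always steers toward the goal earns the positive reward, while a policy that never acts times out and receives the penalty, mirroring how episodes label behavior in the truck problem.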
Table 7.1 provides specific problem characteristics and settings for this problem.
At the beginning of each episode, the position of the rear of the truck was set to
(x, y) = (0, 0). The cab angle θ₀ was randomly initialized over U[−0.5, 0.5],
and the other trailer angles θ₂ and θ₄ were also set to this angle. This initial state
amounts to the truck being in a random orientation, but with the cab-trailer and
trailer-trailer hinges straight. The front wheels of the cab could be oriented in nine
different directions, which allows for much greater control compared to the TBU
problem. However, the greater number of actions also means there are more choices,
which increases the complexity of learning a correct control strategy.
The reward scheme for this problem is similar to that of the TBU problem. How-
ever, as we take the view that this problem could be solved in multiple learning
phases, we are initially interested only in learning very general behavior. We there-
fore specify that the goal of this first learning phase be only to back the truck to a
specific location. When the trailer truck reached the goal location at (x, y) = (50, 10),
within a rather loose tolerance of d = √((x − 50)² + (y − 10)²) ≤ 10, a reward of
+1 was provided as positive feedback. While this is somewhat removed from requiring that the
path of the truck be constrained by physical barriers or that the truck be in a specific
orientation at the goal, the goal used herein requires learning actions that prevent
the tandem trailer truck from jack-knifing at either hinge, which is arguably one of
the most difficult parts of this task.
If, during an episode, the trailer truck jack-knifed, exited the domain boundaries,
or the maximum number of time steps was exceeded, feedback of −0.3 was provided.
The angles between the cab and the first trailer and between the first and second trailers, θ₁ and
θ₃, respectively, were computed at every time step, and if either of these angles fell
outside of [−2, 2], the truck had jack-knifed and the episode was terminated. The
domain boundaries were set to x, y ∈ [−100, 100], and if at any point the truck exited
these boundaries, the episode was also terminated. The reward and penalty values
were chosen somewhat by trial and error during initial domain development,
but also by requiring that (naturally) the reward be positive and the penalties
be negative, and that the magnitude of the reward (which occurs less frequently) be
greater than that of the penalties.
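Under these settings, the episode initialization and terminal-feedback logic can be sketched as follows. This is a hypothetical illustration, not the authors' implementation: the function and constant names are assumptions, and the numeric values simply restate the start state, goal tolerance, jack-knife limits, domain boundaries, and reward/penalty magnitudes given above.

```python
import math
import random

# Hypothetical sketch of the episode setup and reward scheme described in the
# text; names and structure are assumptions, not the authors' code.

GOAL = (50.0, 10.0)       # goal location for the rear of the truck
GOAL_TOLERANCE = 10.0     # rather loose distance tolerance d
JACKKNIFE_LIMIT = 2.0     # hinge angles theta_1, theta_3 must stay in [-2, 2]
BOUND = 100.0             # domain boundaries: x, y in [-100, 100]
N_ACTIONS = 9             # nine discrete front-wheel orientations

def initial_state():
    """Start at the origin in a random orientation with straight hinges."""
    theta0 = random.uniform(-0.5, 0.5)   # cab angle ~ U[-0.5, 0.5]
    # Trailer angles theta_2 and theta_4 match the cab angle, so the
    # cab-trailer and trailer-trailer hinges start out straight.
    return {"x": 0.0, "y": 0.0,
            "theta0": theta0, "theta2": theta0, "theta4": theta0}

def reward_and_done(x, y, theta1, theta3, t, max_steps):
    """Terminal reward: +1 at the goal, -0.3 on any failure, else 0."""
    d = math.hypot(x - GOAL[0], y - GOAL[1])
    if d <= GOAL_TOLERANCE:
        return 1.0, True                       # goal reached: positive reward
    jackknifed = (abs(theta1) > JACKKNIFE_LIMIT
                  or abs(theta3) > JACKKNIFE_LIMIT)
    out_of_bounds = abs(x) > BOUND or abs(y) > BOUND
    if jackknifed or out_of_bounds or t >= max_steps:
        return -0.3, True                      # failed episode: penalty
    return 0.0, False                          # episode continues
```

Note how the reward magnitude (+1) exceeds the penalty magnitude (0.3), reflecting the design requirement that the rarer goal reward outweigh the more frequent failure penalties.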