7.1 Reinforcement Learning Implementation
The tandem trailer problem can be formulated as a reinforcement learning problem
just as with the single trailer problem. The trailer truck must be backed up from
a random initial position and orientation by controlling the orientation of the front
wheels of the cab. When the truck reaches the goal (within some tolerance), a positive
reward is provided, indicating that the actions in that episode were useful. Alternatively,
when the truck does not reach the goal, negative feedback is provided, indicating that
the actions in that episode were not beneficial. The truck may fail to reach the goal for a
number of reasons: it may jack-knife at either hinge, the maximum number of time
steps may be reached, or state-space constraints may be violated (i.e., the truck is
driven far outside of an acceptable space).
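In outline, this episodic trial-and-error scheme follows the standard reinforcement learning interaction loop: act, observe, and receive terminal feedback that labels the episode's actions as useful or not. The sketch below is purely illustrative; the stub environment is a toy stand-in (not the truck simulator), and all names and dynamics are assumptions, not the authors' implementation.

```python
import random

# Hypothetical illustration of the episodic feedback scheme described above.
# StubEnv is a trivial stand-in environment, not the tandem trailer simulator.

class StubEnv:
    """Minimal episodic environment: reach state 0 for a positive reward."""
    def reset(self):
        self.pos = random.randint(1, 5)   # random initial state
        return self.pos

    def step(self, action):
        self.pos = max(0, self.pos - action)   # action moves toward the goal
        if self.pos == 0:
            return self.pos, 1.0, True         # goal reached: positive reward
        return self.pos, 0.0, False            # episode continues

def run_episode(env, policy, max_steps=20):
    """Run one episode; terminal feedback indicates whether actions were useful."""
    state = env.reset()
    for _ in range(max_steps):
        state, reward, done = env.step(policy(state))
        if done:
            return reward       # positive: this episode's actions were useful
    return -0.3                 # time-step limit exceeded: negative feedback
```

A policy that always steers toward the goal earns the positive reward, while a policy that never acts times out and receives the penalty, mirroring how episodes label behavior in the truck problem.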
Table 7.1 provides specific problem characteristics and settings for this problem.
At the beginning of each episode, the position of the rear of the truck was set to
(x, y) = (0, 0). The cab angle θ₀ was randomly initialized over U[−0.5, 0.5],
and the other trailer angles θ₂ and θ₄ were also set to this angle. This initial state
amounts to the truck being in a random orientation, but with the cab-trailer and
trailer-trailer hinges straight. The front wheels of the cab could be oriented in nine
different directions, which allows for much greater control compared to the TBU
problem. However, the greater number of actions also means there are more choices,
which increases the complexity of learning a correct control strategy.
The reward scheme for this problem is similar to that of the TBU problem. How-
ever, as we take the view that this problem could be solved in multiple learning
phases, we are initially interested only in learning very general behavior. We there-
fore specify that the goal of this first learning phase be only to back the truck to a
specific location. When the trailer truck reached the goal location at (x, y) = (50, 10),
within a rather loose tolerance of d = √((x − 50)² + (y − 10)²) ≤ 10, a reward of
+1 was provided as positive feedback. While this is somewhat removed from requiring that the
path of the truck be constrained by physical barriers or that the truck be in a specific
orientation at the goal, the goal used herein requires learning actions that prevent
the tandem trailer truck from jack-knifing at either hinge, which is arguably one of
the most difficult parts of this task.
If, during an episode, the trailer truck jack-knifed, exited the domain boundaries,
or the maximum number of time steps was exceeded, feedback of −0.3 was provided.
The angles between the cab and the first trailer and between the first and second trailers, θ₁ and
θ₃, respectively, were computed at every time step, and if either of these angles fell
outside of [−2, 2], the truck had jack-knifed and the episode was terminated. The
domain boundaries were set to x, y ∈ [−100, 100], and if at any point the truck exited
these boundaries, the episode was also terminated. The reward and penalty values
were chosen somewhat by trial and error during initial domain development,
but also by requiring that (naturally) the reward be positive and the penalties
be negative, and that the magnitude of the reward (which occurs less frequently) be
greater than that of the penalties.
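Under these settings, the episode initialization and terminal-feedback logic can be sketched as follows. This is a hypothetical illustration, not the authors' implementation: the function and constant names are assumptions, and the numeric values simply restate the start state, goal tolerance, jack-knife limits, domain boundaries, and reward/penalty magnitudes given above.

```python
import math
import random

# Hypothetical sketch of the episode setup and reward scheme described in the
# text; names and structure are assumptions, not the authors' code.

GOAL = (50.0, 10.0)       # goal location for the rear of the truck
GOAL_TOLERANCE = 10.0     # rather loose distance tolerance d
JACKKNIFE_LIMIT = 2.0     # hinge angles theta_1, theta_3 must stay in [-2, 2]
BOUND = 100.0             # domain boundaries: x, y in [-100, 100]
N_ACTIONS = 9             # nine discrete front-wheel orientations

def initial_state():
    """Start at the origin in a random orientation with straight hinges."""
    theta0 = random.uniform(-0.5, 0.5)   # cab angle ~ U[-0.5, 0.5]
    # Trailer angles theta_2 and theta_4 match the cab angle, so the
    # cab-trailer and trailer-trailer hinges start out straight.
    return {"x": 0.0, "y": 0.0,
            "theta0": theta0, "theta2": theta0, "theta4": theta0}

def reward_and_done(x, y, theta1, theta3, t, max_steps):
    """Terminal reward: +1 at the goal, -0.3 on any failure, else 0."""
    d = math.hypot(x - GOAL[0], y - GOAL[1])
    if d <= GOAL_TOLERANCE:
        return 1.0, True                       # goal reached: positive reward
    jackknifed = (abs(theta1) > JACKKNIFE_LIMIT
                  or abs(theta3) > JACKKNIFE_LIMIT)
    out_of_bounds = abs(x) > BOUND or abs(y) > BOUND
    if jackknifed or out_of_bounds or t >= max_steps:
        return -0.3, True                      # failed episode: penalty
    return 0.0, False                          # episode continues
```

Note how the reward magnitude (+1) exceeds the penalty magnitude (0.3), reflecting the design requirement that the rarer goal reward outweigh the more frequent failure penalties.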