Table 6.1 Domain characteristics for the truck backer-upper domain.

Initial conditions                          x = U[100, 150]
                                            y = U[−20, 20]
                                            θ_T = U[−1.0, 1.0] (radians)
                                            θ_C = U[−0.5, 0.5] (radians)
Actions                                     3: [0.0, ±1.0] (radians)
Rewards/penalties                           Achieve goal: +5
                                            Exit domain boundaries: −0.1
                                            Trailer-cab jack-knife: −0.1
                                            Large trailer angle: −0.1
                                            T_max exceeded: −0.1
Reward function per time step^a             r = 0.2 − 0.15 x^0.6 − 0.01 |y|^1.2 − 0.5 δθ_T^2
Goal tolerance^a                            d < 5, |δθ_T| < 0.5
Number of episodes                          10,000
Number of time steps per episode (T_max)    300
Performance time window (p_win)             300
conv_val                                    0.5
conv_rng                                    0.005
conv_m                                      1 × 10^4

^a δθ_T is θ_T wrapped to [−π, π]; d = √(x^2 + y^2)
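To make the table's quantities concrete, the following sketch shows one way the initial-state sampling, per-step reward, and goal test of Table 6.1 could be combined. The function names (sample_initial_state, step_reward, at_goal, wrap_angle) are illustrative only; the wrapping of δθ_T to [−π, π] follows the table footnote, and the −0.1 penalties for boundary exits, jack-knifing, large trailer angles, and exceeding T_max are assumed to be applied by the surrounding episode loop rather than here.

```python
import math
import random

# Goal tolerances and rewards from Table 6.1.
GOAL_DISTANCE = 5.0    # d < 5
GOAL_ANGLE = 0.5       # |delta_theta_T| < 0.5 rad
GOAL_REWARD = 5.0

def sample_initial_state():
    """Sample an episode's starting state from the uniform ranges in Table 6.1."""
    x = random.uniform(100.0, 150.0)
    y = random.uniform(-20.0, 20.0)
    theta_T = random.uniform(-1.0, 1.0)   # trailer orientation (radians)
    theta_C = random.uniform(-0.5, 0.5)   # cab angle (radians)
    return x, y, theta_T, theta_C

def wrap_angle(theta):
    """Wrap an angle into [-pi, pi], as assumed for delta_theta_T."""
    return (theta + math.pi) % (2.0 * math.pi) - math.pi

def at_goal(x, y, theta_T):
    """Goal test: within distance 5 of the dock at (0, 0) and within 0.5 rad of theta_T = 0."""
    d = math.hypot(x, y)                  # d = sqrt(x^2 + y^2)
    return d < GOAL_DISTANCE and abs(wrap_angle(theta_T)) < GOAL_ANGLE

def step_reward(x, y, theta_T):
    """Per-step reward: +5 at the goal, otherwise the shaping term
    (as reconstructed from Table 6.1)."""
    if at_goal(x, y, theta_T):
        return GOAL_REWARD
    d_theta = wrap_angle(theta_T)
    return 0.2 - 0.15 * abs(x) ** 0.6 - 0.01 * abs(y) ** 1.2 - 0.5 * d_theta ** 2
```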
boundaries of the domain in order to put these state values on approximately the same
scale as θ_T and θ_C. The hidden layer used a hyperbolic tangent transfer function and
the output layer used a linear transfer function. The input and hidden layers both had
bias nodes with constant values of +1. Network weights were initialized by sampling
from U[−0.1, 0.1].
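As a rough illustration of this architecture, the sketch below initialises and evaluates a single-hidden-layer network with a hyperbolic tangent hidden layer, a linear output layer, constant +1 bias nodes on the input and hidden layers, and weights drawn from U[−0.1, 0.1]. The hidden-layer size and the interpretation of the three outputs (one per discrete steering action) are placeholders, not values taken from the study.

```python
import numpy as np

def init_network(n_inputs, n_hidden, n_outputs, rng=np.random.default_rng()):
    """Initialise weights from U[-0.1, 0.1]; the extra column in each matrix
    multiplies the constant +1 bias node appended in forward()."""
    w_hidden = rng.uniform(-0.1, 0.1, size=(n_hidden, n_inputs + 1))   # +1 for input bias
    w_output = rng.uniform(-0.1, 0.1, size=(n_outputs, n_hidden + 1))  # +1 for hidden bias
    return w_hidden, w_output

def forward(state, w_hidden, w_output):
    """Forward pass: tanh hidden layer, linear output layer."""
    x = np.append(state, 1.0)            # input layer with constant +1 bias node
    h = np.tanh(w_hidden @ x)            # hyperbolic tangent transfer function
    h = np.append(h, 1.0)                # hidden layer bias node
    return w_output @ h                  # linear transfer function on the output

# Example: four scaled state inputs (x, y, theta_T, theta_C) and three outputs,
# e.g. one per discrete steering action; the hidden size of 10 is a placeholder.
w_h, w_o = init_network(n_inputs=4, n_hidden=10, n_outputs=3)
print(forward(np.array([0.5, -0.1, 0.2, 0.0]), w_h, w_o))
```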
Additional characteristics and parameter settings of the domain are shown in
Table 6.1 . Each episode began with the initial position and the orientation of the truck
sampled from relatively wide uniform distributions. The goal of this problem was to
position the truck at the loading dock (positioned at (0, 0)) in the correct orientation
such that its distance to the loading dock was less than 5, and the difference in its
orientation with respect to a neutral orientation (i.e., θ_T = 0) was less than 0.5. These
bounds are somewhat loose, though we could consider this initial learning procedure
to be a seed for subsequent training, and thus we are interested in learning general
knowledge about controlling the truck in this initial stage. When the truck reached the
goal within these tolerances, a reward of r = 5 was provided. When the truck was
outside of this region, a reward was provided based on a function of the trailer position
and orientation as specified in Table 6.1 . This reward function was conceived based
on its shape over the state variable space such that the reward is greater (i.e., positive