\[
\pi(s) = \arg\max_{a \in A} \sum_{i=1}^{m} Q_i(s_i, a),
\]
which selects the action that maximizes the sum of the local agent Q-values.
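As an illustration, here is a minimal Python sketch of this greatest-mass action selection, assuming each module exposes a tabular Q-function indexed by its local state; the function and variable names are ours, not from the original text.

```python
import numpy as np

def combined_policy(q_tables, local_states, num_actions):
    """Pick the action that maximizes the sum of the modules' local Q-values,
    i.e. a = argmax_a sum_i Q_i(s_i, a)."""
    totals = np.zeros(num_actions)
    for q_i, s_i in zip(q_tables, local_states):
        totals += q_i[s_i]        # row of Q-values over all actions for local state s_i
    return int(np.argmax(totals))
```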
An additional advantage of using separate modules is that some sub-tasks (those that can be learned independently of the other agents' policies) can be trained separately, which reduces the exploration space because the state space shrinks to the module's state subspace $|s_i|$. This reduction often makes it possible to use exploring starts [11], which ensure a better exploration of the state space.
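For illustration, the following is a sketch of one tabular Q-learning episode with exploring starts over a module's reduced state subspace; the `reset_to` and `step` hooks and the `defaultdict`-backed Q-table are our assumptions, not part of the original text.

```python
import random
from collections import defaultdict

def episode_with_exploring_start(env, q, states, actions,
                                 alpha=0.1, gamma=0.9, epsilon=0.1):
    """Run one Q-learning episode that begins from a uniformly random
    (state, action) pair, so that every pair keeps being explored."""
    s, a = random.choice(states), random.choice(actions)   # exploring start
    env.reset_to(s)                                         # assumed hook
    done = False
    while not done:
        s_next, r, done = env.step(a)                       # assumed interface
        target = r + (0.0 if done else gamma * max(q[(s_next, b)] for b in actions))
        q[(s, a)] += alpha * (target - q[(s, a)])
        s = s_next
        a = random.choice(actions) if random.random() < epsilon \
            else max(actions, key=lambda b: q[(s, b)])

# usage sketch: q = defaultdict(float); episode_with_exploring_start(module_env, q, module_states, module_actions)
```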
Two different types of tasks can be distinguished in L-MCRS: those trying to reach one or more goals and those satisfying the physical constraints imposed by the linking element, so as to avoid hard-to-control undesired effects (e.g., the force exerted by the hose if stretched). To cope with the different nature of these two module types, we propose distinguishing goal modules and constraint modules, while keeping the structure coherent with that already represented in Figure 1. Constraint modules are expected to learn which actions are not to be carried out in a given state, while goal modules learn how to reach the goal without even being aware of the constraints. We use positive rewards in the goal-oriented modules whenever the goal is reached, negative rewards in the constraint modules whenever a constraint is not respected, and neutral rewards otherwise.
The discount-rate parameter γ differs between the two types of learning modules: goal-oriented modules use a rather high γ, so that the goal reward propagates back through the state space, while constraint-type modules work best with a low value, so that they can learn on a one-step basis.
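A minimal sketch of how the two module types could be configured under this reward scheme follows; the concrete reward magnitudes and discount factors are illustrative choices of ours, not values given in the text.

```python
from collections import defaultdict

class LearningModule:
    """Tabular Q-learning module; goal and constraint modules differ only in
    their reward signal and discount factor."""
    def __init__(self, gamma, alpha=0.1):
        self.q = defaultdict(float)
        self.gamma, self.alpha = gamma, alpha

    def reward(self, event):
        raise NotImplementedError

    def update(self, s, a, r, s_next, actions):
        best_next = max(self.q[(s_next, b)] for b in actions)
        self.q[(s, a)] += self.alpha * (r + self.gamma * best_next - self.q[(s, a)])

class GoalModule(LearningModule):
    def __init__(self):
        super().__init__(gamma=0.9)                  # far-sighted: propagates the goal reward
    def reward(self, goal_reached):
        return 1.0 if goal_reached else 0.0          # positive only when the goal is reached

class ConstraintModule(LearningModule):
    def __init__(self):
        super().__init__(gamma=0.0)                  # one-step: learns immediate violations
    def reward(self, constraint_violated):
        return -1.0 if constraint_violated else 0.0  # negative when a constraint is broken
```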
2.2 Veto-Based Action Selection
To enforce the constraints, we propose a simple veto system, allowing constraint modules to impose a veto on actions that have broken constraints in the past.
in the past. A boolean vector
c
V i ∈{
true, f alse
}
is defined for each Module
Mediator as follows:
true
if
Q i (
s i ,a j )
<v t
V i (
a j )=
,
false
otherwise
where i
, ..., c and v t is the threshold for imposing the veto. This
means that if, under state s i , a constraint module i has a value Q i (
=1
, ..., n , j
=1
s i ,a j )
below
the veto threshold, action a j
greedy
is then used to select an action among the available ones, assuring this way that
taking a random action will not result in an abrupt failure once the modules have
a minimum amount of experience, avoiding that way the main problem towards
appropriate learning. Instead of manually tweaking the parameter, we obtain
a simple approach to keep goal-focused exploration while avoiding constraint-
related termination conditions. Of course, the downside to this technique is that
it only works properly with deterministic constraints because exploration is dis-
abled for states found to be undesired in the past.
is forbidden for that state. Traditional
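Below is a minimal sketch of the veto mechanism combined with $\varepsilon$-greedy selection over the remaining actions, assuming tabular Q-values and ranking only by the goal modules' summed Q-values; the threshold value and the fallback when every action is vetoed are our assumptions, not part of the original text.

```python
import random

def compute_vetoes(constraint_qs, constraint_states, actions, v_t):
    """V_i(a_j) is true whenever Q_i(s_i, a_j) < v_t; collect all vetoed actions."""
    vetoed = set()
    for q_i, s_i in zip(constraint_qs, constraint_states):
        vetoed.update(a for a in actions if q_i[(s_i, a)] < v_t)
    return vetoed

def select_action(goal_qs, goal_states, constraint_qs, constraint_states,
                  actions, v_t=-0.5, epsilon=0.1):
    """Epsilon-greedy over the non-vetoed actions, ranked by the summed
    Q-values of the goal modules (greatest mass)."""
    vetoed = compute_vetoes(constraint_qs, constraint_states, actions, v_t)
    allowed = [a for a in actions if a not in vetoed] or list(actions)  # assumed fallback if all vetoed
    if random.random() < epsilon:
        return random.choice(allowed)          # random exploration cannot pick a vetoed action
    return max(allowed, key=lambda a: sum(q[(s, a)] for q, s in zip(goal_qs, goal_states)))
```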
 