agent can take. For a discretely perceived world state $s \in S$, the $i$-th module is fed with a subset $s_i \subseteq s$ which is expected to be relevant to achieve its goal. For the sake of simplicity, all notation will refer to structures and functions existing in all agents, and superscripts indicating the agent index will not be added. Let us denote as $Q_i(s_i, a)$ the Q matrix entry of the $i$-th module relating its partial state $s_i$ and action $a$. Termination conditions are defined as $k$ subsets $S_t^i \subseteq S$, where $i = 1, \ldots, k$, satisfying $\bigcup_{j=1}^{k} S_t^j = S_t \subseteq S$.
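To make the notation concrete, the sketch below shows one possible way of holding a module's Q matrix $Q_i(s_i, a)$ indexed by the partial state and the action. It is a minimal illustration assuming a dictionary-backed table; the class name `QModule`, the `relevant_vars` projection and all other identifiers are assumptions made for the example, not part of the original formulation.

```python
from collections import defaultdict

class QModule:
    """One learning module: a Q table over its own partial view of the state."""

    def __init__(self, relevant_vars, actions):
        self.relevant_vars = relevant_vars  # indices of the state variables forming s_i
        self.actions = actions              # shared discrete action set
        self.q = defaultdict(float)         # Q_i(s_i, a), lazily initialised to 0

    def extract(self, s):
        """Project the full discrete state s onto this module's partial state s_i."""
        return tuple(s[v] for v in self.relevant_vars)

    def value(self, s, a):
        """Return Q_i(s_i, a) for the partial state derived from s."""
        return self.q[(self.extract(s), a)]
```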
2.1 Reward and Termination Functions
Most monolithic Q-Learning systems involve the construction of modeling functions that can be expressed as a series of IF-rules: a reward function $r : S \to \mathbb{R}$ and a termination function $t : S \to \{\mathit{true}, \mathit{false}\}$:

$$
r(s) = \begin{cases}
r_1 & \text{if } s \in S_t^1 \\
\vdots & \vdots \\
r_k & \text{if } s \in S_t^k \\
r_0 & \text{else}
\end{cases}, \qquad
t(s) = \begin{cases}
t_1 & \text{if } s \in S_t^1 \\
\vdots & \vdots \\
t_k & \text{if } s \in S_t^k \\
\mathit{false} & \text{else}
\end{cases},
$$

where $S_t = \bigcup_{i=1}^{k} S_t^i$, and $r_0$ represents the selected neutral reward.
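As a rough illustration of this IF-rule structure, the following sketch scans the termination subsets $S_t^1, \ldots, S_t^k$ in order; the list-of-sets representation, the example states and the names `terminal_subsets` and `R0` are hypothetical choices made only for the example.

```python
# Hypothetical data: k termination subsets S_t^i with their rewards r_i and flags t_i.
terminal_subsets = [
    ({("goal",)}, 1.0, True),    # (S_t^1, r_1, t_1)
    ({("crash",)}, -1.0, True),  # (S_t^2, r_2, t_2)
]
R0 = 0.0  # selected neutral reward r_0

def reward(s):
    """Monolithic reward r(s): first matching IF-rule wins, otherwise r_0."""
    for subset, r_i, _ in terminal_subsets:
        if s in subset:
            return r_i
    return R0

def terminates(s):
    """Monolithic termination t(s): true only inside some S_t^i."""
    for subset, _, t_i in terminal_subsets:
        if s in subset:
            return t_i
    return False
```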
The modular approach decomposes these functions into $m$ reward functions $r_i : S \to \mathbb{R}$ and $m$ termination functions $t_i : S \to \{\mathit{true}, \mathit{false}\}$:

$$
r_i(s) = \begin{cases}
r_i & \text{if } s_i \in S_t^i \\
r_0 & \text{else}
\end{cases}, \qquad
t_i(s) = \begin{cases}
t_i & \text{if } s_i \in S_t^i \\
\mathit{false} & \text{else}
\end{cases}.
$$
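Under this decomposition each module tests only its own partial state against its own termination subset. A minimal sketch of one module's reward and termination pair, with hypothetical field names and defaults chosen for the example:

```python
class ModuleTask:
    """Reward r_i(s) and termination t_i(s) for one module, evaluated on s_i only."""

    def __init__(self, terminal_partial_states, r_i, t_i=True, r_0=0.0):
        self.terminal_partial_states = terminal_partial_states  # S_t^i as partial states
        self.r_i, self.t_i, self.r_0 = r_i, t_i, r_0

    def reward(self, s_i):
        # r_i(s) = r_i if s_i is in S_t^i, else the neutral reward r_0
        return self.r_i if s_i in self.terminal_partial_states else self.r_0

    def terminates(self, s_i):
        # t_i(s) = t_i if s_i is in S_t^i, else false
        return self.t_i if s_i in self.terminal_partial_states else False
```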
Fig. 1. Modular Q-Learning Scheme
The typical approach to learn different sub-tasks or behaviors concurrently involves using a Module Mediator (also referred to as Module Arbiter in the literature) responsible for action selection, as represented in Figure 1. Each module has its own Q matrix representing its partial knowledge of the world state $s_i$, and modules may even compete, imposing their preferences on the rest. To select the next action we will follow the Greatest Mass (GM) strategy [12], defined as:
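A commonly used formulation of Greatest Mass selects the action whose Q-values, summed over all modules, are largest, $a^{*} = \arg\max_{a} \sum_{i} Q_i(s_i, a)$. The sketch below assumes that formulation and reuses the hypothetical `QModule` class from the earlier example; it is illustrative only.

```python
def greatest_mass_action(modules, s):
    """Module mediator: pick argmax_a of the summed module Q-values (GM strategy)."""
    def mass(a):
        return sum(m.value(s, a) for m in modules)
    # All modules are assumed to share the same discrete action set.
    return max(modules[0].actions, key=mass)
```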
 