about taking action a in state s (described by a real number Q(s, a)) using the following expression:

Q(s, a) ← Q(s, a) + α [ r(s, a) + γ · max_{a'} Q(s', a') − Q(s, a) ]        (1)

where α ∈ [0, 1] is the agent's learning speed and γ ∈ [0, 1] is a discount-rate parameter that weights earlier rewards higher than later ones. All those Q-values are usually stored in a look-up table, and each of the entries represents how good an action has been in a state.
The Q-Learning algorithm updates entries as soon as an action is taken, based on the estimated value of the new state observed; as can be seen in equation 1, this estimation is based on the best Q-value available in state s'. An action selection algorithm must be defined so that an action is selected according to the observed state at every time-step. While the straightforward approach (greedy action selection) involves always selecting the action with the highest Q-value, exploiting the available knowledge, it prevents the agent from visiting yet unknown action-state pairs and thus from discovering possibly better actions. This compromise between exploration and exploitation is usually solved using an ε-greedy algorithm (a random action is selected with probability ε while the best action is chosen with probability 1 − ε). Episodic tasks are terminated either when a terminal state s ∈ S_t is reached or when the number of steps reaches a predefined maximum step count.
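For concreteness, a minimal sketch in Python of the tabular update in equation 1 together with ε-greedy action selection could look as follows; the parameter values, the dictionary-based table and the function names are illustrative assumptions, not the implementation used in this work.

import random
from collections import defaultdict

# Illustrative parameter values (assumptions, not those used in the experiments).
ALPHA = 0.1    # learning speed, alpha in [0, 1]
GAMMA = 0.9    # discount rate, gamma in [0, 1]
EPSILON = 0.1  # exploration probability of the epsilon-greedy policy

# Look-up table: Q[(state, action)] estimates how good 'action' is in 'state'.
Q = defaultdict(float)

def select_action(state, actions):
    # Epsilon-greedy: take a random action with probability EPSILON,
    # otherwise exploit the action with the highest Q-value.
    if random.random() < EPSILON:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def update(state, action, reward, next_state, actions):
    # Equation (1): move Q(s, a) towards r(s, a) + gamma * max_a' Q(s', a').
    best_next = max(Q[(next_state, a)] for a in actions)
    Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])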
The main drawback of using Q-Learning with explicit look-up tables is the so-called curse of dimensionality: as the state space increases, the size needed to store the Q matrix grows so fast that the approach easily becomes impractical for real applications. To deal with this problem in the L-MCRS control learning, we explore the concurrent learning approach [10], which simultaneously uses multiple RL modules, assigning each of them a sub-task to carry out. Each module represents the world as a subspace of the sensorial information available to the agent (ideally, it only receives the state input needed to properly learn its sub-task), and all modules are able to learn concurrently. We use local rewards. Regarding the coordination of agents [1], we will assume that robots receive permission to move in a round-robin schedule, that is, an explicit order is defined and robots make discrete moves in turns, so each robot can consider the remaining robots as static during its turn.
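As an illustration of this modular scheme, a module could be sketched as a small tabular learner that only sees its own slice of the observation and is trained on its local reward; every class, method and parameter name below is a hypothetical example, not the authors' implementation.

import random
from collections import defaultdict

class QModule:
    # One concurrent RL module: it learns its sub-task over a subspace of the
    # sensorial information and is updated from its own local reward.
    def __init__(self, state_filter, actions, alpha=0.1, gamma=0.9, epsilon=0.1):
        self.state_filter = state_filter  # projects the full observation onto the module's subspace
        self.actions = actions
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon
        self.Q = defaultdict(float)

    def select_action(self, observation):
        s = self.state_filter(observation)
        if random.random() < self.epsilon:
            return random.choice(self.actions)
        return max(self.actions, key=lambda a: self.Q[(s, a)])

    def update(self, observation, action, local_reward, next_observation):
        s, s_next = self.state_filter(observation), self.state_filter(next_observation)
        best_next = max(self.Q[(s_next, a)] for a in self.actions)
        self.Q[(s, action)] += self.alpha * (
            local_reward + self.gamma * best_next - self.Q[(s, action)]
        )

# Usage sketch: a robot runs several modules concurrently, each one receiving
# only the part of the observation relevant to its sub-task (the slices and
# action sets below are arbitrary placeholders).
modules = [
    QModule(state_filter=lambda obs: obs[0:2], actions=[0, 1, 2, 3]),
    QModule(state_filter=lambda obs: obs[2:4], actions=[0, 1, 2, 3]),
]

Under the round-robin coordination assumed above, each robot would query its modules and make one discrete move during its turn, treating the remaining robots as static.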
This paper is structured as follows: first, section 2 introduces the Modular Concurrent Q-Learning concepts and issues, including the features of our own modular approach to L-MCRS. Section 3 presents the application chosen to test our approach. Section 4 shows the results obtained, giving details of the conducted experiments. Finally, conclusions and comments on future work are presented in section 5.
2 Modular Concurrent Q-Learning
Let n be the number of learning agents in a multi-component system, m the
number of modules each agent runs and c the number of possible actions each
 