Coordination of Communication in Robot Teams by Reinforcement Learning - Foundations on Natural and Artificial Computation


the listener coincides with the speaker's meaning, the game is a success; a failure occurs when the two meanings differ. After a success, the corresponding coefficients of the association matrices in both robots are increased, and the competing association coefficients (i.e. the other entries in the speaker's row and in the listener's column) are updated in the opposite direction. This additional update is known as lateral inhibition, and it is a key element of the convergence process. Similarly, after a failure the coefficients involved are decreased in both robots.
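The success/failure rule with lateral inhibition can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: both association matrices are assumed to be indexed as [meaning][symbol], the size of each change is a hypothetical fixed step `delta` (this part of the text does not give it), and coefficients are clipped to [0, 1], which is also an assumption.

```python
from typing import List

Matrix = List[List[float]]  # rows = meanings, columns = symbols (assumption)


def _clip(x: float, lo: float = 0.0, hi: float = 1.0) -> float:
    # Keep a coefficient inside [lo, hi]; these bounds are an assumption.
    return max(lo, min(hi, x))


def update_after_game(speaker: Matrix, listener: Matrix, meaning: int,
                      symbol: int, decoded: int, delta: float = 0.1) -> bool:
    """Update both robots in place after one game; `delta` is hypothetical."""
    success = decoded == meaning
    if success:
        # Reinforce the association that both robots used.
        speaker[meaning][symbol] = _clip(speaker[meaning][symbol] + delta)
        listener[meaning][symbol] = _clip(listener[meaning][symbol] + delta)
        # Lateral inhibition: competing symbols in the speaker's row...
        for j in range(len(speaker[meaning])):
            if j != symbol:
                speaker[meaning][j] = _clip(speaker[meaning][j] - delta)
        # ...and competing meanings in the listener's column move down.
        for i in range(len(listener)):
            if i != meaning:
                listener[i][symbol] = _clip(listener[i][symbol] - delta)
    else:
        # Failure: weaken the associations each robot actually used.
        speaker[meaning][symbol] = _clip(speaker[meaning][symbol] - delta)
        listener[decoded][symbol] = _clip(listener[decoded][symbol] - delta)
    return success
```

For example, with delta=0.25 and both matrices initialised to 0.5, a successful game for meaning 0 and symbol 1 leaves the speaker with [[0.25, 0.75], [0.5, 0.5]] and the listener with [[0.5, 0.75], [0.5, 0.25]].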

for k = 1, 2, ..., max rounds do
    Execute all the possible communication acts
    Compute the communicative efficiency of the robot team, EC(k)
    if EC(k) = Max in three consecutive rounds then
        break
    end if
end for

Fig. 1. Pseudocode of the reinforcement learning-based lexical coordination procedure
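The outer loop of Fig. 1 can be sketched in Python. The stopping rule is the one in the figure (EC(k) equal to its maximum in three consecutive rounds); how EC(k) is computed is not defined here, so each round is abstracted as a hypothetical callable `play_round(k)` that executes all the communication acts and returns EC(k).

```python
from typing import Callable, Tuple


def coordinate(play_round: Callable[[int], float], max_ec: float,
               max_rounds: int, patience: int = 3) -> Tuple[int, bool]:
    """Fig. 1 loop: stop once EC(k) == max_ec in `patience` consecutive rounds.

    `play_round(k)` is a hypothetical callable that executes all the
    communication acts of round k and returns EC(k).
    Returns (rounds executed, whether the stopping rule was met).
    """
    streak = 0
    for k in range(1, max_rounds + 1):
        ec = play_round(k)
        # Count consecutive maximal rounds; any non-maximal round resets it.
        streak = streak + 1 if ec == max_ec else 0
        if streak == patience:
            return k, True
    return max_rounds, False
```

For instance, with EC values 3, 5, 6, 6, 5, 6, 6, 6 and a maximum of 6, the loop stops at round 8, the first round that completes three consecutive maxima. Returning a flag alongside the round count distinguishes convergence from simply running out of rounds.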

Randomly assign the sender/receiver roles
for k = 1, 2, ..., number of meanings do
    Send the meaning m_k according to the sender's association matrix
    Decode the received symbol s_k according to the receiver's association matrix
    Update both matrices depending on the communication result
end for

Fig. 2. Pseudocode of a communication act
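A communication act as in Fig. 2 might look like the sketch below. It assumes a greedy reading of "according to the association matrix": the speaker emits the symbol with the largest coefficient in the meaning's row, and the listener decodes the meaning with the largest coefficient in the symbol's column (ties go to the lowest index). The update rule is passed in as a hypothetical `update` callable, so either of the algorithms of Section 3.2 could be plugged in; all function names are illustrative.

```python
import random
from typing import Callable, List

Matrix = List[List[float]]  # rows = meanings, columns = symbols (assumption)
UpdateFn = Callable[[Matrix, Matrix, int, int, int], None]


def encode(matrix: Matrix, meaning: int) -> int:
    """Speaker: symbol with the largest coefficient in the meaning's row."""
    row = matrix[meaning]
    return max(range(len(row)), key=row.__getitem__)


def decode(matrix: Matrix, symbol: int) -> int:
    """Listener: meaning with the largest coefficient in the symbol's column."""
    return max(range(len(matrix)), key=lambda i: matrix[i][symbol])


def communication_act(robot_a: Matrix, robot_b: Matrix, update: UpdateFn,
                      rng: random.Random) -> int:
    """One communication act (Fig. 2); returns the number of successful games."""
    # Randomly assign the sender/receiver roles.
    if rng.random() < 0.5:
        speaker, listener = robot_a, robot_b
    else:
        speaker, listener = robot_b, robot_a
    successes = 0
    for meaning in range(len(speaker)):
        symbol = encode(speaker, meaning)    # send meaning m_k
        decoded = decode(listener, symbol)   # decode received symbol s_k
        if decoded == meaning:
            successes += 1
        # Hypothetical hook: update both matrices from the result.
        update(speaker, listener, meaning, symbol, decoded)
    return successes
```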

The ultimate goal is that, after all the rounds of language games have been executed, the robot team converges to an optimal communication system in which all the robots use the same optimal permutation matrix (the optimal Saussurean solution).
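One way to test whether a team has reached this solution is sketched below. It assumes that each robot's matrix is binarized by keeping only the largest coefficient in each row; the solution is reached when this binarized matrix is a permutation matrix (a bijection between meanings and symbols) and is identical for every robot. The helper names `lexicon` and `is_saussurean` are hypothetical.

```python
from typing import List, Sequence

Matrix = Sequence[Sequence[float]]  # rows = meanings, columns = symbols (assumption)


def lexicon(matrix: Matrix) -> List[int]:
    """Binarize by rows: map each meaning to its highest-weighted symbol."""
    return [max(range(len(row)), key=row.__getitem__) for row in matrix]


def is_saussurean(team: Sequence[Matrix]) -> bool:
    """True if all robots share one row-wise lexicon and that lexicon is a
    bijection between meanings and symbols (a permutation matrix)."""
    lexicons = [lexicon(m) for m in team]
    n_symbols = len(team[0][0])
    # A permutation needs every symbol used exactly once (square matrix).
    is_permutation = sorted(lexicons[0]) == list(range(n_symbols))
    return is_permutation and all(lex == lexicons[0] for lex in lexicons)
```

A lexicon such as [0, 0], where two meanings share one symbol, fails the permutation check even if every robot agrees on it.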

3.2 Algorithms for the Updating of the Association Matrices

We have applied two different algorithms for updating the coefficients of the association matrices: (a) an Ant Colony Optimization-based algorithm (ACO-like for short) and (b) the incremental algorithm.

In the ACO-like algorithm, the coefficients of the association matrix are updated as follows:

$$a_{ij}(k+1) = \rho\, a_{ij}(k) + (1-\rho)\,\beta(k), \qquad \beta(k) = \begin{cases} 1 & \text{if reward/success} \\ 0 & \text{if punish/fail} \end{cases}, \qquad 0 \le \rho \le 1 \qquad (4)$$

where ρ, which sets the weight given to the previous value a_ij(k), is a critical parameter that must be selected carefully [1].
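A worked sketch of Eq. (4) for a single coefficient follows. The function name `aco_update` is hypothetical, and since the text does not state how lateral inhibition is applied within the ACO-like rule, only the scalar update is shown.

```python
def aco_update(a_ij: float, success: bool, rho: float) -> float:
    """Eq. (4): a_ij(k+1) = rho * a_ij(k) + (1 - rho) * beta(k).

    beta(k) is 1 after a reward/success and 0 after a punishment/failure.
    `aco_update` is a hypothetical helper name.
    """
    if not 0.0 <= rho <= 1.0:
        raise ValueError("rho must satisfy 0 <= rho <= 1")
    beta = 1.0 if success else 0.0
    return rho * a_ij + (1.0 - rho) * beta
```

Because Eq. (4) is a convex combination of a_ij(k) and β(k) ∈ {0, 1}, a coefficient that starts in [0, 1] stays there. With ρ close to 1 each update changes a_ij only slightly; with ρ close to 0 the coefficient essentially copies the latest β(k). After n consecutive successes, a_ij = 1 − ρ^n (1 − a_ij(0)); for example, starting from a_ij = 0 with ρ = 0.5, three successes give 0.5, 0.75 and 0.875.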
