technique for grouping states (Soft State Aggregation, or SSA), which allows them to represent an infinite space of continuous values with $D$ dimensions in a $D$-dimensional finite space of clusters [17]. For the case study, the environment state space is seen as a $D$-dimensional matrix that groups the states (points in $\mathbb{R}^D$) exponentially. The discrete position (or cluster index) $i$ for the continuous variable $x$ is computed as:

$$i = \min(\log_2(x + 1), \omega)$$
Here $i$ is bounded by a number $\omega$, chosen close to the maximum value expected for that variable, and the same is done for each of the $D$ dimensions taken into account for the environment representation. Therefore the total number of states (strictly, state-action entries) that every agent must represent is $\omega^D \times |A|$. The number of actions $|A|$ is known by every agent (although it may change with time).
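As a rough illustration of the aggregation rule, the sketch below computes the cluster index for one variable and the resulting table size. Two points are assumptions rather than something the text states: the logarithm is floored to obtain an integer index, and the index is capped at ω − 1 (not ω), so that ω = 5 yields five clusters with the last one starting at x = 15, as reported for the experiments. The function names are hypothetical.

```python
import math


def cluster_index(x: float, omega: int) -> int:
    """Map a non-negative value x to a cluster index in 0 .. omega - 1.

    Assumptions (not stated in the text): the logarithm is floored, and
    the cap is omega - 1 so that exactly omega clusters exist per variable.
    """
    if x < 0:
        raise ValueError("x is assumed to be non-negative")
    return min(math.floor(math.log2(x + 1)), omega - 1)


def table_entries(omega: int, dims: int, n_actions: int) -> int:
    """Number of state-action entries: omega ** dims * |A|."""
    return omega ** dims * n_actions
```

Under these assumptions, ω = 5 gives the clusters [0, 1), [1, 3), [3, 7), [7, 15) and [15, +∞) for each variable.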
Two types of agents are distinguished: those that can receive and understand social opinions (running Social-Welfare RL) and those that cannot (using standard Q-learning).
The $Q$-table was learned individually, both for the social-aware agents and for the non-social ones. The parameter $\omega$ was fixed at 5, which means that the last state for each dimension represents values of $x \in [15, +\infty)$. The dimensions (or variables) taken into account to represent the environment are the number of products of each kind in the market plus the agent's life and balance.
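For reference, the standard tabular Q-learning update that the non-social agents are said to use can be sketched over aggregated states as follows. The learning rate `alpha`, the discount factor `gamma`, the reward value and the action labels are illustrative assumptions, since the text does not give them. The Social-Welfare RL variant is not covered here.

```python
from collections import defaultdict

# Hypothetical action labels: one per product that can be created, plus eating.
ACTIONS = ("create_wheat", "create_flour", "create_bread", "eat")


def q_update(q, state, action, reward, next_state, alpha=0.1, gamma=0.9):
    """Apply one standard Q-learning update and return the new Q-value.

    `state` and `next_state` are tuples of cluster indices, one per variable.
    `alpha` and `gamma` are assumed values, not taken from the text.
    """
    best_next = max(q[(next_state, a)] for a in ACTIONS)
    td_target = reward + gamma * best_next
    q[(state, action)] += alpha * (td_target - q[(state, action)])
    return q[(state, action)]


# Every entry starts at 0.0, so all agents share the same initial table.
q = defaultdict(float)
s = (0, 0, 0, 2, 1)       # indices for three product counts, life, balance
s_next = (1, 0, 0, 2, 1)  # one more unit of the first product on the market
q_update(q, s, "create_wheat", reward=1.0, next_state=s_next)
```

A dictionary keyed by (state, action) is used only to keep the sketch short. A dense array with one axis per variable plus one for the action would hold the same information.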
Three types of products were loaded: Wheat, with no dependencies and needing 1 cycle; Flour, requiring 2 units of Wheat and 2 cycles; and Bread, edible, providing 10 units of life and requiring 2 units of Flour and 2 cycles.
The actions available to the agents are the creation of any of the products, plus one more called eating. The total representation space needed by each agent is therefore only $5^{(3+2)} \times 4 = 12500$ states. The learning stage took 1,000,000 cycles with an exploration probability $k = 1$, in the hope of exploring as many states as possible, as many times as possible. Every agent starts with exactly the same $Q$-matrix at the beginning of the simulation. The experiment includes three different scenarios, and thus two significant changes. The scenarios are:
1. No changes in the environment.
2. Relaxing the production rules: Flour needs 1 unit of time and 1 unit of Wheat; Bread needs 1 unit of time and 1 unit of Flour.
3. Hardening the production rules: Wheat needs 2 units of time; Flour needs 2 units of time and 3 units of Wheat; Bread stays as originally, needing 2 units of time and 2 units of Flour.
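The three scenarios can be written down as a small recipe table, which makes the differences between them easy to compare. This is only a sketch: the `Recipe` type, the `SCENARIOS` mapping and the `can_produce` helper are hypothetical names. The baseline Flour recipe is read as consuming 2 units of Wheat, and Wheat is assumed to keep its 1-cycle cost in scenario 2, since the relaxed rules do not mention it.

```python
from typing import Dict, NamedTuple


class Recipe(NamedTuple):
    cycles: int             # units of time needed to create the product
    inputs: Dict[str, int]  # units of other products consumed


SCENARIOS = {
    1: {  # baseline rules, no changes
        "Wheat": Recipe(1, {}),
        "Flour": Recipe(2, {"Wheat": 2}),
        "Bread": Recipe(2, {"Flour": 2}),
    },
    2: {  # relaxed rules (Wheat assumed unchanged)
        "Wheat": Recipe(1, {}),
        "Flour": Recipe(1, {"Wheat": 1}),
        "Bread": Recipe(1, {"Flour": 1}),
    },
    3: {  # hardened rules
        "Wheat": Recipe(2, {}),
        "Flour": Recipe(2, {"Wheat": 3}),
        "Bread": Recipe(2, {"Flour": 2}),
    },
}


def can_produce(recipes: Dict[str, Recipe], product: str, stock: Dict[str, int]) -> bool:
    """Return True if `stock` holds every input the recipe requires."""
    needed = recipes[product].inputs
    return all(stock.get(item, 0) >= units for item, units in needed.items())
```

For example, a stock of 2 units of Wheat is enough to create Flour under scenarios 1 and 2, but not under scenario 3.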
As a remark, it is important to control the value of $\kappa$ so that it never reaches 0. To that end, the computation of $\kappa$ is corrected by a small positive value, tending to 0 but still representable by the machine, which acts as a lower bound so that $0 < \kappa \le 1$.
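One possible way to apply such a correction is a simple clamp, sketched below. The choice of `sys.float_info.epsilon` as the lower bound is an assumption made for illustration, as are the names `KAPPA_FLOOR` and `clamp_kappa`. The text only requires a small positive value that the machine can represent.

```python
import sys

# Assumed lower bound: machine epsilon for Python floats (about 2.2e-16).
KAPPA_FLOOR = sys.float_info.epsilon


def clamp_kappa(kappa: float, floor: float = KAPPA_FLOOR) -> float:
    """Keep kappa within [floor, 1] so that it never reaches 0."""
    return max(floor, min(1.0, kappa))
```

If κ is later used as a divisor, a larger floor may be preferable to avoid very large intermediate values.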
It is expected that the social-aware agents will adapt better than the non-social ones, which use traditional reinforcement learning (SSA Q-learning).