technique for grouping states (Soft State Aggregation, or SSA) which allows them
to represent an infinite space of continuous values, with D dimensions, in a
D-dimensional finite space of clusters [17]. For the case study, the environment
state space (ℝ^D) is seen as a D-dimensional matrix, grouping the states
exponentially. The discrete position (or cluster index) i for a continuous
variable x is computed as:

i = min(⌊log₂(x + 1)⌋, ω − 1)

where ω bounds the index at a value chosen to be near the maximum expected for
that variable, applied to each of the D dimensions taken into account for the
environment representation, so that indices run over {0, …, ω − 1}. The total
set of states every agent must represent is therefore ω^D × |A|. The number of
actions |A| is known by every agent (although it may change over time).
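A minimal sketch of this clustering in Python (the paper gives no code; the
ω − 1 bound and all identifiers are assumptions, chosen so that the figures
reported below, namely ω = 5 with the last cluster covering x ≥ 15 and a
12,500-entry table, come out consistently):

    import math

    def cluster_index(x: float, omega: int = 5) -> int:
        """Exponential grouping: i = floor(log2(x + 1)), bounded so the
        index stays in {0, ..., omega - 1}."""
        return min(int(math.log2(x + 1)), omega - 1)

    # With omega = 5 the last cluster (i = 4) covers every x >= 15:
    assert cluster_index(14) == 3
    assert cluster_index(15) == 4
    assert cluster_index(10**6) == 4

    # Total table size for D dimensions and |A| actions: omega**D * |A|
    omega, D, num_actions = 5, 5, 4
    print(omega**D * num_actions)  # 12500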
Two types of agents are distinguished: those that can receive and understand
social opinions, executing Social-Welfare RL, and those that cannot, which use
standard Q-learning.
Learning of the Q-table was done individually for both the social-aware agents
and the non-social ones. Parameter ω was fixed at 5, which means that the last
state for each dimension represents values of x ∈ [15, +∞). The dimensions (or
variables) taken into account to represent the environment are the number of
products of each kind in the market, plus the agent's life and balance.
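These five variables can be encoded as a tuple of cluster indices; a
hypothetical helper (reusing the cluster_index sketch above, with purely
illustrative names):

    import math

    def cluster_index(x: float, omega: int = 5) -> int:
        return min(int(math.log2(x + 1)), omega - 1)

    def state(wheat: int, flour: int, bread: int,
              life: float, balance: float) -> tuple:
        """One cluster index per product kind in the market,
        plus the agent's life and balance."""
        return tuple(cluster_index(v)
                     for v in (wheat, flour, bread, life, balance))

    # e.g. 7 wheat, 2 flour, 0 bread, life 20, balance 3:
    print(state(7, 2, 0, life=20, balance=3))  # (3, 1, 0, 4, 2)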
Three types of products were loaded: Wheat, with no dependencies and needing
1 cycle; Flour, requiring 2 units of wheat and 2 cycles; and Bread, edible,
providing 10 units of life, requiring 2 units of flour and 2 cycles.
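The baseline production rules can be summarised as a small data structure; a
sketch with illustrative field names (inputs, cycles, life) not taken from the
paper:

    # Scenario 1 (baseline) production rules.
    PRODUCTS = {
        "Wheat": {"inputs": {},           "cycles": 1, "life": 0},
        "Flour": {"inputs": {"Wheat": 2}, "cycles": 2, "life": 0},
        "Bread": {"inputs": {"Flour": 2}, "cycles": 2, "life": 10},  # edible
    }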
The actions available to the agents are the creation of any of the products,
plus one more called eating. The total representation space needed by each
agent is thus only 5^(3+2) × 4 = 12,500 states. The learning stage took
1,000,000 cycles, with an exploration probability of κ = 1, hoping to explore
as many states as possible, as many times as possible. Every agent starts the
simulation with exactly the same Q-matrix. The experiment comprises three
scenarios, i.e., two significant changes to the environment (the two modified
rule sets are sketched in code after the list). The scenarios are:
1. No changes in the environment.
2. Relaxing the production rules: Flour needs 1 unit of time and 1 unit of
Wheat ; Bread needs 1 unit of time and 1 unit of Flour .
3. Hardening the production rules: Wheat needs 2 units of time; Flour needs
2 units of time and 3 units of Wheat; Bread stays as in the original setting,
needing 2 units of time and 2 units of Flour.
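In the same illustrative format as PRODUCTS above, the two modified rule sets
could look as follows (again a sketch, not the paper's notation):

    # Scenario 2: relaxed rules (Wheat unchanged).
    RELAXED = {
        "Wheat": {"inputs": {},           "cycles": 1, "life": 0},
        "Flour": {"inputs": {"Wheat": 1}, "cycles": 1, "life": 0},
        "Bread": {"inputs": {"Flour": 1}, "cycles": 1, "life": 10},
    }

    # Scenario 3: hardened rules (Bread as in the baseline).
    HARDENED = {
        "Wheat": {"inputs": {},           "cycles": 2, "life": 0},
        "Flour": {"inputs": {"Wheat": 3}, "cycles": 2, "life": 0},
        "Bread": {"inputs": {"Flour": 2}, "cycles": 2, "life": 10},
    }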
As a remark, it is important to control the value of κ so that it never
reaches 0. To that end, the computation of κ is corrected by the smallest
positive value representable by the machine, so that 0 < κ ≤ 1.
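One plausible reading of that correction, flooring κ at the smallest positive
machine float (whether the paper adds this value or floors at it is not
recoverable from the extracted text, so this is an assumption):

    import sys

    def corrected_kappa(kappa: float) -> float:
        # Assumption: floor kappa at the smallest positive representable
        # float and cap it at 1, guaranteeing 0 < kappa <= 1.
        return min(max(kappa, sys.float_info.min), 1.0)

    print(corrected_kappa(0.0))  # ~2.2e-308, strictly positive
    print(corrected_kappa(0.7))  # 0.7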
The social-aware agents are expected to adapt better than the non-social ones,
which use traditional reinforcement learning (SSA Q-learning).