Although the agent is learning, it cannot distinguish between what has not been learned yet and what has changed in the environment. It is therefore desirable to stop learning (or to learn at a slower pace) once the agent has a good approximation of the real functions. The probability of exploration (k) can help if it is made a function of the uncertainty about these functions.
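As a rough, generic sketch of this idea (the concrete mechanism used here, via the KAI index and the OCC emotions, is developed below), the exploration probability can be tied to a running uncertainty estimate such as the recent temporal-difference error; all names and constants in this snippet are assumptions for illustration only:

```python
from collections import deque

class UncertaintyDrivenExploration:
    """Illustrative only: an exploration probability k that grows with an
    uncertainty proxy (here the mean absolute TD error over recent steps)."""

    def __init__(self, window=50, k_min=0.01, k_max=0.5):
        self.errors = deque(maxlen=window)          # recent |TD error| values
        self.k_min, self.k_max = k_min, k_max

    def update(self, td_error):
        self.errors.append(abs(td_error))

    def k(self):
        if not self.errors:
            return self.k_max                        # nothing learned yet: explore a lot
        u = sum(self.errors) / len(self.errors)      # uncertainty proxy
        u = min(1.0, u)                              # assumes roughly unit-scale rewards
        return self.k_min + (self.k_max - self.k_min) * u
```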
Following the previous model (Section 3), the KAI index will represent k, and it has to be computed from the emotional response of each agent. The emotional response must therefore include a means to feel (compute) the uncertainty of the learned functions.
Taking into account, also from the model, that the agent should give a negative opinion when its perception of the situation is bad (and a positive one when it is good), the emotional response will also need a way to express these outcomes when executing actions in the environment. To fulfill these two requirements, two of the OCC emotions will be directly linked to the computation of the KAI index k: the joy/distress of the agent, denoted J, and the hope/fear, denoted H.
The functions given below for J and H will define a pessimistic agent, i.e. when changes occur, the agent will feel distress more easily than joy. That will enable the agent to move faster into different policies, some of which could improve the situation (and thus make the agent feel joy). The disadvantage is that the agent will need more time to converge to a stable policy. In human terms, the agent's behavior could be described as cautious.
The functions which compute J and H are shown below:

J = (Q_{t+1} − Q_t) / (2 max(|Q_{t+1}|, |Q_t|) + ε)

H = (Q_{t+1} − M_p) / (2 max(|Q_{t+1}|, |M_p|) + ε)

M_p = (1/p) Σ_{i=t−p}^{t−1} Q_i
Notation: Q_t means the discounted reward the agent¹ has learned so far at timestep t for the given state and the taken action, such that Q_t ≡ Q_t(s_t, a_t). The normalization is done by the maximum of the pair Q_{t+1}, Q_t; then J ∈ ]−1, +1[. M_p is the average of the last p steps, and H ∈ ]−1, +1[ is the agent's hope to improve its performance, remembering only the p last values. ε > 0 is a small constant, to avoid division by 0 in J and H.
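A minimal sketch of these two computations, assuming a small constant ε (written eps) in the denominators and a plain list holding the previously learned Q-values, could look as follows; the function names are illustrative:

```python
def joy_distress(q_next, q_curr, eps=1e-6):
    """J in ]-1, +1[: positive when the Q-value just improved, negative when it dropped."""
    return (q_next - q_curr) / (2 * max(abs(q_next), abs(q_curr)) + eps)

def hope_fear(q_next, q_history, p, eps=1e-6):
    """H in ]-1, +1[: compares the new Q-value with M_p, the average of the last p values."""
    window = q_history[-p:]                 # the p most recent learned Q-values
    m_p = sum(window) / len(window)         # M_p
    return (q_next - m_p) / (2 * max(abs(q_next), abs(m_p)) + eps)
```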
The next step is to define the social opinion function. The agent will give its opinion based on the satisfaction it experiences, so it seems logical to describe a function for the representation of the OCC emotion "satisfaction" and use its value as the opinion as well. First, let us define a prospective happiness value, as a function of J and H, as Ψ = (JH + J)/2. This definition of Ψ allows an agent to give a positive opinion when it feels joy and feels no fear (J → 1, H → 1). The agent will give a negative opinion when it feels distress and has fear (J → −1, H → −1). The limit values of Ψ are shown in Table 1 (also, in a static environment, when t → ∞, it is expected to find J → 0 and H → 0, by their definitions).
¹ Subscripts have been omitted for the sake of clarity; e.g. when it is said Q(s, a), it really means Q(s, a)_{A_i}, for the agent A_i. This holds for every parameter in the section.