To encourage the agent to actively create data leading to easily learnable improvements of p (Schmidhuber 1991a), the reward signal r(t) is split into two scalar real-valued components: r(t) = g(r_ext(t), r_int(t)), where g maps pairs of real values to real values, e.g., g(a, b) = a + b. Here r_ext(t) denotes traditional external reward provided by the environment, such as negative reward for bumping into a wall, or positive reward for reaching some teacher-given goal state. The Formal Theory of Creativity, however, is mostly interested in r_int(t), the intrinsic reward, which is provided whenever the model's quality improves; for purely creative agents r_ext(t) = 0 for all valid t. Formally, the intrinsic reward for the model's progress (due to some application-dependent model improvement algorithm) between times t and t + 1 is

r_int(t + 1) = f[C(p(t), h(≤ t + 1)), C(p(t + 1), h(≤ t + 1))],    (12.2)
where f maps pairs of real values to real values. Various progress measures are possible; most obvious is f(a, b) = a − b. This corresponds to a discrete time version of maximising the first derivative of the model's quality. Both the old and the new model have to be tested on the same data, namely, the history so far. That is, progress between times t and t + 1 is defined based on two models of h(≤ t + 1), where the old one is trained only on h(≤ t) and the new one also gets to see h(t ≤ · ≤ t + 1). This is like p(t) predicting data of time t + 1, then observing it, then learning something, then becoming a measurably improved model p(t + 1).
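To make the split reward concrete, the following is a minimal Python sketch of this computation, assuming a toy predictive model (a constant predictor equal to the running mean of past scalar observations) whose summed squared prediction error over the history stands in for C(p, h(≤ t)); the function names and the toy model are illustrative assumptions, not part of the original formulation.

    # Minimal sketch of the split reward r(t) = g(r_ext(t), r_int(t)), with a toy
    # constant predictor whose summed squared prediction error over the history
    # plays the role of the cost C(p, h(<= t)).

    def model_error(model_mean, history):
        # Toy stand-in for C(p, h(<= t)): summed squared prediction error of the
        # constant predictor model_mean over the whole history.
        return sum((x - model_mean) ** 2 for x in history)

    def intrinsic_reward(old_mean, new_mean, history):
        # r_int(t+1) = f(C(p(t), h(<= t+1)), C(p(t+1), h(<= t+1))) with f(a, b) = a - b.
        # Both the old and the improved model are evaluated on the same data.
        a = model_error(old_mean, history)   # cost of the old model p(t)
        b = model_error(new_mean, history)   # cost of the improved model p(t+1)
        return a - b                         # positive iff the model measurably improved

    def total_reward(r_ext, r_int):
        # r(t) = g(r_ext(t), r_int(t)) with the simple additive choice g(a, b) = a + b.
        return r_ext + r_int

    # Example: after observing one new data point, "retrain" the model (recompute
    # the mean) and reward the controller for the resulting drop in prediction error.
    history = [1.0, 1.2, 0.9, 1.1]                     # h(<= t+1)
    old_mean = sum(history[:-1]) / len(history[:-1])   # model p(t), trained on h(<= t)
    new_mean = sum(history) / len(history)             # model p(t+1), trained on h(<= t+1)
    r_int = intrinsic_reward(old_mean, new_mean, history)
    print(total_reward(r_ext=0.0, r_int=r_int))        # purely creative agent: r_ext = 0

The sketch only illustrates the bookkeeping; in the theory, the model improvement step is whatever application-dependent learning algorithm the agent is equipped with.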
The above description of the agent's motivation separates the goal (finding or
making data that can be modelled better or faster than before) from the means of
achieving the goal. The controller's RL mechanism must figure out how to translate
such rewards into action sequences that allow the given world model improvement
algorithm to find and exploit previously unknown types of regularities. It must trade
off long-term vs short-term intrinsic rewards of this kind, taking into account all
costs of action sequences (Schmidhuber 1999; 2006a).
The field of Reinforcement Learning (RL) offers many more or less powerful methods for maximising expected reward as requested above (Kaelbling et al. 1996). Some were used in our earlier implementations of curious, creative systems; see Sect. 12.4 for a more detailed overview of previous simple artificial scientists and artists (1990-2002). Universal RL methods (Hutter 2005, Schmidhuber 2009d) as well as RNN-based RL (Schmidhuber 1991b) and SSA-based RL (Schmidhuber 2002a) can in principle learn useful internal states memorising relevant previous events; less powerful RL methods (Schmidhuber 1991a, Storck et al. 1995) cannot.
In theory C(p, h(≤ t)) should take the entire history of actions and perceptions into account (Schmidhuber 2006a), like the performance measure C_xry:

C_xry(p, h(≤ t)) = Σ_{τ=1}^{t} [ ||pred(p, x(τ)) − x(τ)||² + ||pred(p, r(τ)) − r(τ)||² + ||pred(p, y(τ)) − y(τ)||² ],    (12.3)
where pred(p, q) is p's prediction of event q from earlier parts of the history.
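As an illustration of how such a history-wide performance measure could be evaluated, here is a Python sketch of Eq. (12.3), assuming the history is stored as a list of (x, r, y) triples and that a prediction function with the interface pred(prefix, key) is given; this storage format and interface are assumptions of the sketch, since the chapter specifies pred(p, q) only abstractly.

    import numpy as np

    def c_xry(pred, history):
        # Sketch of Eq. (12.3): summed squared prediction errors of a model over
        # the whole history of inputs x, rewards r, and actions y.
        # pred(prefix, key) is a hypothetical interface: it returns the model's
        # prediction of the event named key ("x", "r" or "y") at the current step,
        # conditioned only on the earlier part of the history.
        total = 0.0
        for tau in range(len(history)):
            prefix = history[:tau]            # what the model may condition on
            x, r, y = history[tau]            # actual events at time tau
            for key, actual in (("x", x), ("r", r), ("y", y)):
                error = np.asarray(pred(prefix, key)) - np.asarray(actual)
                total += float(np.sum(error ** 2))   # ||pred - actual||^2
        return total

    # Example with a trivial "always predict zero" model on a toy scalar history.
    history = [(0.5, 0.0, 1.0), (0.7, 1.0, 0.0)]
    print(c_xry(lambda prefix, key: 0.0, history))   # 0.25 + 0 + 1 + 0.49 + 1 + 0 = 2.74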
C_xry ignores the danger of overfitting (too many parameters for few data) through a p that stores the entire history without compactly representing its regularities,