12.4 Previous Approximative Implementations of the Theory
Since 1990 I have built simple artificial scientists or artists with an intrinsic desire
to build a better model of the world and what can be done in it. They embody
approximations of the theory of Sect. 12.3. The agents are motivated to continually
improve their models, by creating or discovering more surprising, novel patterns,
that is, data predictable or compressible in hitherto unknown ways. They actively
invent experiments (algorithmic protocols or programs or action sequences) to
explore their environment, always trying to learn new behaviours (policies) exhibiting
previously unknown regularities or patterns. Crucial ingredients are:
1. An adaptive world model, essentially a predictor or compressor of the
   continually growing history of actions and sensory inputs, reflecting current
   knowledge about the world,
2. A learning algorithm that continually improves the model (detecting novel,
   initially surprising spatio-temporal patterns, including works of art, that
   subsequently become known patterns),
3. Intrinsic rewards measuring the model's improvements due to its learning
   algorithm (thus measuring the degree of subjective novelty & surprise),
4. A separate reward optimiser or reinforcement learner, which translates those
   rewards into action sequences or behaviours expected to optimise future reward.
These ingredients make the agents curious and creative: they get intrinsically
motivated to acquire skills leading to a better model of the possible interactions with
the world, discovering additional “eye-opening” novel patterns (including works of art)
predictable or compressible in previously unknown ways.
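
The interplay of the four ingredients can be illustrated with a minimal sketch. It is not one of the implementations cited in this section. The toy environment (one action that always yields the same observation, one that yields noise), the tabular predictor, the epsilon-greedy optimiser, and all names and parameter values are assumptions chosen for brevity.

```python
import random


class WorldModel:
    """Ingredient 1: a predictor of the observation that follows each action."""

    def __init__(self, n_actions, learning_rate=0.2):
        self.pred = [0.0] * n_actions
        self.history = [[] for _ in range(n_actions)]
        self.learning_rate = learning_rate

    def history_error(self, action):
        """Mean squared prediction error over all observations seen for this action."""
        seen = self.history[action]
        return sum((obs - self.pred[action]) ** 2 for obs in seen) / len(seen)

    def learn(self, action, obs):
        """Ingredient 2 (one learning step) and ingredient 3 (intrinsic reward)."""
        self.history[action].append(obs)
        before = self.history_error(action)
        self.pred[action] += self.learning_rate * (obs - self.pred[action])
        after = self.history_error(action)
        return before - after  # reward = model improvement on the stored history


def environment(action, rng):
    # Hypothetical toy world: action 0 is a learnable regularity, action 1 is noise.
    return 1.0 if action == 0 else rng.uniform(-1.0, 1.0)


def run(steps=200, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    model = WorldModel(n_actions=2)
    value = [0.0, 0.0]  # Ingredient 4: running estimate of intrinsic reward per action
    rewards = []
    for _ in range(steps):
        if rng.random() < epsilon:
            action = rng.randrange(2)  # occasional random exploration
        else:
            action = max(range(2), key=lambda a: value[a])  # greedy choice
        obs = environment(action, rng)
        reward = model.learn(action, obs)
        value[action] += 0.1 * (reward - value[action])
        rewards.append((action, reward))
    return model, value, rewards
```

In this sketch the constant observation yields positive reward at first, and the reward shrinks as the pattern becomes known. The noise should yield reward that fluctuates around zero, since no update improves the fit to the stored history for long.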
Ignoring issues of computation time, it is possible to devise mathematically
optimal, universal RL methods (Hutter 2005, Schmidhuber 2009d) for such systems
(Schmidhuber 2006a; 2010) (2006-). However, previous practical implementations
(Schmidhuber 1991a, Storck et al. 1995, Schmidhuber 2002a) were non-universal
and made approximative assumptions. Among the many ways of combining
methods for (1-4) we implemented the following variants:
A. Non-traditional RL based on adaptive recurrent neural networks as predictive
   world models is used to maximise intrinsic reward created in proportion to
   prediction error (Schmidhuber 1991b).
B. Traditional RL (Kaelbling et al. 1996) is used to maximise intrinsic reward
   created in proportion to improvements of prediction error (Schmidhuber 1991a).
C. Traditional RL maximises intrinsic reward created in proportion to relative
   entropies between the agent's priors and posteriors (Storck et al. 1995).
D. Non-traditional RL (Schmidhuber et al. 1997) (without restrictive Markovian
   assumptions) learns probabilistic, hierarchical programs and skills through
   zero-sum intrinsic reward games of two players, each trying to out-predict or
   surprise the other, taking into account the computational costs of learning, and
   learning when to learn and what to learn (1997-2002) (Schmidhuber 1999; 2002a).
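
The reward signals of variants A to C differ only in what is measured, which a short sketch can make concrete. The function names, the squared-error measure, the count-based categorical belief, and the direction of the relative entropy (posterior relative to prior) are assumptions made for illustration. They are not taken from the cited implementations, and variant D is not covered.

```python
import math


def reward_prediction_error(predicted, observed):
    """Variant A: reward in proportion to the prediction error itself."""
    return (observed - predicted) ** 2


def reward_error_improvement(error_before, error_after):
    """Variant B: reward in proportion to the reduction of prediction error."""
    return error_before - error_after


def normalise(counts):
    total = sum(counts)
    return [c / total for c in counts]


def kl_divergence(p, q):
    """Relative entropy KL(p || q) in nats; assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def reward_information_gain(counts, outcome):
    """Variant C: relative entropy between the belief before and after one observation.

    The belief is a categorical distribution estimated from strictly positive
    counts (an assumed representation). Returns the reward and the updated counts.
    """
    prior = normalise(counts)
    updated = list(counts)
    updated[outcome] += 1
    posterior = normalise(updated)
    return kl_divergence(posterior, prior), updated
```

Under these assumptions, reward A stays high on unpredictable noise, whereas rewards B and C fade once further observations no longer change the model: an observation added to counts of [100, 100] shifts the belief far less than one added to [1, 1].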