12.4 Previous Approximative Implementations of the Theory
Since 1990 I have built simple artificial scientists or artists with an intrinsic desire to build a better model of the world and what can be done in it. They embody approximations of the theory of Sect. 12.3. The agents are motivated to continually improve their models by creating or discovering more surprising, novel patterns, that is, data predictable or compressible in hitherto unknown ways. They actively invent experiments (algorithmic protocols or programs or action sequences) to explore their environment, always trying to learn new behaviours (policies) exhibiting previously unknown regularities or patterns. Crucial ingredients are:
1. An adaptive world model, essentially a predictor or compressor of the continually growing history of actions and sensory inputs, reflecting current knowledge about the world,
2. A learning algorithm that continually improves the model (detecting novel, initially surprising spatio-temporal patterns, including works of art, that subsequently become known patterns),
3. Intrinsic rewards measuring the model's improvements due to its learning algorithm (thus measuring the degree of subjective novelty & surprise),
4. A separate reward optimiser or reinforcement learner, which translates those rewards into action sequences or behaviours expected to optimise future reward.
These ingredients make the agents curious and creative: they get intrinsically motivated to acquire skills leading to a better model of the possible interactions with the world, discovering additional “eye-opening” novel patterns (including works of art) predictable or compressible in previously unknown ways.
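A minimal, self-contained sketch of how ingredients (1)-(4) might interact is given below. The toy two-action environment, the count-based predictor standing in for the world model, and the simple bandit standing in for a full reinforcement learner are assumptions made only for this illustration; they are not the implementations described in this chapter.

```python
# Illustrative sketch only: a toy curiosity-driven loop combining ingredients (1)-(4).
# The environment, the count-based predictor and the bandit policy are assumptions
# made for this example, not the published implementations cited in the text.
import math
import random
from collections import defaultdict

class CountingWorldModel:
    """(1) World model: predicts the next observation for each action by counting."""
    def __init__(self):
        self.counts = defaultdict(lambda: defaultdict(int))

    def loss(self, action, observation):
        # Negative log-likelihood of the observation (Laplace-smoothed, binary alphabet).
        total = sum(self.counts[action].values()) + 2
        return -math.log((self.counts[action][observation] + 1) / total)

    def train(self, action, observation):
        # (2) Learning algorithm: improve the model by updating the counts.
        self.counts[action][observation] += 1

class BanditPolicy:
    """(4) Reward optimiser: epsilon-greedy over running averages of intrinsic reward."""
    def __init__(self, actions, epsilon=0.1):
        self.actions, self.epsilon = actions, epsilon
        self.value = {a: 0.0 for a in actions}
        self.n = {a: 0 for a in actions}

    def act(self):
        if random.random() < self.epsilon:
            return random.choice(self.actions)
        return max(self.actions, key=lambda a: self.value[a])

    def update(self, action, reward):
        self.n[action] += 1
        self.value[action] += (reward - self.value[action]) / self.n[action]

def toy_environment(action):
    # Action 0 always yields 'A' (quickly learned); action 1 yields pure noise.
    return 'A' if action == 0 else random.choice('AB')

model, policy = CountingWorldModel(), BanditPolicy(actions=[0, 1])
for t in range(1000):
    action = policy.act()
    observation = toy_environment(action)
    loss_before = model.loss(action, observation)
    model.train(action, observation)
    loss_after = model.loss(action, observation)
    # (3) Intrinsic reward: the model's improvement (compression progress),
    # not the raw prediction error; unlearnable noise yields no lasting reward.
    policy.update(action, loss_before - loss_after)
```

Because the reward is the decrease in prediction loss rather than the loss itself, the noisy action stops being rewarding once the model's statistics stabilise, in line with the theory's emphasis on learnable, not merely unpredictable, patterns.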
Ignoring issues of computation time, it is possible to devise mathematically optimal, universal RL methods (Hutter 2005, Schmidhuber 2009d) for such systems (Schmidhuber 2006a; 2010) (2006-). However, previous practical implementations (Schmidhuber 1991a, Storck et al. 1995, Schmidhuber 2002a) were non-universal and made approximative assumptions. Among the many ways of combining methods for (1-4) we implemented the following variants:
A. Non-traditional RL based on adaptive recurrent neural networks as predictive world models is used to maximise intrinsic reward created in proportion to prediction error (Schmidhuber 1991b).
B. Traditional RL (Kaelbling et al. 1996) is used to maximise intrinsic reward created in proportion to improvements of prediction error (Schmidhuber 1991a).
C. Traditional RL maximises intrinsic reward created in proportion to relative entropies between the agent's priors and posteriors (Storck et al. 1995).
D. Non-traditional RL (Schmidhuber et al. 1997) (without restrictive Markovian assumptions) learns probabilistic, hierarchical programs and skills through zero-sum intrinsic reward games of two players, each trying to out-predict or surprise the other, taking into account the computational costs of learning, and learning when to learn and what to learn (1997-2002) (Schmidhuber 1999; 2002a).
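Glossing over many details of the cited papers, the intrinsic rewards driving variants A-C can be summarised roughly as below. The notation (a prediction error E(t) and the model's beliefs p, q about the next input before and after observing it) is introduced here only for this summary; variant D's zero-sum surprise game does not reduce to a single formula as neatly and is omitted.

```latex
% Rough summary of the intrinsic rewards in variants A--C (notation ours, not the
% original papers'). E(t): the world model's prediction error at time t;
% p, q: beliefs about the next input before and after the observation.
\begin{align*}
r^{\mathrm{A}}_{\mathrm{int}}(t) &\propto E(t)
  && \text{(A: reward the prediction error itself)}\\
r^{\mathrm{B}}_{\mathrm{int}}(t) &\propto E(t-1) - E(t)
  && \text{(B: reward the improvement of the prediction error)}\\
r^{\mathrm{C}}_{\mathrm{int}}(t) &\propto D_{\mathrm{KL}}\!\big(q \,\|\, p\big)
  && \text{(C: reward the relative entropy between posterior and prior)}
\end{align*}
```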