Interest shifted from trial-and-error learning to supervised learning. This began a pattern of confusion about the
relationship between these types of learning. Many researchers seemed to believe
that they were studying reinforcement learning when they were actually studying
supervised learning. For example, neural-network pioneers such as Rosenblatt
(1958) and Widrow and Hoff (1960) were clearly motivated by reinforcement
learning---they used the language of rewards and punishments---but the systems
they studied were supervised learning systems suitable for pattern recognition
and perceptual learning. Even today, researchers and textbooks often minimize or
blur the distinction between these types of learning. Some modern
neural-network textbooks use the term “trial-and-error” to describe networks that
learn from training examples because they use error information to update
connection weights. This is an understandable confusion, but it substantially
misses the essential selectional character of trial-and-error learning.
The term “optimal control” came into use in the late 1950s to describe the
problem of designing a controller to minimize a measure of a dynamical system's
behavior over time. One of the approaches to this problem was developed in the
mid-1950s by Richard Bellman and colleagues, who extended a nineteenth-century
theory of Hamilton and Jacobi. This approach uses the concept of a dynamical
system's state and a value function, or “optimal return function,” to define a
functional equation, now often called the Bellman equation. The class of methods
for solving optimal control problems by solving this equation came to be known
as dynamic programming (Bellman, 1957). Bellman also introduced the discrete
stochastic version of the optimal control problem known as Markovian decision
processes (MDPs), and Ron Howard devised the policy iteration method for
MDPs in 1960. All of these are essential elements underlying the theory and
algorithms of modern reinforcement learning.
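As a point of reference, the functional equation mentioned above is commonly written in its optimality form as follows; the notation (states $s$, actions $a$, transition probabilities $p(s'\mid s,a)$, rewards $r(s,a,s')$, and discount factor $\gamma$) is assumed here for the sketch rather than taken from this account:
\[
V^{*}(s) \;=\; \max_{a} \sum_{s'} p(s' \mid s, a)\,\bigl[\, r(s, a, s') + \gamma\, V^{*}(s') \,\bigr].
\]
Dynamic programming methods such as value iteration and Howard's policy iteration can be viewed as schemes for solving this equation by repeatedly turning it into an update rule.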
Finally, the temporal-difference and optimal control threads were fully
brought together in 1989 with Chris Watkins's development of Q-learning
(Watkins, 1989). This work extended and integrated prior work in all three
threads of reinforcement learning research. By the time of Watkins's work there
had been tremendous growth in reinforcement learning research, primarily in the
machine learning subfield of artificial intelligence, but also in neural networks
and artificial intelligence more broadly. In 1992, the remarkable success of Gerry
Tesauro's backgammon playing program, TD-Gammon (Tesauro, 1992), brought
additional attention to the field. Other important contributions made in the recent
history of reinforcement learning are too numerous to mention in this brief
account; we cite these at the end of the individual chapters in which they arise.
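For concreteness, the one-step Q-learning update is commonly written as below; the step-size $\alpha$ and the time-indexed notation for states, actions, and rewards are assumptions of this sketch, not details drawn from the account above:
\[
Q(s_t, a_t) \;\leftarrow\; Q(s_t, a_t) + \alpha \Bigl[\, r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \,\Bigr].
\]
The bracketed term is a temporal-difference error whose target is built from the Bellman optimality form, which is one way to see how Watkins's method joins the temporal-difference and optimal control threads.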