Interest shifted from trial-and-error learning to supervised learning. This began a pattern of confusion about the
relationship between these types of learning. Many researchers seemed to believe
that they were studying reinforcement learning when they were actually studying
supervised learning. For example, neural-network pioneers such as Rosenblatt
(1958) and Widrow and Hoff (1960) were clearly motivated by reinforcement
learning---they used the language of rewards and punishments---but the systems
they studied were supervised learning systems suitable for pattern recognition
and perceptual learning. Even today, researchers and textbooks often minimize or
blur the distinction between these types of learning. Some modern
neural-network textbooks use the term “trial-and-error” to describe networks that
learn from training examples because they use error information to update
connection weights. This is an understandable confusion, but it substantially
misses the essential selectional character of trial-and-error learning.
The term “optimal control” came into use in the late 1950s to describe the
problem of designing a controller to minimize a measure of a dynamical system's
behavior over time. One of the approaches to this problem was developed in the
mid-1950s by Richard Bellman and colleagues, who extended a nineteenth-century
theory of Hamilton and Jacobi. This approach uses the concept of a dynamical
system's state and a value function, or “optimal return function,” to define a
functional equation, now often called the Bellman equation. The class of methods
for solving optimal control problems by solving this equation came to be known
as dynamic programming (Bellman, 1957). Bellman also introduced the discrete
stochastic version of the optimal control problem known as Markovian decision
processes (MDPs), and Ron Howard devised the policy iteration method for
MDPs in 1960. All of these are essential elements underlying the theory and
algorithms of modern reinforcement learning.
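As a point of reference, the functional equation mentioned above is commonly written in its optimality form as follows; the notation (states $s$, actions $a$, transition probabilities $p(s'\mid s,a)$, rewards $r(s,a,s')$, and discount factor $\gamma$) is assumed here for the sketch rather than taken from this account:
\[
V^{*}(s) \;=\; \max_{a} \sum_{s'} p(s' \mid s, a)\,\bigl[\, r(s, a, s') + \gamma\, V^{*}(s') \,\bigr].
\]
Dynamic programming methods such as value iteration and Howard's policy iteration can be viewed as schemes for solving this equation by repeatedly turning it into an update rule.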
Finally, the temporal-difference and optimal control threads were fully
brought together in 1989 with Chris Watkins's development of Q-learning
(Watkins, 1989). This work extended and integrated prior work in all three
threads of reinforcement learning research. By the time of Watkins's work there
had been tremendous growth in reinforcement learning research, primarily in the
machine learning subfield of artificial intelligence, but also in neural networks
and artificial intelligence more broadly. In 1992, the remarkable success of Gerry
Tesauro's backgammon playing program, TD-Gammon (Tesauro, 1992), brought
additional attention to the field. Other important contributions made in the recent
history of reinforcement learning are too numerous to mention in this brief
account; we cite these at the end of the individual chapters in which they arise.
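For concreteness, the one-step Q-learning update is commonly written as below; the step-size $\alpha$ and the time-indexed notation for states, actions, and rewards are assumptions of this sketch, not details drawn from the account above:
\[
Q(s_t, a_t) \;\leftarrow\; Q(s_t, a_t) + \alpha \Bigl[\, r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \,\Bigr].
\]
The bracketed term is a temporal-difference error whose target is built from the Bellman optimality form, which is one way to see how Watkins's method joins the temporal-difference and optimal control threads.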