optimal play by our player. Furthermore, the moves then taken (except on
exploratory moves) are in fact the optimal moves against the opponent. In other
words, the method converges to an optimal policy for playing the game. If the
step-size parameter is not reduced all the way to zero over time, then this player
also plays well against opponents that slowly change their way of playing.
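To make the kind of value update being discussed concrete, the following is a minimal sketch in Python, not the actual program from the text. The names ALPHA, EPSILON, value_of, choose_move, backup, and the caller-supplied result_of function are illustrative assumptions. Keeping ALPHA fixed, rather than reducing it to zero, corresponds to the tracking behavior described above.

    import random

    ALPHA = 0.1     # step-size parameter; kept above zero so the player keeps
                    # adapting to an opponent that slowly changes its play
    EPSILON = 0.1   # fraction of moves that are exploratory (chosen at random)

    values = {}     # estimated probability of winning from each board position

    def value_of(state):
        # Positions not yet seen start with an intermediate estimate of 0.5;
        # terminal positions would be initialized to 1.0 (win) or 0.0 otherwise.
        return values.setdefault(state, 0.5)

    def choose_move(state, legal_moves, result_of):
        # Exploratory move with probability EPSILON, otherwise the greedy move,
        # i.e. the one leading to the highest-valued successor position.
        if random.random() < EPSILON:
            return random.choice(legal_moves), True
        best = max(legal_moves, key=lambda m: value_of(result_of(state, m)))
        return best, False

    def backup(state, next_state):
        # Move the earlier position's value a fraction ALPHA toward the value
        # of the greedily reached position: V(s) <- V(s) + ALPHA*(V(s') - V(s))
        values[state] = value_of(state) + ALPHA * (value_of(next_state) - value_of(state))

After each greedy move, backup is applied to the position from which the move was made; in the simple scheme described, no backup is made after exploratory moves.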
This example illustrates the differences between evolutionary methods and
methods that learn value functions. To evaluate a policy, an evolutionary method
must hold it fixed and play many games against the opponent, or simulate many
games using a model of the opponent. The frequency of wins gives an unbiased
estimate of the probability of winning with that policy, and can be used to direct
the next policy selection. But each policy change is made only after many games,
and only the final outcome of each game is used: what happens during the games
is ignored. For example, if the player wins, then all of its behavior in the game is
given credit, independently of how specific moves might have been critical to the
win. Credit is even given to moves that never occurred! Value function methods,
in contrast, allow individual states to be evaluated. In the end, both evolutionary
and value function methods search the space of policies, but learning a value
function takes advantage of information available during the course of play.
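For contrast, here is a hedged sketch of the evolutionary-style evaluation just described. The helper play_game is hypothetical: it is assumed to play one complete game against the opponent with the given policy held fixed and to return True on a win; the function names and the choice of num_games are illustrative assumptions only.

    def estimate_win_rate(policy, play_game, num_games=1000):
        # Hold the policy fixed and score it only by its frequency of wins over
        # many complete games; nothing that happens inside a game is used.
        wins = sum(1 for _ in range(num_games) if play_game(policy))
        return wins / num_games

    def select_better(policy_a, policy_b, play_game, num_games=1000):
        # Use the win-rate estimates to direct the next policy selection,
        # here simply by keeping the better of two candidate policies.
        if estimate_win_rate(policy_a, play_game, num_games) >= \
                estimate_win_rate(policy_b, play_game, num_games):
            return policy_a
        return policy_b

Note that the win-rate estimate credits every move made under the winning policy equally, which is exactly the coarseness of credit assignment that per-state value estimates avoid.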
This simple example illustrates some of the key features of reinforcement
learning methods. First, there is the emphasis on learning while interacting with
an environment, in this case with an opponent player. Second, there is a clear
goal, and correct behavior requires planning or foresight that takes into account
delayed effects of one's choices. For example, the simple reinforcement learning
player would learn to set up multi-move traps for a shortsighted opponent. It is a
striking feature of the reinforcement learning solution that it can achieve the
effects of planning and looking ahead without using a model of the opponent or
conducting an explicit search over possible sequences of future states and
actions.
While this example illustrates some of the key features of reinforcement
learning, it is so simple that it might give the impression that reinforcement
learning is more limited than it really is. Although tic-tac-toe is a two-person
game, reinforcement learning also applies in the case in which there is no
external adversary, that is, in the case of a "game against nature." Reinforcement
learning also is not restricted to problems in which behavior breaks down into
separate episodes, like the separate games of tic-tac-toe, with reward only at the
end of each episode. It is just as applicable when behavior continues indefinitely
and when rewards of various magnitudes can be received at any time.