optimal play by our player. Furthermore, the moves then taken (except on
exploratory moves) are in fact the optimal moves against the opponent. In other
words, the method converges to an optimal policy for playing the game. If the
step-size parameter is not reduced all the way to zero over time, then this player
also plays well against opponents that slowly change their way of playing.
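To make the kind of value update being discussed concrete, the following is a minimal sketch in Python, not the actual program from the text. The names ALPHA, EPSILON, value_of, choose_move, backup, and the caller-supplied result_of function are illustrative assumptions. Keeping ALPHA fixed, rather than reducing it to zero, corresponds to the tracking behavior described above.

    import random

    ALPHA = 0.1     # step-size parameter; kept above zero so the player keeps
                    # adapting to an opponent that slowly changes its play
    EPSILON = 0.1   # fraction of moves that are exploratory (chosen at random)

    values = {}     # estimated probability of winning from each board position

    def value_of(state):
        # Positions not yet seen start with an intermediate estimate of 0.5;
        # terminal positions would be initialized to 1.0 (win) or 0.0 otherwise.
        return values.setdefault(state, 0.5)

    def choose_move(state, legal_moves, result_of):
        # Exploratory move with probability EPSILON, otherwise the greedy move,
        # i.e. the one leading to the highest-valued successor position.
        if random.random() < EPSILON:
            return random.choice(legal_moves), True
        best = max(legal_moves, key=lambda m: value_of(result_of(state, m)))
        return best, False

    def backup(state, next_state):
        # Move the earlier position's value a fraction ALPHA toward the value
        # of the greedily reached position: V(s) <- V(s) + ALPHA*(V(s') - V(s))
        values[state] = value_of(state) + ALPHA * (value_of(next_state) - value_of(state))

After each greedy move, backup is applied to the position from which the move was made; in the simple scheme described, no backup is made after exploratory moves.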
This example illustrates the differences between evolutionary methods and
methods that learn value functions. To evaluate a policy, an evolutionary method
must hold it fixed and play many games against the opponent, or simulate many
games using a model of the opponent. The frequency of wins gives an unbiased
estimate of the probability of winning with that policy, and can be used to direct
the next policy selection. But each policy change is made only after many games,
and only the final outcome of each game is used: what happens during the games
is ignored. For example, if the player wins, then all of its behavior in the game is
given credit, independently of how specific moves might have been critical to the
win. Credit is even given to moves that never occurred! Value function methods,
in contrast, allow individual states to be evaluated. In the end, both evolutionary
and value function methods search the space of policies, but learning a value
function takes advantage of information available during the course of play.
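For contrast, here is a hedged sketch of the evolutionary-style evaluation just described. The helper play_game is hypothetical: it is assumed to play one complete game against the opponent with the given policy held fixed and to return True on a win; the function names and the choice of num_games are illustrative assumptions only.

    def estimate_win_rate(policy, play_game, num_games=1000):
        # Hold the policy fixed and score it only by its frequency of wins over
        # many complete games; nothing that happens inside a game is used.
        wins = sum(1 for _ in range(num_games) if play_game(policy))
        return wins / num_games

    def select_better(policy_a, policy_b, play_game, num_games=1000):
        # Use the win-rate estimates to direct the next policy selection,
        # here simply by keeping the better of two candidate policies.
        if estimate_win_rate(policy_a, play_game, num_games) >= \
                estimate_win_rate(policy_b, play_game, num_games):
            return policy_a
        return policy_b

Note that the win-rate estimate credits every move made under the winning policy equally, which is exactly the coarseness of credit assignment that per-state value estimates avoid.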
This simple example illustrates some of the key features of reinforcement
learning methods. First, there is the emphasis on learning while interacting with
an environment, in this case with an opponent player. Second, there is a clear
goal, and correct behavior requires planning or foresight that takes into account
delayed effects of one's choices. For example, the simple reinforcement learning
player would learn to set up multi-move traps for a shortsighted opponent. It is a
striking feature of the reinforcement learning solution that it can achieve the
effects of planning and looking ahead without using a model of the opponent or
conducting an explicit search over possible sequences of future states and
actions.
While this example illustrates some of the key features of reinforcement
learning, it is so simple that it might give the impression that reinforcement
learning is more limited than it really is. Although tic-tac-toe is a two-person
game, reinforcement learning also applies in the case in which there is no
external adversary, that is, in the case of a "game against nature." Reinforcement
learning also is not restricted to problems in which behavior breaks down into
separate episodes, like the separate games of tic-tac-toe, with reward only at the
end of each episode. It is just as applicable when behavior continues indefinitely
and when rewards of various magnitudes can be received at any time.