Reinforcement Learning - Advanced Artificial Intelligence

Information Technology Reference

In-Depth Information

for each possible state of the game. Each number will be the latest estimate of the

probability of our winning from that state. We treat this estimate as the state's

value, and the whole table is the learned value function. State A has higher value

than state B, or is considered "better" than state B, if the current estimate of the

probability of our winning from A is higher than it is from B. Assuming we

always play ◇ s, then for all states with three ◇ s in a row the probability of

winning is 1, because we have already won. Similarly, for all states with three Os

in a row, or that are "filled up," the correct probability is 0, as we cannot win

from them. We set the initial values of all the other states to 0.5, representing a

guess that we have a 50% chance of winning.

We play many games against the opponent. To select our moves we examine

the states that would result from each of our possible moves (one for each blank

space on the board) and look up their current values in the table. Most of the time,

we move greedily, selecting the move that leads to the state with the greatest

value, that is, with the highest estimated probability of winning. Occasionally,

however, we select randomly from among the other moves instead. These are

called exploratory moves because they make us to experience states that we

might otherwise never see. A sequence of moves made and considered during a

game can be diagrammed as in Figure 10.6.

While we are playing, we change the values of the states in which we find

ourselves during the game. We attempt to make more accurate estimates of the

probabilities of winning. To do this, we “back up” the value of the state after

each greedy move to the state before the move, as suggested by the arrows in

Figure 10.6. More precisely, the current value of the earlier state is adjusted to be

closer to the value of the later state. This can be done by moving the earlier

state's value a fraction of the way toward the value of the later state. If we let s n

denote the state before the greedy move, and s n+1 the state after the move, then

the update to the estimated value of s, denoted V(s), can be written as,

V s

(

)

=

S s

(

+

c V s

(

)

−

V s

(

))

(10.14)

n

+

1

n

where

is a small positive fraction called the step-size parameter, which

influences the rate of learning. This update rule is an example of a

temporal-difference learning method, so called because its changes are based on

a difference, V(

c

s n ) , between estimates at two different times.

The method described above performs quite well on this task. For example, if

the step-size parameter is reduced properly over time, this method converges, for

any fixed opponent, to the true probabilities of winning from each state given

s n+ 1 )-V(

Advanced Artificial Intelligence

Search WWH ::

Custom Search

Home