Consider the example shown in Fig. 2. For the left chance node of the root, its value is derived by calculating the expected value of its two child nodes: 0.9 * 10 + 0.1 * 30 = 12. For the right chance node of the root, its value is 11, which can be calculated in the same way. Thus, the root chooses the left chance node, and its value is 12.
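The following sketch illustrates this expected-value computation at chance nodes. The probabilities and child values of the left chance node are those of the example above; the children assumed for the right chance node are only illustrative values chosen to produce 11.

```python
# Sketch of the expectimax computation in Fig. 2. Node values and
# probabilities are from the example; the right node's children are assumed.

def chance_value(children):
    """Expected value over (probability, value) pairs at a chance node."""
    return sum(p * v for p, v in children)

def max_value(chance_nodes):
    """A max (player) node takes the largest value among its chance children."""
    return max(chance_value(c) for c in chance_nodes)

left  = [(0.9, 10), (0.1, 30)]   # 0.9 * 10 + 0.1 * 30 = 12
right = [(0.9, 10), (0.1, 20)]   # assumed children giving 0.9 * 10 + 0.1 * 20 = 11
print(chance_value(left), chance_value(right), max_value([left, right]))
# -> 12.0 11.0 12.0
```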
Several features have been used in heuristic functions of 2048-bot programs [10][21], and three of them are listed as follows. The first is the monotonicity of a board: high-ranked players tend to play 2048 with tiles arranged in decreasing order, such as those described in [15]. The second is the number of empty tiles on the board: the more empty tiles there are, the less likely the game is to end within a few moves. The third is the number of mergeable tiles, which measures the ability to create empty tiles by merging tiles of the same value.
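As a rough sketch (not the authors' exact features), the three quantities above can be computed from a 4x4 board of tile values as follows; the simple row/column count used for monotonicity here is only one of several plausible ways to measure how decreasingly arranged a board is.

```python
# Rough sketches of the three board features on a 4x4 board (0 = empty cell).

def monotonicity(board):
    """Count rows and columns whose non-empty tiles are non-increasing."""
    lines = [row for row in board] + [list(col) for col in zip(*board)]
    score = 0
    for line in lines:
        tiles = [v for v in line if v != 0]
        if all(a >= b for a, b in zip(tiles, tiles[1:])):
            score += 1
    return score

def empty_tiles(board):
    """Number of empty cells on the board."""
    return sum(v == 0 for row in board for v in row)

def mergeable_tiles(board):
    """Number of horizontally or vertically adjacent pairs of equal tiles."""
    count = 0
    for r in range(4):
        for c in range(4):
            v = board[r][c]
            if v == 0:
                continue
            if c + 1 < 4 and board[r][c + 1] == v:
                count += 1
            if r + 1 < 4 and board[r + 1][c] == v:
                count += 1
    return count

board = [[32, 16, 8, 4],
         [16,  8, 4, 2],
         [ 2,  2, 0, 0],
         [ 0,  0, 0, 0]]
print(monotonicity(board), empty_tiles(board), mergeable_tiles(board))  # 8 6 1
```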
For game tree search, the transposition table is an important technique for speed-up. One common implementation is based on Z-hashing [23]. In Z-hashing, for each cell position, each possible tile value is associated with a unique random number as a key. When looking up the table for 2048, the hash value of a board is calculated by taking the exclusive-or of the 16 keys selected by the tiles on its 16 cells.
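A minimal sketch of such a hashing scheme is given below, assuming 16 cells and tile exponents 0 through 15 (0 for an empty cell); the 64-bit key table and its dimensions are illustrative choices, not those of any particular 2048 program.

```python
import random

# Sketch of Z-hashing (Zobrist-style) for a 2048 board: one random key per
# (cell, tile value) pair; a board's hash is the XOR of 16 selected keys.
random.seed(0)
KEYS = [[random.getrandbits(64) for _ in range(16)] for _ in range(16)]

def zobrist_hash(board):
    """XOR the 16 keys selected by the tile on each of the 16 cells."""
    h = 0
    for cell in range(16):
        exponent = board[cell]        # 0 for empty, k for a tile of value 2^k
        h ^= KEYS[cell][exponent]
    return h

# A board stored as 16 tile exponents in row-major order.
board = [5, 4, 3, 2,
         4, 3, 2, 1,
         1, 1, 0, 0,
         0, 0, 0, 0]
print(hex(zobrist_hash(board)))
```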
2.3 Temporal Difference (TD) Learning
Reinforcement learning is an important technique for training an agent to learn how to respond to a given environment [16]. The Markov decision process (MDP) is a model commonly used in reinforcement learning; it models problems in which an agent interacts with the given environment through a sequence of actions, according to the changes of state and the rewards, if any. In terms of MDP, an AI program for 2048-like games can thus be regarded as such an agent, which makes actions (legal moves) on board states and receives points as rewards.
Temporal difference (TD) learning [16][22], a kind of reinforcement learning, is a model-free method for adjusting state values based on subsequent evaluations. This method has been applied to computer games such as Backgammon [19], Checkers [13], Chess [2], Shogi [3], Go [14], Connect6 [22] and Chinese Chess [20]. TD learning has been demonstrated to improve world-class game-playing programs in two cases, TD-Gammon [19] and Chinook [14]. Since 2048-like games can be easily modeled as MDPs, TD learning can be naturally applied to AI programs for 2048-like games.
In TD(0), the value function $V(s)$ is used to approximate the expected return of a state $s$. The error between states $s_t$ and $s_{t+1}$ is $r_{t+1} + V(s_{t+1}) - V(s_t)$, where $r_{t+1}$ is the reward at turn $t+1$. The value of $V(s_t)$ in TD(0) is expected to be adjusted by the following value difference $\Delta V(s_t)$,

$$\Delta V(s_t) = \alpha \left( r_{t+1} + V(s_{t+1}) - V(s_t) \right), \qquad (1)$$

where $\alpha$ is a step-size parameter to control the learning rate. For general TD($\lambda$) (also see [22]), the value difference is

$$\Delta V(s_t) = \alpha \sum_{k \ge t} \lambda^{\,k-t} \left( r_{k+1} + V(s_{k+1}) - V(s_k) \right). \qquad (2)$$
In this paper, only TD(0) is investigated.
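As a minimal sketch of the update in Eq. (1), the following tabular TD(0) update adjusts V(s_t) toward r_{t+1} + V(s_{t+1}); the step size and the dictionary-based value table are illustrative choices, not the representation used by actual 2048 programs.

```python
from collections import defaultdict

# Sketch of the TD(0) update in Eq. (1) with a tabular value function.
ALPHA = 0.1                      # step-size parameter (alpha in Eq. (1))
V = defaultdict(float)           # V(s), 0 for states not seen yet

def td0_update(s_t, r_next, s_next, terminal=False):
    """Apply V(s_t) <- V(s_t) + alpha * (r_{t+1} + V(s_{t+1}) - V(s_t))."""
    target = r_next + (0.0 if terminal else V[s_next])
    V[s_t] += ALPHA * (target - V[s_t])

# One step of an episode: acting in state s_0 yielded reward 4 and led to s_1.
td0_update(s_t=("board", 0), r_next=4.0, s_next=("board", 1))
print(V[("board", 0)])           # 0.4 after the first update
```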