Consider the example shown in Fig. 2. For the left chance node of the root, its value is derived by calculating the expected value of its two child nodes: 0.9 * 10 + 0.1 * 30 = 12. For the right chance node of the root, its value is 11, which can be calculated in the same way. Thus, the root chooses the left chance node, and its value is 12.
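The following sketch illustrates this expected-value computation at chance nodes. The probabilities and child values of the left chance node are those of the example above; the children assumed for the right chance node are only illustrative values chosen to produce 11.

```python
# Sketch of the expectimax computation in Fig. 2. Node values and
# probabilities are from the example; the right node's children are assumed.

def chance_value(children):
    """Expected value over (probability, value) pairs at a chance node."""
    return sum(p * v for p, v in children)

def max_value(chance_nodes):
    """A max (player) node takes the largest value among its chance children."""
    return max(chance_value(c) for c in chance_nodes)

left  = [(0.9, 10), (0.1, 30)]   # 0.9 * 10 + 0.1 * 30 = 12
right = [(0.9, 10), (0.1, 20)]   # assumed children giving 0.9 * 10 + 0.1 * 20 = 11
print(chance_value(left), chance_value(right), max_value([left, right]))
# -> 12.0 11.0 12.0
```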
Several features have been used in heuristic functions of 2048-bot programs [10][21], and three of them are listed as follows. The first is the monotonicity of a board: high-ranked players tend to play 2048 with tiles arranged in decreasing order, such as those described in [15]. The second is the number of empty tiles on the board: the more empty tiles there are, the less likely the game is to end within a few moves. The third is the number of mergeable tiles, which measures the ability to create empty tiles by merging tiles of the same value.
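As a rough sketch (not the authors' exact features), the three quantities above can be computed from a 4x4 board of tile values as follows; the simple row/column count used for monotonicity here is only one of several plausible ways to measure how decreasingly arranged a board is.

```python
# Rough sketches of the three board features on a 4x4 board (0 = empty cell).

def monotonicity(board):
    """Count rows and columns whose non-empty tiles are non-increasing."""
    lines = [row for row in board] + [list(col) for col in zip(*board)]
    score = 0
    for line in lines:
        tiles = [v for v in line if v != 0]
        if all(a >= b for a, b in zip(tiles, tiles[1:])):
            score += 1
    return score

def empty_tiles(board):
    """Number of empty cells on the board."""
    return sum(v == 0 for row in board for v in row)

def mergeable_tiles(board):
    """Number of horizontally or vertically adjacent pairs of equal tiles."""
    count = 0
    for r in range(4):
        for c in range(4):
            v = board[r][c]
            if v == 0:
                continue
            if c + 1 < 4 and board[r][c + 1] == v:
                count += 1
            if r + 1 < 4 and board[r + 1][c] == v:
                count += 1
    return count

board = [[32, 16, 8, 4],
         [16,  8, 4, 2],
         [ 2,  2, 0, 0],
         [ 0,  0, 0, 0]]
print(monotonicity(board), empty_tiles(board), mergeable_tiles(board))  # 8 6 1
```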
For game tree search, the transposition table is an important technique for speed-up. One common implementation is based on Z-hashing [23]. In Z-hashing, for each cell position, each possible tile value is associated with a unique random number as a key. When looking up the table for 2048, the hash value of a board is calculated by taking the exclusive-or of the 16 keys selected by the tiles on its 16 cells.
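A minimal sketch of such a hashing scheme is given below, assuming 16 cells and tile exponents 0 through 15 (0 for an empty cell); the 64-bit key table and its dimensions are illustrative choices, not those of any particular 2048 program.

```python
import random

# Sketch of Z-hashing (Zobrist-style) for a 2048 board: one random key per
# (cell, tile value) pair; a board's hash is the XOR of 16 selected keys.
random.seed(0)
KEYS = [[random.getrandbits(64) for _ in range(16)] for _ in range(16)]

def zobrist_hash(board):
    """XOR the 16 keys selected by the tile on each of the 16 cells."""
    h = 0
    for cell in range(16):
        exponent = board[cell]        # 0 for empty, k for a tile of value 2^k
        h ^= KEYS[cell][exponent]
    return h

# A board stored as 16 tile exponents in row-major order.
board = [5, 4, 3, 2,
         4, 3, 2, 1,
         1, 1, 0, 0,
         0, 0, 0, 0]
print(hex(zobrist_hash(board)))
```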
2.3 Temporal Difference (TD) Learning
Reinforcement learning is an important technique for training an agent to learn how to respond to a given environment [16]. The Markov decision process (MDP) is a model commonly used in reinforcement learning; it models problems in which an agent interacts with the given environment through a sequence of actions, according to the changes of state and the rewards, if any. In terms of MDP, an AI program for 2048-like games can thus be regarded as such an agent, which makes actions (legal moves) on board states and receives points as rewards.
Temporal difference (TD) learning [16][22], a kind of reinforcement learning, is a model-free method for adjusting state values based on subsequent evaluations. This method has been applied to computer games such as Backgammon [19], Checkers [13], Chess [2], Shogi [3], Go [14], Connect6 [22] and Chinese Chess [20]. TD learning has been demonstrated to improve world-class game-playing programs in two cases, TD-Gammon [19] and Chinook [14]. Since 2048-like games can be easily modeled as MDPs, TD learning can be naturally applied to AI programs for 2048-like games.
In TD(0), the value function $V(s)$ is used to approximate the expected return of a state $s$. The error between states $s_t$ and $s_{t+1}$ is $r_{t+1} + V(s_{t+1}) - V(s_t)$, where $r_{t+1}$ is the reward at turn $t+1$. The value of $V(s_t)$ in TD(0) is expected to be adjusted by the following value difference $\Delta V(s_t)$,

$$\Delta V(s_t) = \alpha \left( r_{t+1} + V(s_{t+1}) - V(s_t) \right), \qquad (1)$$

where $\alpha$ is a step-size parameter to control the learning rate. For general TD($\lambda$) (also see [22]), the value difference is

$$\Delta V(s_t) = \alpha \sum_{k \ge t} \lambda^{\,k-t} \left( r_{k+1} + V(s_{k+1}) - V(s_k) \right). \qquad (2)$$
In this paper, only TD(0) is investigated.
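As a minimal sketch of the update in Eq. (1), the following tabular TD(0) update adjusts V(s_t) toward r_{t+1} + V(s_{t+1}); the step size and the dictionary-based value table are illustrative choices, not the representation used by actual 2048 programs.

```python
from collections import defaultdict

# Sketch of the TD(0) update in Eq. (1) with a tabular value function.
ALPHA = 0.1                      # step-size parameter (alpha in Eq. (1))
V = defaultdict(float)           # V(s), 0 for states not seen yet

def td0_update(s_t, r_next, s_next, terminal=False):
    """Apply V(s_t) <- V(s_t) + alpha * (r_{t+1} + V(s_{t+1}) - V(s_t))."""
    target = r_next + (0.0 if terminal else V[s_next])
    V[s_t] += ALPHA * (target - V[s_t])

# One step of an episode: acting in state s_0 yielded reward 4 and led to s_1.
td0_update(s_t=("board", 0), r_next=4.0, s_next=("board", 1))
print(V[("board", 0)])           # 0.4 after the first update
```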