In our experiments, we further use a set of features representing large tiles, namely v-tiles with v ≥ 2048. These features indicate the difficulty caused by large tiles. Our n-tuple network outperforms the one used in [17] in terms of both average and maximum scores, as shown in Fig. 5 and Fig. 6 respectively. In these figures, the number of training games is up to 2 million, and the average/maximum scores on the y-axis are sampled every 1000 games. For simplicity of analysis, we use this n-tuple network in the rest of this paper.
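As a rough illustration, the snippet below shows one way such a large-tile feature could be computed, assuming the board is stored as a 4x4 grid of tile exponents (11 = 2048, 12 = 4096, and so on); the function name and encoding are assumptions made for this sketch, not the paper's implementation.

```python
# Illustrative large-tile feature: counts tiles of value 2048 or larger.
# Board cells hold tile exponents (0 = empty, 11 = 2048, 12 = 4096, ...).

def large_tile_feature(board, min_exponent=11):
    """Return the number of tiles whose value is at least 2048."""
    return sum(1 for row in board for cell in row if cell >= min_exponent)

# Example board whose largest tiles are 4096 (exponent 12) and 2048 (exponent 11).
board = [
    [12, 11, 5, 3],
    [ 9,  7, 4, 2],
    [ 6,  5, 3, 1],
    [ 2,  1, 1, 0],
]
print(large_tile_feature(board))  # -> 2
```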
3.2 MS-TD Learning
From the above, TD learning is intrinsically designed to train the player to obtain average (or expected) scores as high as possible, and the experiments also demonstrate this. However, TD learning does not necessarily lead to other criteria, such as high maximum scores, large tiles, and high reaching ratios of 32768-tiles, though it often does.
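For context, the sketch below shows the kind of one-step TD(0) update behind this behavior, written against a plain state-value table; the actual work trains n-tuple feature weights rather than a table, and the names and learning rate here are illustrative assumptions.

```python
# One-step TD(0) update against a plain state-value table (illustrative only;
# the actual training updates n-tuple feature weights, not a table).

def td0_update(value, state, reward, next_state, alpha=0.1):
    """Move V(state) toward the one-step target: reward + V(next_state)."""
    v_s = value.get(state, 0.0)
    target = reward + value.get(next_state, 0.0)
    value[state] = v_s + alpha * (target - v_s)   # alpha is an arbitrary learning rate
    return value

# Example: one update after a move that gained 4 points.
V = {}
V = td0_update(V, state="s0", reward=4, next_state="s1")
print(V["s0"])  # a small step toward the expected future score
```

Because the reward is the score gained by each move, the learned value estimates the expected remaining score, which is why plain TD learning naturally optimizes the average score rather than the maximum score or the largest tile reached.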
From our experience with 2048, we observed from Fig. 6 that it is hard to reach 32768-tiles or to raise the reaching ratio of 32768-tiles. However, for most players, obtaining high maximum scores and the largest tiles is a kind of achievement, and it was also one of the criteria of the 2048-bot tournament [18].
In order to solve this issue, we propose MS-TD learning for 2048-like games. In this
method, we divide the learning process into multiple stages. The technique of using
multiple stages has also been used to evaluate game states for Othello [4].
In our experiment, we divided the process into three stages separated by two important splitting times in a game, referred to below as the first and second splitting times. The first splitting time is the first time a 16384-tile is created on the board in a game, and the second splitting time is the first time both a 16384-tile and an 8192-tile are present. The learning process with three stages is described as follows (a sketch of this flow is given after the list).
1. In the first stage, use TD learning to train the feature weights from the beginning of the game until the value function (the expected score) saturates. At the same time, collect all the boards reached at the first splitting time in the training games. The set of trained feature weights is then saved and called the Stage-1 feature weights.
2. In the second stage, use TD learning to train another set of feature weights starting from the boards collected in the first stage. At the same time, collect all the boards reached at the second splitting time in the training games. Again, the set of trained feature weights is saved and called the Stage-2 feature weights.
3. In the third stage, use TD learning to train another set of feature weights starting from the boards collected in the second stage. Again, the set of trained feature weights is saved and called the Stage-3 feature weights.
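The following is a minimal sketch of how these three stages could be wired together. The TD trainer itself is stubbed out, the board is again assumed to be a 4x4 grid of tile exponents (14 = 16384, 13 = 8192), and all function names are illustrative rather than the paper's code.

```python
# Skeleton of the three-stage training flow (illustrative names; the TD trainer is a stub).

def has_16384(board):
    """First splitting time: a 16384-tile (exponent 14) has been created."""
    return any(cell >= 14 for row in board for cell in row)

def has_16384_and_8192(board):
    """Second splitting time: both a 16384-tile and an 8192-tile are on the board."""
    cells = [cell for row in board for cell in row]
    return any(c >= 14 for c in cells) and any(c == 13 for c in cells)

def train_td(start_boards, reached_split):
    """Placeholder for TD learning: would return trained feature weights and the
    boards collected whenever reached_split first becomes true in a training game."""
    weights, collected = {}, []
    return weights, collected

# Stage 1: train from the start of the game; collect boards at the first splitting time.
stage1_weights, split1_boards = train_td(start_boards=None, reached_split=has_16384)
# Stage 2: train a fresh set of weights from those boards; collect boards at the second splitting time.
stage2_weights, split2_boards = train_td(start_boards=split1_boards, reached_split=has_16384_and_8192)
# Stage 3: train a third set of weights from the stage-2 boards.
stage3_weights, _ = train_td(start_boards=split2_boards, reached_split=None)
```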
When playing a game, we also divide it into three stages in the following way (a sketch of this stage selection follows the list).
1. Before the first splitting time, use the Stage-1 feature weights to play.
2. After the first splitting time and before the second, use the Stage-2 feature weights to play.
3. After the second splitting time, use the Stage-3 feature weights to play.
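One possible way to implement this stage selection during play is sketched below, again assuming the exponent-encoded board from the earlier sketches; the stage only advances once a splitting time has occurred and never moves backwards. The helper names are assumptions, not the paper's code.

```python
def update_stage(stage, board):
    """Advance the playing stage once a splitting time has occurred (never regress)."""
    cells = [cell for row in board for cell in row]
    if stage < 2 and any(c >= 14 for c in cells):                                   # 16384-tile created
        stage = 2
    if stage < 3 and any(c >= 14 for c in cells) and any(c == 13 for c in cells):   # plus an 8192-tile
        stage = 3
    return stage

def pick_weights(stage, stage_weights):
    """stage_weights maps a stage number (1, 2, 3) to its trained feature weights."""
    return stage_weights[stage]
```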
The idea behind using more stages is to make learning more accurate for the actions taken during the second or third stage, based on the following observation. The trained feature weights learned in the first stage (the same as in the original TD learning) tend