In our experiments, we further use a set of features representing large tiles, namely v-tiles with v ≥ 2048. These features indicate the difficulty caused by large tiles. Our n-tuple network outperforms the one used in [17] in terms of both average and maximum scores, as shown in Fig. 5 and Fig. 6 respectively. In these figures, the number of training games is up to 2 million, and the average/maximum scores on the y-axis are sampled every 1000 games. For simplicity of analysis, we use this n-tuple network in the rest of this paper.
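As a rough illustration, the snippet below shows one way such a large-tile feature could be computed, assuming the board is stored as a 4x4 grid of tile exponents (11 = 2048, 12 = 4096, and so on); the function name and encoding are assumptions made for this sketch, not the paper's implementation.

```python
# Illustrative large-tile feature: counts tiles of value 2048 or larger.
# Board cells hold tile exponents (0 = empty, 11 = 2048, 12 = 4096, ...).

def large_tile_feature(board, min_exponent=11):
    """Return the number of tiles whose value is at least 2048."""
    return sum(1 for row in board for cell in row if cell >= min_exponent)

# Example board whose largest tiles are 4096 (exponent 12) and 2048 (exponent 11).
board = [
    [12, 11, 5, 3],
    [ 9,  7, 4, 2],
    [ 6,  5, 3, 1],
    [ 2,  1, 1, 0],
]
print(large_tile_feature(board))  # -> 2
```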
3.2 MS-TD Learning
From the above, TD learning is intrinsically designed to train the player to obtain average (or expected) scores as high as possible, and the experiments also demonstrate this. However, TD learning does not necessarily lead to other criteria, such as high maximum scores, large tiles, and high reaching ratios of 32768-tiles, though it often does.
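For context, the sketch below shows the kind of one-step TD(0) update behind this behavior, written against a plain state-value table; the actual work trains n-tuple feature weights rather than a table, and the names and learning rate here are illustrative assumptions.

```python
# One-step TD(0) update against a plain state-value table (illustrative only;
# the actual training updates n-tuple feature weights, not a table).

def td0_update(value, state, reward, next_state, alpha=0.1):
    """Move V(state) toward the one-step target: reward + V(next_state)."""
    v_s = value.get(state, 0.0)
    target = reward + value.get(next_state, 0.0)
    value[state] = v_s + alpha * (target - v_s)   # alpha is an arbitrary learning rate
    return value

# Example: one update after a move that gained 4 points.
V = {}
V = td0_update(V, state="s0", reward=4, next_state="s1")
print(V["s0"])  # a small step toward the expected future score
```

Because the reward is the score gained by each move, the learned value estimates the expected remaining score, which is why plain TD learning naturally optimizes the average score rather than the maximum score or the largest tile reached.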
From our experience with 2048, we observed from Fig. 6 that it is hard to reach 32768-tiles or to raise the reaching ratio of 32768-tiles. However, for most players, obtaining high maximum scores and the largest tiles is a kind of achievement, and it was also one of the criteria of the 2048-bot tournament [18].
In order to solve this issue, we propose MS-TD learning for 2048-like games. In this
method, we divide the learning process into multiple stages. The technique of using
multiple stages has also been used to evaluate game states for Othello [4].
In our experiment, we divided the process into three stages separated by two important splitting times in a game, referred to below as the first and second splitting times. The first splitting time is the first time a 16384-tile is created on the board in a game, and the second splitting time is the first time both a 16384-tile and an 8192-tile are present. The learning process with three stages is described as follows (a sketch of this flow is given after the list).
1. In the first stage, use TD learning to train the feature weights from the beginning of the game until the value function (the expected score) saturates. At the same time, collect all the boards reached at the first splitting time in the training games. The set of trained feature weights is then saved and called the Stage-1 feature weights.
2. In the second stage, use TD learning to train another set of feature weights starting from the boards collected in the first stage. At the same time, collect all the boards reached at the second splitting time in the training games. Again, the set of trained feature weights is saved and called the Stage-2 feature weights.
3. In the third stage, use TD learning to train another set of feature weights starting from the boards collected in the second stage. Again, the set of trained feature weights is saved and called the Stage-3 feature weights.
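The following is a minimal sketch of how these three stages could be wired together. The TD trainer itself is stubbed out, the board is again assumed to be a 4x4 grid of tile exponents (14 = 16384, 13 = 8192), and all function names are illustrative rather than the paper's code.

```python
# Skeleton of the three-stage training flow (illustrative names; the TD trainer is a stub).

def has_16384(board):
    """First splitting time: a 16384-tile (exponent 14) has been created."""
    return any(cell >= 14 for row in board for cell in row)

def has_16384_and_8192(board):
    """Second splitting time: both a 16384-tile and an 8192-tile are on the board."""
    cells = [cell for row in board for cell in row]
    return any(c >= 14 for c in cells) and any(c == 13 for c in cells)

def train_td(start_boards, reached_split):
    """Placeholder for TD learning: would return trained feature weights and the
    boards collected whenever reached_split first becomes true in a training game."""
    weights, collected = {}, []
    return weights, collected

# Stage 1: train from the start of the game; collect boards at the first splitting time.
stage1_weights, split1_boards = train_td(start_boards=None, reached_split=has_16384)
# Stage 2: train a fresh set of weights from those boards; collect boards at the second splitting time.
stage2_weights, split2_boards = train_td(start_boards=split1_boards, reached_split=has_16384_and_8192)
# Stage 3: train a third set of weights from the stage-2 boards.
stage3_weights, _ = train_td(start_boards=split2_boards, reached_split=None)
```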
When playing a game, we also divide it into three stages in the following way (a sketch of this stage selection follows the list).
1. Before the first splitting time, use the Stage-1 feature weights to play.
2. After the first splitting time and before the second, use the Stage-2 feature weights to play.
3. After the second splitting time, use the Stage-3 feature weights to play.
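One possible way to implement this stage selection during play is sketched below, again assuming the exponent-encoded board from the earlier sketches; the stage only advances once a splitting time has occurred and never moves backwards. The helper names are assumptions, not the paper's code.

```python
def update_stage(stage, board):
    """Advance the playing stage once a splitting time has occurred (never regress)."""
    cells = [cell for row in board for cell in row]
    if stage < 2 and any(c >= 14 for c in cells):                                   # 16384-tile created
        stage = 2
    if stage < 3 and any(c >= 14 for c in cells) and any(c == 13 for c in cells):   # plus an 8192-tile
        stage = 3
    return stage

def pick_weights(stage, stage_weights):
    """stage_weights maps a stage number (1, 2, 3) to its trained feature weights."""
    return stage_weights[stage]
```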
The idea behind using more stages is to make learning more accurate for the actions taken during the second or third stage, based on the following observation. The trained feature weights learned in the first stage (the same as in the original TD learning) tend