In most applications, the evaluation function of states can be viewed as a function of features, such as the monotonicity, the number of empty tiles, and the number of mergeable tiles [10], mentioned in Subsection 2.2. Although the true function is usually very complicated, it is commonly approximated by a linear combination of features [22] for TD learning, that is, V(s) = φ(s) · θ, where φ(s) denotes a vector of feature occurrences in s, and θ denotes a vector of feature weights.
In order to correct the value V(s) by a difference ΔV, we can adjust the feature weights θ by a difference Δθ based on ΔV, which, for linear TD(0) learning, is proportional to the feature vector φ(s). Thus, the difference Δθ is

Δθ = α ΔV φ(s)    (3)

where α is the learning rate.
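The linear TD(0) update in Eq. (3) can be sketched as follows; the feature vector phi_s and the learning rate alpha here are illustrative stand-ins, not tied to any particular 2048 feature set.

```python
import numpy as np

def td0_update(theta, phi_s, delta_v, alpha):
    """Adjust weights by delta_theta = alpha * delta_v * phi(s), as in Eq. (3)."""
    return theta + alpha * delta_v * phi_s

# Toy example: 3 features, correcting V(s) upward by delta_v = 2.0.
theta = np.zeros(3)
phi_s = np.array([1.0, 0.0, 2.0])   # feature occurrences in state s
theta = td0_update(theta, phi_s, delta_v=2.0, alpha=0.1)

# The linear evaluation V(s) = phi(s) . theta after the update:
v_s = float(phi_s @ theta)           # 0.1 * 2.0 * (1 + 0 + 4) = 1.0
```

Because the evaluation is linear in θ, the update simply moves each weight in proportion to how often its feature occurs in s.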
TD Learning for 2048. In [17], Szubert and Jaśkowski proposed TD learning for 2048. A transition from turn t to t + 1 is illustrated in Fig. 3 (below). They also proposed three kinds of methods of evaluating values for training and learning, as follows.
Fig. 3. Transition of board states
1. Evaluate actions. This method is to evaluate the function Q(s, a), which stands for the expected value of taking an action a on a state s. For 2048, an action a is one of the four directions: up, down, left, and right. This is the so-called Q-learning. In this case, the agent chooses the move with the highest expected score, as in the following formula:

a* = arg max_{a ∈ A(s)} Q(s, a)    (4)

where A(s) denotes the set of legal actions on s.
2. Evaluate states to play. This method is to evaluate the value function V(s) on a state s where the player is to move. As shown in Fig. 3, this method evaluates V(s_t) and V(s_{t+1}). The agent chooses the move with the highest expected score on s_{t+1}, as in the following formula:

a* = arg max_{a ∈ A(s)} Σ_{s''} P(s, a, s'') (R(s, a, s'') + V(s''))    (5)

where the sum is over all next states s'' reachable from s by taking action a and then adding a random tile, P(s, a, s'') is the probability of that transition, and R(s, a, s'') is the reward for it.
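The two selection rules above can be sketched as follows; the callables q, p, r, and v are made-up placeholders standing in for a learned action-value function, the tile-spawning transition model, the reward, and a learned state-value function, not real 2048 code.

```python
ACTIONS = ["up", "down", "left", "right"]

def choose_by_action_value(q, s, actions=ACTIONS):
    """Rule (4): pick the action a maximizing Q(s, a)."""
    return max(actions, key=lambda a: q(s, a))

def choose_by_state_value(p, r, v, s, actions=ACTIONS):
    """Rule (5): pick the action maximizing the expected reward plus the
    value of the next state s'' (after the random tile has been placed)."""
    def expected(a):
        return sum(prob * (r(s, a, s2) + v(s2)) for s2, prob in p(s, a))
    return max(actions, key=expected)

# Toy usage with made-up numbers (illustration only).
q = lambda s, a: {"up": 1.0, "down": 3.0, "left": 2.0, "right": 0.0}[a]
best_q = choose_by_action_value(q, s="board")      # -> "down"

p = lambda s, a: [("s1", 0.9), ("s2", 0.1)] if a == "left" else [("s0", 1.0)]
r = lambda s, a, s2: 4.0 if a == "left" else 0.0
v = {"s0": 0.0, "s1": 10.0, "s2": 0.0}.get
best_v = choose_by_state_value(p, r, v, s="board")  # -> "left"
```

Note that rule (5) must enumerate every possible tile spawn, whereas rule (4) needs only one lookup per action; this trade-off between model knowledge and evaluation cost is what distinguishes the two methods.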