For reinforcement learning, which encompasses more than just dynamic
programming, the idea of policy iteration has been generalized to the notion of
generalized policy iteration (GPI), which is illustrated in Fig. 3.6.
The GPI approach is thus the following: the policy is constantly being improved
with respect to the action-value function (policy improvement), and the
action-value function is constantly being driven toward the action-value function
of the policy (policy evaluation). If we compare Fig. 3.6 with the general case of
the adaptive analytics method shown in Fig. 1.1, the policy evaluation represents
the analysis step, while the policy improvement corresponds to the action. Virtually
all reinforcement learning methods can be described as GPI.
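To make the interplay of the two processes concrete, the following minimal Python sketch runs a GPI loop on a small MDP with a fully known model. The toy MDP, the random model data, and the function names (evaluate_policy, improve_policy) are assumptions made for this illustration and are not taken from the text.

```python
import numpy as np

# Minimal sketch of generalized policy iteration (GPI) on a small MDP
# with a fully known model. All data and names here are illustrative.

n_states, n_actions, gamma = 4, 2, 0.9
rng = np.random.default_rng(0)

# p[a, s, s'] = transition probability, r[a, s, s'] = immediate reward
p = rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))
r = rng.normal(size=(n_actions, n_states, n_states))

def evaluate_policy(policy, n_sweeps=100):
    """Policy evaluation: drive the value estimate toward the value of `policy`."""
    v = np.zeros(n_states)
    for _ in range(n_sweeps):
        for s in range(n_states):
            a = policy[s]
            v[s] = np.sum(p[a, s] * (r[a, s] + gamma * v))
    return v

def improve_policy(v):
    """Policy improvement: act greedily with respect to the action values."""
    q = np.einsum('asn,asn->as', p, r + gamma * v)  # q[a, s]
    return q.argmax(axis=0)

policy = np.zeros(n_states, dtype=int)
for _ in range(100):
    v = evaluate_policy(policy)      # evaluation step ("analysis" in Fig. 1.1)
    new_policy = improve_policy(v)   # improvement step ("action" in Fig. 1.1)
    if np.array_equal(new_policy, policy):
        break                        # policy stable under the model
    policy = new_policy
```

The two functions never need to run to completion before handing over to each other; alternating rough evaluation sweeps with greedy improvement steps is exactly the interaction that GPI describes.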
But that's not all. The classic cross-selling approach, that is, Approach I from
Chap. 2, can also be interpreted as an instance of GPI: at every process step, the
user behavior is analyzed (the data mining model is evaluated) and the most
promising products are recommended. This represents policy improvement. After this,
the user behavior, as changed by these recommendations, is analyzed afresh (a data
mining model is created), which corresponds to policy evaluation. Once again, the
most promising recommendations are derived from this, and so on. Of course, this is
not a mathematically rigorous argument, but it reveals the power of the GPI approach.
3.7 The Adaptive Case
Thus far, we have been solving the Bellman equation (3.4) merely by means of
static methods. Specifically, we have been assuming that a model of the environment
is available beforehand in the form of its transition probabilities $p_{ss'}$ and
rewards $r_{ss'}$, which then enables us to compute an optimal policy once and for
all by means of policy iteration. In this respect, we have been acting no
differently from the classical data mining approach (although, of course, the
transition to the control problem adds a new level of quality).
An interesting observation is that the policy iteration used here can be
interpreted throughout as a realtime analytics method in accordance with Fig. 1.1.
This raises the question of whether the problem (3.4) may also be solved adaptively.
The answer is yes! There are two main possibilities here:
1. Calculating the transition probabilities $p_{ss'}$ and rewards $r_{ss'}$ incrementally and then seeking an adaptive solution to (3.4), as sketched below
2. Abandoning the model of the environment completely, that is, solving the Bellman equation (3.4) indirectly, without explicit representations of $p_{ss'}$ and $r_{ss'}$
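As a rough illustration of the first possibility, the following Python sketch maintains incremental estimates of the transition probabilities and expected rewards from observed transitions. The class name and its counters are hypothetical and not part of the text; in a full system, the Bellman equation (3.4) would be re-solved against these evolving estimates.

```python
from collections import defaultdict

class IncrementalModel:
    """Hypothetical incremental estimator for p_ss' and r_ss'."""

    def __init__(self):
        self.visits = defaultdict(int)         # how often state s was left
        self.transitions = defaultdict(int)    # observed s -> s' transitions
        self.reward_mean = defaultdict(float)  # running mean reward of s -> s'

    def update(self, s, s_next, reward):
        """Incorporate one observed transition s -> s_next with its reward."""
        self.visits[s] += 1
        self.transitions[(s, s_next)] += 1
        n = self.transitions[(s, s_next)]
        # Incremental mean: the same form as the update equation (3.8) below,
        # with step size alpha = 1/n
        self.reward_mean[(s, s_next)] += (reward - self.reward_mean[(s, s_next)]) / n

    def p(self, s, s_next):
        """Current estimate of the transition probability p_ss'."""
        return self.transitions[(s, s_next)] / self.visits[s] if self.visits[s] else 0.0

    def r(self, s, s_next):
        """Current estimate of the expected reward r_ss'."""
        return self.reward_mean[(s, s_next)]
```

Each observed transition refines the model estimates, after which an adaptive variant of policy iteration can be run on the updated $p_{ss'}$ and $r_{ss'}$.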
The starting point is the fundamental update equation
$$X_{k+1} = X_k + \alpha_{k+1}\,(x_{k+1} - X_k), \qquad (3.8)$$
where $X_k$ is the estimate of the target variable $x$ in the $k$-th update step. Here
$(x_{k+1} - X_k)$ is the current estimation error. It is reduced by taking a step toward
the new observation $x_{k+1}$.
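As a simple illustration of (3.8), consider estimating the mean of a stream of observations. The following Python snippet is a sketch with hypothetical names: the step size $\alpha_{k+1} = 1/(k+1)$ reproduces the sample mean exactly, while a constant step size weights recent observations more heavily.

```python
def running_estimate(samples, alpha=None):
    """Apply update (3.8): X <- X + alpha * (x_next - X).

    alpha=None uses the step size 1/(k+1), which yields the sample mean;
    a constant alpha in (0, 1] gives an exponentially weighted estimate.
    """
    X = 0.0
    for k, x_next in enumerate(samples):
        step = alpha if alpha is not None else 1.0 / (k + 1)
        X = X + step * (x_next - X)  # reduce the estimation error (x_next - X)
    return X

print(running_estimate([2.0, 4.0, 6.0]))        # 4.0, exactly the sample mean
print(running_estimate([2.0, 4.0, 6.0], 0.5))   # 4.25, recent samples count more
```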