For reinforcement learning, which encompasses more than just dynamic
programming, the idea of policy iteration has been generalized to the notion of
generalized policy iteration (GPI), which is illustrated in Fig. 3.6.
The GPI approach is thus the following: the policy is constantly being improved
with respect to the action-value function (policy improvement), and the
action-value function is constantly being driven toward the action-value function
of the policy (policy evaluation). If we compare Fig. 3.6 with the general case of
the adaptive analytics method shown in Fig. 1.1, the policy evaluation represents
the analysis step, while the policy improvement corresponds to the action. Virtually
all reinforcement learning methods can be described as GPI.
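To make the interplay of the two processes concrete, the following minimal Python sketch runs a GPI loop on a small MDP with a fully known model. The toy MDP, the random model data, and the function names (evaluate_policy, improve_policy) are assumptions made for this illustration and are not taken from the text.

```python
import numpy as np

# Minimal sketch of generalized policy iteration (GPI) on a small MDP
# with a fully known model. All data and names here are illustrative.

n_states, n_actions, gamma = 4, 2, 0.9
rng = np.random.default_rng(0)

# p[a, s, s'] = transition probability, r[a, s, s'] = immediate reward
p = rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))
r = rng.normal(size=(n_actions, n_states, n_states))

def evaluate_policy(policy, n_sweeps=100):
    """Policy evaluation: drive the value estimate toward the value of `policy`."""
    v = np.zeros(n_states)
    for _ in range(n_sweeps):
        for s in range(n_states):
            a = policy[s]
            v[s] = np.sum(p[a, s] * (r[a, s] + gamma * v))
    return v

def improve_policy(v):
    """Policy improvement: act greedily with respect to the action values."""
    q = np.einsum('asn,asn->as', p, r + gamma * v)  # q[a, s]
    return q.argmax(axis=0)

policy = np.zeros(n_states, dtype=int)
for _ in range(100):
    v = evaluate_policy(policy)      # evaluation step ("analysis" in Fig. 1.1)
    new_policy = improve_policy(v)   # improvement step ("action" in Fig. 1.1)
    if np.array_equal(new_policy, policy):
        break                        # policy stable under the model
    policy = new_policy
```

The two functions never need to run to completion before handing over to each other; alternating rough evaluation sweeps with greedy improvement steps is exactly the interaction that GPI describes.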
But that's not all. The classic cross-selling approach, that is, Approach I from
Chap. 2, can also be interpreted as an instance of GPI: at every process step, the
user behavior is analyzed (the data mining model is evaluated) and the most
promising products are recommended. This represents policy improvement. After this,
the user behavior, as changed by these recommendations, is analyzed afresh (a data
mining model is created), which corresponds to policy evaluation. Once again, the
most promising recommendations are derived from this, and so on. Of course, this is
not a mathematically rigorous argument, but it reveals the power of the GPI approach.
3.7 The Adaptive Case
Thus far, we have been solving the Bellman equation (3.4) merely by means of
static methods. Specifically, we have been assuming that a model of the environment
is available beforehand in the form of its transition probabilities $p_{ss'}$ and
rewards $r_{ss'}$, which then enables us to compute an optimal policy once and for
all by means of policy iteration. In this respect, we have been acting no
differently from the classical data mining approach (although, of course, the
transition to the control problem adds a new level of quality).
An interesting observation is that the policy iteration used here can be
interpreted throughout as a realtime analytics method in accordance with Fig. 1.1.
This raises the question of whether the problem (3.4) may also be solved adaptively.
The answer is yes! There are two main possibilities here:
1. Calculating the transition probabilities $p_{ss'}$ and rewards $r_{ss'}$ incrementally and then seeking an adaptive solution to (3.4), as sketched below
2. Abandoning the model of the environment completely, that is, solving the Bellman equation (3.4) indirectly, without explicit representations of $p_{ss'}$ and $r_{ss'}$
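As a rough illustration of the first possibility, the following Python sketch maintains incremental estimates of the transition probabilities and expected rewards from observed transitions. The class name and its counters are hypothetical and not part of the text; in a full system, the Bellman equation (3.4) would be re-solved against these evolving estimates.

```python
from collections import defaultdict

class IncrementalModel:
    """Hypothetical incremental estimator for p_ss' and r_ss'."""

    def __init__(self):
        self.visits = defaultdict(int)         # how often state s was left
        self.transitions = defaultdict(int)    # observed s -> s' transitions
        self.reward_mean = defaultdict(float)  # running mean reward of s -> s'

    def update(self, s, s_next, reward):
        """Incorporate one observed transition s -> s_next with its reward."""
        self.visits[s] += 1
        self.transitions[(s, s_next)] += 1
        n = self.transitions[(s, s_next)]
        # Incremental mean: the same form as the update equation (3.8) below,
        # with step size alpha = 1/n
        self.reward_mean[(s, s_next)] += (reward - self.reward_mean[(s, s_next)]) / n

    def p(self, s, s_next):
        """Current estimate of the transition probability p_ss'."""
        return self.transitions[(s, s_next)] / self.visits[s] if self.visits[s] else 0.0

    def r(self, s, s_next):
        """Current estimate of the expected reward r_ss'."""
        return self.reward_mean[(s, s_next)]
```

Each observed transition refines the model estimates, after which an adaptive variant of policy iteration can be run on the updated $p_{ss'}$ and $r_{ss'}$.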
The starting point is the fundamental update equation
$$X_{k+1} = X_k + \alpha_{k+1}\,(x_{k+1} - X_k), \qquad (3.8)$$
where $X_k$ is the estimate of the target variable $x$ in the $k$-th update step. Here
$(x_{k+1} - X_k)$ is the current estimation error. It is reduced by taking a step toward
the new observation $x_{k+1}$.
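As a simple illustration of (3.8), consider estimating the mean of a stream of observations. The following Python snippet is a sketch with hypothetical names: the step size $\alpha_{k+1} = 1/(k+1)$ reproduces the sample mean exactly, while a constant step size weights recent observations more heavily.

```python
def running_estimate(samples, alpha=None):
    """Apply update (3.8): X <- X + alpha * (x_next - X).

    alpha=None uses the step size 1/(k+1), which yields the sample mean;
    a constant alpha in (0, 1] gives an exponentially weighted estimate.
    """
    X = 0.0
    for k, x_next in enumerate(samples):
        step = alpha if alpha is not None else 1.0 / (k + 1)
        X = X + step * (x_next - X)  # reduce the estimation error (x_next - X)
    return X

print(running_estimate([2.0, 4.0, 6.0]))        # 4.0, exactly the sample mean
print(running_estimate([2.0, 4.0, 6.0], 0.5))   # 4.25, recent samples count more
```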