$$\bar{s}s' := (s_2, \ldots, s_l, s').$$

Then the transition probabilities of $M$ are stipulated as

$$p_{\bar{s},\bar{s}'}(a) :=
\begin{cases}
p_{ss'}(a;\bar{s}) & \text{if } \bar{s}' = \bar{s}s',\\
0 & \text{otherwise},
\end{cases}
\qquad a \in A.$$
Similarly, the rewards are taken to be

$$r_{\bar{s},\bar{s}'}(a) :=
\begin{cases}
r_{ss'}(a;\bar{s}) & \text{if } \bar{s}' = \bar{s}s',\\
0 & \text{otherwise},
\end{cases}
\qquad a \in A.$$
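To make the construction concrete, the following sketch (not from the source; the function names `augment`, `p_k`, and `r_k` are hypothetical) tabulates the transition probabilities and rewards of $M$ from a given $k$-MDP kernel, assigning probability zero to every augmented successor other than the shifted history.

```python
from itertools import product

def augment(states, actions, p_k, r_k, k):
    """Tabulate the ordinary MDP M whose states are k-tuples of k-MDP states.

    p_k(s_next, a, hist) and r_k(s_next, a, hist) give the k-MDP's transition
    probability and reward for moving to s_next under action a, given the
    history tuple hist whose last entry is the current state.
    """
    aug_states = list(product(states, repeat=k))
    P = {}   # (hist, a, hist') -> probability; omitted triples have probability 0
    R = {}   # (hist, a, hist') -> reward
    for hist in aug_states:
        for a in actions:
            for s_next in states:
                shifted = hist[1:] + (s_next,)   # the shifted history, cf. the case above
                P[(hist, a, shifted)] = p_k(s_next, a, hist)
                R[(hist, a, shifted)] = r_k(s_next, a, hist)
    return aug_states, P, R
```

Enumerating all $k$-tuples is exponential in $k$, which is the usual price of this kind of state-space augmentation.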
Now let $(s_j)_{j\in\mathbb{N}} \in S^{\mathbb{N}}$ be a trajectory of the underlying $k$-MDP and define

$$g\colon S^{\mathbb{N}} \to \bar{S}^{\mathbb{N}}, \qquad (s_j)_{j\in\mathbb{N}} \mapsto \bigl((s_{j-k+1}, \ldots, s_j)\bigr)_{j\in\mathbb{N}},$$

where $\bar{S}$ denotes the state space of $M$ and we made use of the convention $(s_{j-k+1}, \ldots, s_j) := (s_1, \ldots, s_j)$ for $j-k+1 < 1$.
Conversely, consider

$$h\colon \bar{S}^{\mathbb{N}} \to S^{\mathbb{N}}, \qquad (\bar{s}_j)_{j\in\mathbb{N}} \mapsto (s_j)_{j\in\mathbb{N}},$$

which maps each history tuple $\bar{s}_j$ to its last component $s_j$. Then we have $h \circ g = \mathrm{id}$, i.e., the identical mapping, and all trajectories of $M$ not contained in $g(S^{\mathbb{N}})$ have vanishing probability. Furthermore, any trajectory of $M$ has the same probability and reward sequence as its image under $h$.
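The maps $g$ and $h$ translate directly into code. Below is a minimal sketch on finite trajectories (the names `lift` and `project` are hypothetical) illustrating the padding convention and the identity $h \circ g = \mathrm{id}$.

```python
def lift(traj, k):
    """g: window a finite state trajectory into history tuples.

    The tuple for position j is (s_{j-k+1}, ..., s_j); when j-k+1 < 1 it is
    truncated to (s_1, ..., s_j), mirroring the convention in the text.
    """
    return [tuple(traj[max(0, j - k + 1): j + 1]) for j in range(len(traj))]

def project(aug_traj):
    """h: recover the underlying trajectory from the history tuples."""
    return [hist[-1] for hist in aug_traj]

# h after g is the identity on trajectories:
traj = ["a", "b", "c", "d"]
assert project(lift(traj, k=2)) == traj
print(lift(traj, k=2))  # [('a',), ('a', 'b'), ('b', 'c'), ('c', 'd')]
```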
By virtue of this result, it is straightforward to verify that the state-value function $v$ of a given policy $\pi$ satisfies the Bellman equation
$$v(s_1, \ldots, s_l) = \sum_{a\in A} \pi(a \mid s_1, \ldots, s_l) \sum_{s'\in S} p_{(s_1,\ldots,s_l),\,s'}(a)\,\bigl[r_{(s_1,\ldots,s_l),\,s'}(a) + \gamma\, v(s_2, \ldots, s_l, s')\bigr]. \tag{10.1}$$
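As one way to use (10.1), the following sketch (hypothetical names; tabular kernels `p_k`, `r_k` and a policy `pi` are assumed inputs, e.g., as produced by a construction like `augment` above) performs iterative policy evaluation by repeatedly applying the right-hand side of (10.1) on the augmented states until the values stabilize.

```python
def evaluate_policy(aug_states, states, actions, pi, p_k, r_k, gamma, tol=1e-8):
    """Fixed-point iteration for the Bellman equation (10.1) over history tuples.

    pi(a, hist) is the policy probability of action a given history hist;
    p_k and r_k are the k-MDP transition probabilities and rewards as above.
    Assumes aug_states contains every k-tuple, so shifted histories are keys.
    """
    v = {hist: 0.0 for hist in aug_states}
    while True:
        delta = 0.0
        for hist in aug_states:
            new = sum(
                pi(a, hist) * p_k(s_next, a, hist)
                * (r_k(s_next, a, hist) + gamma * v[hist[1:] + (s_next,)])
                for a in actions
                for s_next in states
            )
            delta = max(delta, abs(new - v[hist]))
            v[hist] = new
        if delta < tol:
            return v
```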
Also by means of state space augmentation, we may devise a $k$-MDP generalization of temporal-difference learning. Given a transition from state $s$ to $s'$ under the history $\bar{s} = (s_1, \ldots, s_{l-1})$, the update rule reads as

$$v := v + \alpha\, z\, d, \tag{10.2}$$
where

$$d := r_{ss'}(a; \bar{s}) + \gamma\, v(s'; s_2, \ldots, s_{l-1}, s) - v(s; \bar{s}), \qquad z := \lambda\gamma\, z + e_{(s;\bar{s})}, \tag{10.3}$$

and $e_{(s;\bar{s})}$ denotes the unit vector associated with the augmented state formed by the history $\bar{s}$ followed by $s$.
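A tabular sketch of the update (10.2) and (10.3) could look as follows (hypothetical names; the observed reward is used in place of $r_{ss'}(a;\bar{s})$, and `alpha`, `gamma`, `lam` denote $\alpha$, $\gamma$, $\lambda$). Both the value table and the eligibility trace are indexed by augmented states, i.e., by the history tuple extended with the current state.

```python
from collections import defaultdict

def td_lambda_step(v, z, hist, s, s_next, reward, alpha, gamma, lam):
    """One application of (10.2)-(10.3) after observing s -> s_next under history hist.

    v and z map augmented states (history + current state) to values and
    eligibility traces, respectively.
    """
    cur = hist + (s,)                 # augmented state (s_1, ..., s_{l-1}, s)
    nxt = hist[1:] + (s, s_next)      # shifted history followed by s_next
    d = reward + gamma * v[nxt] - v[cur]       # TD error, cf. (10.3)
    for key in z:                              # decay all existing traces
        z[key] *= lam * gamma
    z[cur] += 1.0                              # z := lam*gamma*z + e_(s; hist)
    for key, trace in z.items():               # v := v + alpha*z*d, cf. (10.2)
        v[key] += alpha * trace * d
    return d

# minimal usage with defaultdicts so unseen augmented states start at zero
v, z = defaultdict(float), defaultdict(float)
td_lambda_step(v, z, hist=("a",), s="b", s_next="c",
               reward=1.0, alpha=0.1, gamma=0.9, lam=0.8)
```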