Information Technology Reference
In-Depth Information
s ,
, is generable by the convex closure of a finite set. The proof makes
use of the separation theorem, Theorem 2.7.
{ ʔ | s ʔ }
Theorem 6.9 (Finite Generability) Let P
be the finite set of
deriva ti ve policies in a finitary pLTS. For any state s and subdistribution ʔ in the
pLTS, s
={
dp 1 , ... , dp k }
ʔ if and only if ʔ
∈{
Der dp i ( s ) |
1
i
k
}
.
Proof
(
) For convenience, l e t X denote the set
{
Der dp i ( s )
|
1
i
k
}
. Sup-
pose, for a contradiction, that s
ʔ for some ʔ not in X . Recall that we are
assuming that the underlying state space is S
={
x 1 , ...x n }
so that X is a subset of
1, 1] n , and by definition it is convex; by Lemma 6.9
it follows that X is (Cauchy) closed.
By the separation theorem, ʔ can be separated from X by a hyperplane H . What
this means is that there is some function r H : S
n . It is trivially bounded by [
R
ₒ R
and constant c
∈ R
such that
either
(a) r H ·
ʘ<c for all ʘ
X and r H ·
ʔ>c or,
(b) r H ·
ʘ>c for all ʘ
X and r H ·
ʔ<c
In fact, from case (b) we can obtain case (a) by negating both the constant c and
the components of the function r H ; so we can assume (a) to be true. Moreover, by
scaling with respect to the largest r H ( s i ),
1
i
n , we can assume that r H is
actually a reward function.
In particular (a) means that r H ·
Der dp i ( s ) <c , and therefore that
r H ·
Der dp i ( s ) < r H · ʔ
for each of derivative polices dp i . But this contradicts Theorem 6.8 which claims
that there must be some 1
r max ( s )
i
n such that r H ·
Der dp i ( s )
= P
r H ·
ʔ .
(
) Note that by definition s
Der dp i ( s ) for every derivative policy dp i with
1
i
k . By Proposition 6.3 and Remark 6.1 , the relation
is convex. It follows
that if ʔ
Extreme policies, as given in Definition 4.8, are particular kinds of derivative
policies, designed for pLTSs of the form
∈{
Der dp 1 ( s ), ... , Der dp k }
, then s
ʔ .
R , ʩ ˄ ,
R
. The significant constraint on
˄
−ₒ
extreme policies is that for any state s if s
then ep ( s ) must be defined. As a
consequence, in the computation determined by ep if a state can contribute to the
computation at any stage it must contribute.
Lemma 6.11 Let ep be any extreme policy. Then s ep ʔ implies s ʔ.
Proof Consider the derivation of ʔ as in Definition 6.4 , and determined by the
extreme policy ep . Since ʔ
= k 0 ʔ k
it is sufficient to show that each ʔ k
is
˄
−ₒ
implies s ʔ k
stable, that is s
.
Since ep is an extreme policy, Definition 4.8 ensures that ep ( s ) is defined. From
the definition of a computation, Definition 6.4 , we know ʔ k =
ʔ k
ʔ k
+
and
since the computation is guided by the policy ep we have that ʔ k
( s )
=
ʔ k ( s ). An
immediate consequence is that ʔ k ( s )
=
0.
 
Search WWH ::




Custom Search