Lemma 6.10
Let dp be a derivative policy in a pLTS. Then
(a) If $s \Longrightarrow_{dp} \Delta$ and $s \Longrightarrow_{dp} \Theta$ then $\Delta = \Theta$.
(b) For every state s there exists some $\Delta$ such that $s \Longrightarrow_{dp} \Delta$.
Proof
To prove part (a), consider the derivations of $s \Longrightarrow \Delta$ and $s \Longrightarrow \Theta$ as in Definition 6.4, via the subdistributions $\Delta_k, \Delta_k^{\rightarrow}, \Delta_k^{\times}$ and $\Theta_k, \Theta_k^{\rightarrow}, \Theta_k^{\times}$, respectively. Because both derivations are guided by the same derivative policy dp, it is easy to show by induction on k that
$$\Delta_k = \Theta_k, \qquad \Delta_k^{\rightarrow} = \Theta_k^{\rightarrow}, \qquad \Delta_k^{\times} = \Theta_k^{\times},$$
from which $\Delta = \Theta$ follows immediately.
To prove (b), we use dp to generate subdistributions $\Delta_k, \Delta_k^{\rightarrow}, \Delta_k^{\times}$ for each $k \geq 0$ satisfying the constraints of Definition 6.4 and simultaneously those in Definition 6.12 above. The result will then follow by letting $\Delta$ be $\sum_{k \geq 0} \Delta_k^{\times}$.
The net effect of this lemma is that a derivative policy dp determines a total function from states to derivations. Let $\mathsf{Der}_{dp} : S \rightarrow \mathcal{D}(S)$ be defined by letting $\mathsf{Der}_{dp}(s)$ be the unique $\Delta$ such that $s \Longrightarrow_{dp} \Delta$.
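To make this construction concrete, here is a small sketch (our own illustration, not taken from the text) of how a derivative policy determines the unique weak derivative of a state. It assumes a simplified representation in which a subdistribution is a finite mapping from states to probabilities, and the policy dp is a partial mapping, given as a dict, sending each state at which it chooses to perform an internal move to the distribution reached by that move; states outside its domain stop. The function name der_dp and the cut-off parameter are hypothetical.

```python
# A minimal sketch (hypothetical representation, not from the text):
# subdistributions are dicts mapping states to probabilities, and a
# derivative policy dp is a dict mapping each state at which it chooses
# to move to the distribution reached by that internal move.

def der_dp(s, dp, steps=1000):
    """Approximate Der_dp(s): the unique Delta with s ==>_dp Delta.

    Follows the construction in part (b) of the lemma: split each Delta_k
    into the part on states where dp is undefined (accumulated into the
    result, playing the role of Delta_k^x) and the part on states where
    dp is defined (pushed forward through dp to obtain Delta_{k+1}).
    """
    delta_k = {s: 1.0}           # Delta_0 is the point distribution on s
    result = {}                  # running sum of the Delta_k^x
    for _ in range(steps):       # finite cut-off approximating the limit
        next_k = {}
        for state, p in delta_k.items():
            if state in dp:      # policy moves on: contributes to Delta_k^->
                for t, q in dp[state].items():
                    next_k[t] = next_k.get(t, 0.0) + p * q
            else:                # policy stops here: contributes to Delta_k^x
                result[state] = result.get(state, 0.0) + p
        delta_k = next_k
        if not delta_k:          # no mass left to move: the sum has converged
            break
    return result
```

The infinite sum $\sum_{k \geq 0} \Delta_k^{\times}$ is approximated here by a finite number of iterations; mass that keeps moving forever corresponds to divergence and is lost, which is why a weak derivative may be a proper subdistribution.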
It should be clear that the use of derivative policies limits considerably the scope
for deriving weak derivations. Each particular policy can only derive one weak
derivative, and moreover in a finitary pLTS there are only finitely many derivative policies. Nevertheless, we will show that this limitation is more apparent than real. In Sect. 4.6.1 we saw how the more restrictive extreme policies ep could, in fact, realise the maximum value attainable by any resolution of a finitely branching pLTS. Here we generalise this result by replacing resolutions with arbitrary reward functions.
Definition 6.13 (Rewards and Payoffs) Let S be a set of states. A reward function
is a function $r : S \rightarrow [-1, 1]$. With $S = \{s_1, \ldots, s_n\}$, we often consider a reward function as the n-dimensional vector $\langle r(s_1), \ldots, r(s_n) \rangle$. In this way, we can use the notation $r \cdot \Delta$ to stand for the inner product of two vectors.
Given such a reward function, we define the payoff function $\mathbb{P}^{\max}_{r} : S \rightarrow \mathbb{R}$ by
$$\mathbb{P}^{\max}_{r}(s) \;=\; \sup\{\, r \cdot \Delta \mid s \Longrightarrow \Delta \,\}.$$
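As an illustration (ours, not the book's), the dictionary-based sketch above extends directly to rewards: a reward function becomes a mapping from states to values in [-1, 1], and $r \cdot \Delta$ is a finite inner product over the support of $\Delta$. The states, probabilities and reward values below are made up for the example.

```python
# Continuing the sketch above (hypothetical representation, not from the text):
# a reward function is a dict from states to values in [-1, 1], and r . Delta
# is the inner product taken over the finite support of the subdistribution.

def inner_product(r, delta):
    """Compute r . Delta = sum over states of r(state) * Delta(state)."""
    return sum(r.get(state, 0.0) * p for state, p in delta.items())

# Made-up two-state example: from s the policy either loops back to s or
# moves to the stopping state t, each with probability 1/2, so Der_dp(s)
# is (approximately) the point distribution on t and the realised value
# of r . Der_dp(s) is close to r(t).
dp = {"s": {"s": 0.5, "t": 0.5}}
r = {"s": -1.0, "t": 1.0}
print(inner_product(r, der_dp("s", dp)))   # close to 1.0
```

Each derivative policy realises one such inner product; the payoff $\mathbb{P}^{\max}_{r}(s)$ is the supremum over all weak derivatives of s, and the theorem below states that in a finitary pLTS this supremum is attained by some policy.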
A priori these payoff functions for a given state s are determined by the set of
weak derivatives of s . However, the main result of this section is that they can, in
fact, always be realised by derivative policies.
Theorem 6.8 (Realising Payoffs) In a finitary pLTS, for every reward function r
there exists some derivative policy dp such that
$$\mathbb{P}^{\max}_{r}(s) \;=\; r \cdot \mathsf{Der}_{dp}(s).$$
Proof As with Theorem 4.3, there is a temptation to give a constructive proof here,
defining the effect of the required derivative policy dp at state s by considering the
application of the reward function r to both s and all of its derivatives. However, this
is not possible, as the example below explains.
Instead the proof is nonconstructive, requiring discounted policies. The overall
structure of the proof is similar to that of Theorem 4.3, but the use of (discounted)