Information Technology Reference
In-Depth Information
6.9
Real-Reward Testing
In Sect. 4.4, we introduced a notion of reward testing inspired by [ 12 ]. The idea is to
associate with each success action a nonnegative reward, and performing a success
action means accumulating some reward. The outcomes of this reward testing are
nonnegative expected rewards.
In certain occasions, it is very natural to introduce negative rewards. For example,
this is the case in the theory of Markov decision processes [ 6 ]. Intuitively, we could
understand negative rewards as costs, while positive rewards are often viewed as
benefits or profits. Consider, for instance, the (nonprobabilistic) processes Q 1 and
Q 2 with initial states q 1 and q 2 , respectively, in Fig. 6.5 . Here a represents the action
of making an investment. Assuming that the investment is made by bidding for some
commodity, the ˄ -action represents an unsuccessful bid—if this happens one simply
tries again. Now b represents the action of reaping the benefits of this investment.
Whereas Q 1 models a process in which making the investment is always followed by
an opportunity to reap the benefits, the process Q 2 allows, nondeterministically, for
the possibility that the investment is unsuccessful, so that a does not always lead to a
state where b is enabled. The test T with initial state t , which will be explained later,
allows us to give a negative reward to action ˉ 1 —its cost—and a positive reward to
ˉ 2 .
This leads to the question: if both negative- and positive rewards are allowed, how
would the original reward-testing semantics change? 3 We refer to the more relaxed
form of testing as real-reward testing and the original one as nonnegative-reward
testing .
The power of real-reward testing is illustrated in Fig. 6.5 . The two (nonprobabilis-
tic) processes in the left- and central diagrams are equivalent under (probabilistic)
may- as well as must testing; the ˄ -loops in the initial states cause both processes
to fail any nontrivial must test. Yet, if a reward of
2 is associated with performing
the action ˉ 1 , and a reward of 4 with the subsequent performance of ˉ 2 , it turns out
that in the first process the net reward is either 0, if the process remains stuck in its
initial state, or positive, whereas running the second process may yield a loss. See
Example 6.27 for details of how these rewards are assigned, and how net rewards are
associated with the application of tests such as T . This example shows that for pro-
cesses that may exhibit divergence, real-reward testing is more discriminating than
nonnegative-reward testing, or other forms of probabilistic testing. It also illustrates
that the extra power is relevant in applications.
3
1]
can be converted into a nonnegative assignment simply by adding 1 to all of them. But that would
not preserve the testing order in the case of zero-outcomes that resulted from a process' s failing to
reach any success state at all: those zeroes would remain zero.
One might suspect no change at all, for any assignment of rewards from the interval [
1,
+
Search WWH ::




Custom Search