easily reconstruct the code word code(s)#code(r). Finally, this word is split at the separator (#), and the resulting parts represent the original state and reward, respectively. Obviously, this decoding also runs in polynomial time. By construction it holds that dec(enc(R(s, a), s)) = (R(s, a), s). Clearly, other polynomial time approaches exist and can be used if they obey the requirements stated above.
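To make this concrete, here is a minimal Python sketch of such an encoding and decoding pair. Plain string representations stand in for the paper's code(·) function, and the names enc and dec mirror the text; both choices are illustrative assumptions rather than the authors' implementation.

def enc(reward: float, state: str) -> str:
    # Build the code word code(s)#code(r); the '#' separator is assumed
    # never to occur inside a code word. Runs in linear time.
    assert "#" not in state, "separator must not occur inside code words"
    return f"{state}#{reward}"

def dec(word: str) -> tuple[float, str]:
    # Split at the separator and recover the original reward and state,
    # again in linear (hence polynomial) time.
    state, reward = word.split("#")
    return float(reward), state

# By construction, dec(enc(R(s, a), s)) == (R(s, a), s):
assert dec(enc(4.2, "s17")) == (4.2, "s17")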
4 Complexity Results and Expressiveness
Based on [6] and [8], we first briefly recall some theoretical foundations. Complexity theory classifies computational problems into complexity classes. For convenience, important techniques are often designed for decision problems, i.e., problems whose solution is either yes or no. Reinforcement learning, however, is not a decision problem but an optimization problem, as agents seek to maximize the expected return. Hence, to investigate RL problems using complexity theory, we have to convert them into decision problems. A simple and common technique is to add a bound B to an optimization problem O and to ask whether O can be solved such that the optimization objective is at least (or at most) B. Clearly, such a converted problem is at most as hard as the original problem.
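To illustrate this conversion, the following Python sketch (our example, not the paper's) wraps a shortest-path solver, a minimization problem, into a decision problem that asks whether the objective is at most the bound B.

import heapq

def shortest_path_length(graph, src, dst):
    # Optimization problem O: minimize the path length from src to dst.
    # graph maps each node to a list of (neighbor, edge_weight) pairs.
    dist = {src: 0}
    heap = [(0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == dst:
            return d
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, w in graph.get(u, []):
            if d + w < dist.get(v, float("inf")):
                dist[v] = d + w
                heapq.heappush(heap, (d + w, v))
    return float("inf")  # dst is unreachable

def shortest_path_decision(graph, src, dst, bound):
    # Decision version of O: can it be solved with objective at most B?
    return shortest_path_length(graph, src, dst) <= bound

graph = {"a": [("b", 2), ("c", 5)], "b": [("c", 1)]}
assert shortest_path_decision(graph, "a", "c", bound=3)      # a-b-c has length 3
assert not shortest_path_decision(graph, "a", "c", bound=2)

The decision version can be answered by solving the optimization problem once and comparing against B, which is why it is at most as hard as the original problem.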
Different complexity classes exist, among which P and NP are fundamental. Decision problems that belong to the class P are solvable in polynomial time, while those belonging to NP can be solved using nondeterministic polynomial time algorithms. It is known that P ⊆ NP and widely believed that P ≠ NP.

To show that two problems fall into the same complexity class, a technique called polynomial time reduction is used. Therefore, one needs a polynomial time reduction function f : O → O′ that i) transforms any instance o ∈ O of a decision problem into an instance f(o) ∈ O′, and which ii) ensures that o evaluates to yes if and only if f(o) evaluates to yes. We write O ≤p O′ to denote that a problem O is polynomial time reducible to a problem O′.
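To make the shape of such a reduction function tangible, the following Python sketch applies it to a classic pair of problems, INDEPENDENT SET and VERTEX COVER; this concrete example is ours and is not taken from the paper. The brute-force solvers are included only so that property ii) can be checked on a toy instance.

from itertools import combinations

def has_independent_set(graph, k):
    # Decision problem O: does the graph contain k pairwise non-adjacent vertices?
    vertices, edges = graph
    for cand in combinations(vertices, k):
        s = set(cand)
        if all(not (u in s and v in s) for (u, v) in edges):
            return True
    return False

def has_vertex_cover(graph, k):
    # Decision problem O': do at most k vertices touch every edge?
    vertices, edges = graph
    for size in range(k + 1):
        for cand in combinations(vertices, size):
            s = set(cand)
            if all(u in s or v in s for (u, v) in edges):
                return True
    return False

def reduce_is_to_vc(graph, k):
    # Reduction f: the IS-instance (graph, k) maps to the VC-instance
    # (graph, |V| - k); computing f is obviously polynomial.
    vertices, _ = graph
    return graph, len(vertices) - k

# Property ii): o evaluates to yes iff f(o) evaluates to yes.
g = ({1, 2, 3, 4}, [(1, 2), (2, 3), (3, 4)])
for k in range(5):
    assert has_independent_set(g, k) == has_vertex_cover(*reduce_is_to_vc(g, k))

Note that only the reduction f itself needs to run in polynomial time; the exponential-time solvers above serve purely to verify the yes/no equivalence on the toy graph.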
In the remainder of this section, we will show that the proposed conceptual model is equivalent to the state-of-the-art model that provides both state and reward signals to the agent. To do so, we first have to formally define our understanding of equivalent models. Therefore, we adopt the same notion and approach as Seuken and Zilberstein [17] and use polynomial time reductions:
Definition 1. Two models M1 and M2 are said to be polynomial time equivalent if and only if M1 ≤p M2 and M2 ≤p M1.
By this definition, two models are polynomial time equivalent if they can express
the same problems and belong to the same complexity class.
Next, we formally define the models that were illustrated in the previous section. Fig. 1(a) essentially depicts a Markov decision process (MDP) which, based on [9], can be defined as follows:
Definition 2. (Markov Decision Process) A Markov decision process (MDP) is defined as a tuple ⟨S, A, δ, R⟩, where S is a finite set of world states and A is a finite set of possible actions. The state-transition function δ : S × A → Π(S) maps each state and action to a probability distribution over world states.
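Although the definition remains formal, the tuple maps naturally onto a small data structure. The following Python sketch is our illustration under the assumption that δ is stored as a table of explicit probability distributions; none of the names come from the paper.

import random

class MDP:
    # Sketch of the tuple <S, A, delta, R> from Definition 2.
    def __init__(self, states, actions, delta, reward):
        self.states = states    # finite set S of world states
        self.actions = actions  # finite set A of possible actions
        self.delta = delta      # delta[(s, a)]: dict mapping s' to Pr(s' | s, a)
        self.reward = reward    # reward[(s, a)]: real-valued reward R(s, a)

    def step(self, s, a):
        # Sample a successor state from the distribution delta(s, a) in Pi(S)
        # and return it together with the immediate reward R(s, a).
        dist = self.delta[(s, a)]
        (s_next,) = random.choices(list(dist), weights=list(dist.values()))
        return self.reward[(s, a)], s_next

mdp = MDP(states={"s0", "s1"}, actions={"a"},
          delta={("s0", "a"): {"s0": 0.1, "s1": 0.9}},
          reward={("s0", "a"): 1.0})
reward, successor = mdp.step("s0", "a")  # e.g. (1.0, "s1") with probability 0.9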