On the Power of Global Reward Signals in Reinforcement Learning - Multiagent System Technologies

Information Technology Reference

In-Depth Information

remainder of this article. First, we will show that the engineered reward-based

approach has the same computational complexity and expressive power as the

state of the art approach in fully observable games, i.e. in MDPs and their multi-

agent counterparts named stochastic games (SG). Having this in mind, we then

will consider partially observable domains. Engineered reward signals in such

domains augment the expressive power compared to ordinary partially observ-

able models. However, time complexity remains and constructing an engineered

reward signal is not forbidden in state of the art models. From this it follows that

one could easily—accidentally or intentionally—supply agents with knowledge

they are not supposed to obtain, e.g. on states or joint actions. Based on this

observation, we then propose to add an additional requirement on reward sig-

nals in existing models that formally prevents such exploitation. The engineered

reward-based model in this work, thus, should mainly be understood as a means

to investigate the power of global rewards. However, as described in the context

of an example in Sect. 6, it is also usefulinproblemswithaspecial structure.

A Polynomial Time Coding Approach. A general encoding function enc

is presented in Alg. 1. Its basic idea is to create a string code ( s )# code ( r )that

combines codes of state s and reward r separated by a special character (#).

This string is then converted to a bit string, whereas each character is encoded

with the same number of bits b (e.g. b = 8 bits as in ASCII encoding). A single '1'

is added at the head of the resulting bit string for decoding reasons as explained

below. The resulting bit string is then interpreted as a number. Note that the

constructed bit strings are bounded, since rewards and state representations

are bounded, too. This approach requires time linear in the length of the bit

string. Note that states may also contain information about executed (joint)

actions. A corresponding decoding function dec is straightforward. The idea is

to interpret a received reward as bit string. By construction, the string starts

with a '1', which is removed. The remaining bit string, that might start with

zeros now, is split into substrings, each of length b ,where b equals the value

of b in the encoding procedure (e.g. b = 8 for ASCII encoding, cf. Alg. 1, line

7). Then, these substrings can be interpreted as characters, from which we can

Algorithm 1. (Encoding algorithm)

Input: state s ∈S ,reward r ∈ R

Output: engineered reward r ∈ R

1: procedure Encoding (

s, r

)

cs ←

convert state

to character array (

cs 1 ,cs 2 ,...,cs n )

cr ← convert reward r to character array ( cr 1 ,cr 2 ,...,cr m )

codeword ← Concatenate ( cs , # , cr )

create code word code ( s )# code ( r )

bitstream ← 1

initialize new bit stream with 1

for i =1 ,...,| codeword | do

bitstream ← Concatenate ( bitstream , bitcode ( codeword [ i ] ,b ))

use b bits

to encode codeword [ i ]

return ToNumber ( bitstream )

interpret bitstream as number and return it

Multiagent System Technologies

Search WWH ::

Custom Search

Home