[Figure 2.2: panel (a) shows the optimal policy on the unit square (axes x and y, with circles marking the puddle); panel (b) shows the optimal value function over the same square (axes x, y, and value).]
Fig. 2.2. Optimal policy and value function for a discretised version of the "puddleworld" task [208]. The agent is located on a 1x1 square and can perform steps of size 0.05 in any of four directions. Each step that the agent performs results in a reward of -1, except for actions that cause the agent to end up in a puddle, which result in a reward of -20. The zero-reward absorbing goal state is in the upper right corner of the square. Thus, the task is to reach this state in the smallest number of steps while avoiding the puddle. The circles in (a) show the location of the puddle. (b) illustrates the optimal value function for this task, which gives the maximum expected sum of rewards for each state, and clearly shows the impact of the high negative reward of the puddle. Knowing this value function allows constructing the optimal policy, as given by the arrows in (a), by choosing in each state the action that maximises the sum of the immediate reward and the value of the next state.
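To make the last step of the caption concrete, the following sketch shows how a greedy policy can be read off a given value function: in each state, pick the action that maximises the immediate reward plus the value of the resulting state. This is a minimal illustration only; the 21x21 grid indexing, the simplified rectangular puddle test, and the value array V are assumptions made for the example and are not taken from the text.

    import numpy as np

    N = 21                      # 0.05 steps on the unit square give 21 grid points per axis
    ACTIONS = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}
    GOAL = (N - 1, N - 1)       # absorbing goal state in the upper right corner

    def in_puddle(x, y):
        # Placeholder puddle test; the real task uses differently shaped puddles.
        return 0.1 <= x * 0.05 <= 0.45 and 0.65 <= y * 0.05 <= 0.8

    def reward(x, y):
        # Reward for ending a step at (x, y): -20 inside the puddle, -1 otherwise.
        return -20.0 if in_puddle(x, y) else -1.0

    def greedy_policy(V):
        # For each state, choose the action maximising immediate reward plus next-state value.
        policy = {}
        for x in range(N):
            for y in range(N):
                if (x, y) == GOAL:
                    continue
                best_a, best_q = None, -np.inf
                for a, (dx, dy) in ACTIONS.items():
                    nx = min(max(x + dx, 0), N - 1)   # clip steps to the unit square
                    ny = min(max(y + dy, 0), N - 1)
                    q = reward(nx, ny) + V[nx, ny]
                    if q > best_q:
                        best_a, best_q = a, q
                policy[(x, y)] = best_a
        return policy

    V = np.zeros((N, N))        # a value function computed, e.g., by dynamic programming
    policy = greedy_policy(V)

With the true optimal value function of Fig. 2.2(b) in place of the zero array, this procedure yields exactly the arrows shown in Fig. 2.2(a).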
by the previous hidden state and the performed action. For each such state transition the agent receives a scalar reward or payoff that can depend on the previous hidden and observable state and the chosen action. The aim of the agent is to learn which actions to perform in each observed state (called the policy) such that the received reward is maximised in the long run.
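As a rough illustration of this interaction, the sketch below steps a toy environment whose hidden dynamics, observations, and rewards are purely illustrative assumptions (not taken from the text); it only shows that the agent sees an observation and a scalar reward, never the hidden state itself.

    import random

    class ToyEnvironment:
        # The hidden state evolves from the previous hidden state and the action;
        # the agent only ever receives an observation and a scalar reward.
        def __init__(self):
            self.hidden_state = 0

        def step(self, action):
            self.hidden_state = (self.hidden_state + action) % 10   # toy transition
            observation = self.hidden_state // 2                    # partial view of the state
            reward = 0.0 if self.hidden_state == 0 else -1.0        # toy reward
            return observation, reward

    env = ToyEnvironment()
    total_reward = 0.0
    obs = 0
    for t in range(100):
        action = random.choice([0, 1])   # a learned policy would choose based on obs
        obs, r = env.step(action)
        total_reward += r                # the quantity the agent aims to maximise long-term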
Such a task definition is known as a Partially Observable Markov Decision Process (POMDP) [122]. Its variables and their interactions are illustrated in Fig. 2.1(a). It is able to describe a large number of seemingly different problem types. Consider, for example, a rat that needs to find the location of food in a maze: in this case the rat is the agent and the maze is the environment, and a reward of -1 is given for each movement that the rat performs until the food is found, which leads the rat to minimise the number of movements required to reach the food. A game of chess can also be described as a POMDP, where the white player becomes the agent, and the black player and the chess board define the environment. Further examples include path planning, robot control, stock market prediction, and network routing.
While the POMDP framework allows the specification of complex tasks, finding their solution is equally complicated. The difficulty arises mostly from the agent not having access to the true state of the environment. Thus, most of the recent work in LCS has focused on a special case of POMDP problems that treats the hidden and observable states of the environment as equivalent. Such problems