Changing Not Just Analyzing: Control Theory and Reinforcement Learning - Realtime Data Mining


3.1 Modeling

RL was originally based on methods of dynamic programming (DP, the mathematical theory of optimal control), although in machine learning the theory and terminology have since developed beyond DP. Central to it, as is usual in AI, is the notion of the agent. Figure 3.1 shows the interaction between agent and environment in reinforcement learning.

The agent passes into a new state (s), for which it receives a reward (r) from the environment, whereupon it decides on a new action (a) from the set of actions admissible in s, A(s). In most cases the agent learns from this experience. The environment then responds to the action in turn, and the cycle repeats. We distinguish between episodic tasks, which come to an end (as in a game), and continuing tasks, which have no end state (such as a service robot that moves around indefinitely). The goal of the agent is to select the action in each state so as to maximize the sum of all rewards over the entire episode. The rule by which the agent selects its actions is referred to as its policy π, and a policy that maximizes the sum of all rewards is referred to as an optimal policy.

Example 3.1 As a first example of RL, consider a robot that is required to reach a destination as quickly as possible. The states are its coordinates, the actions are the choice of direction of travel, and the reward at every step is -1. To maximize the sum of rewards over the entire episode, the robot must therefore reach its goal in the fewest possible steps.

■
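The interaction loop and the robot example can be made concrete with a small sketch. Everything specific here is an assumption chosen for illustration and is not taken from the text: the 5x5 grid, the goal position, the names `admissible_actions`, `step`, and `run_episode`, and the two hand-written policies. The only ingredients from the example itself are coordinates as states, directions as actions, and a reward of -1 per step.

```python
from typing import Callable, List, Tuple

State = Tuple[int, int]    # robot coordinates (x, y)
Action = Tuple[int, int]   # direction of travel as a unit step (dx, dy)

SIZE = 5                   # hypothetical 5x5 grid
GOAL: State = (3, 2)       # hypothetical destination
DIRECTIONS: List[Action] = [(1, 0), (-1, 0), (0, 1), (0, -1)]


def admissible_actions(s: State) -> List[Action]:
    """A(s): the directions that keep the robot on the grid."""
    return [(dx, dy) for dx, dy in DIRECTIONS
            if 0 <= s[0] + dx < SIZE and 0 <= s[1] + dy < SIZE]


def step(s: State, a: Action) -> Tuple[State, float]:
    """Environment response: the next state and a reward of -1 per step."""
    return (s[0] + a[0], s[1] + a[1]), -1.0


def run_episode(policy: Callable[[State], Action], start: State,
                max_steps: int = 100) -> float:
    """Run one episode and return the sum of all rewards received."""
    s, total = start, 0.0
    for _ in range(max_steps):
        if s == GOAL:                        # end state: the episode is over
            break
        a = policy(s)                        # the agent chooses an action
        assert a in admissible_actions(s)    # it must come from A(s)
        s, r = step(s, a)                    # the environment responds
        total += r
    return total


def direct_policy(s: State) -> Action:
    """Move along x first, then along y, straight toward the goal."""
    if s[0] != GOAL[0]:
        return (1 if GOAL[0] > s[0] else -1, 0)
    return (0, 1 if GOAL[1] > s[1] else -1)


def detour_policy(s: State) -> Action:
    """Run along the bottom row to the right edge before heading to the goal."""
    if s[1] == 0 and s[0] < SIZE - 1:
        return (1, 0)
    if s[0] == SIZE - 1 and s[1] < GOAL[1]:
        return (0, 1)
    return direct_policy(s)


direct_return = run_episode(direct_policy, (0, 0))
detour_return = run_episode(detour_policy, (0, 0))
```

Starting from (0, 0), the direct policy needs 5 steps and the detour needs 7, so the direct policy collects the larger sum of rewards (-5 against -7). No learning takes place in this sketch; it only illustrates why maximizing the sum of rewards is the same as minimizing the number of steps.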

Example 3.2 A further example is, once again, chess, where the positions of the pieces are the states, the moves are the actions, and the reward is always 0 except in the final position, where it is 1 for a win, 0 for a draw, and -1 for a loss (this is what we call a delayed reward).

■
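A delayed reward of this kind can be written down in a few lines. The sketch below is illustrative only: the names `TERMINAL_REWARD`, `episode_rewards`, and `episode_return` are hypothetical, and the game is reduced to its number of moves and its outcome.

```python
from typing import List

# Reward in the final position, as in the chess example.
TERMINAL_REWARD = {"win": 1.0, "draw": 0.0, "loss": -1.0}


def episode_rewards(num_moves: int, outcome: str) -> List[float]:
    """Reward sequence of one game: 0 after every move except the last."""
    if num_moves < 1:
        raise ValueError("a game needs at least one move")
    return [0.0] * (num_moves - 1) + [TERMINAL_REWARD[outcome]]


def episode_return(rewards: List[float]) -> float:
    """Sum of all rewards over the entire episode."""
    return sum(rewards)


rewards = episode_rewards(4, "loss")   # [0.0, 0.0, 0.0, -1.0]
total = episode_return(rewards)        # -1.0
```

Because every intermediate reward is 0, the sum over the episode equals the terminal reward, so the agent receives no direct signal about which of its earlier moves were good or bad.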

Example 3.3 A final example, to which we will dedicate more intensive study, is

recommendation engines. Here, for instance, the product detail views are the states,

the recommended products are the actions, and the purchases of the products are the

rewards.

■
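As a rough sketch of how this mapping might look in data, the following code turns one browsing session into (state, action, reward, next state) transitions. The `Event` and `Transition` types, the function `to_transitions`, the product identifiers, and the choice of a reward of 1 per purchase are all assumptions made for illustration; the text only fixes that views are states, recommendations are actions, and purchases are rewards.

```python
from typing import List, NamedTuple, Optional


class Event(NamedTuple):
    viewed: str        # product whose detail view is shown (the state)
    recommended: str   # product recommended on that view (the action)
    purchased: bool    # whether a purchase followed (source of the reward)


class Transition(NamedTuple):
    state: str
    action: str
    reward: float
    next_state: Optional[str]   # None once the session has ended


def to_transitions(session: List[Event]) -> List[Transition]:
    """Map a session to transitions; reward 1.0 per purchase is an assumption."""
    transitions = []
    for i, event in enumerate(session):
        next_state = session[i + 1].viewed if i + 1 < len(session) else None
        reward = 1.0 if event.purchased else 0.0
        transitions.append(
            Transition(event.viewed, event.recommended, reward, next_state))
    return transitions


# Hypothetical session: three product detail views, one purchase.
session = [
    Event("shoes", "socks", False),
    Event("socks", "laces", True),
    Event("laces", "polish", False),
]
transitions = to_transitions(session)
total_reward = sum(t.reward for t in transitions)
```

A session treated this way is an episodic task: it ends with the last view, and the sum of rewards over the episode is the quantity the recommendation policy would try to maximize.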

Fig. 3.1 The interaction between agent and environment in RL. Given state s_t and reward r_t, the agent sends action a_t to the environment, which returns reward r_{t+1} and state s_{t+1}.
