The policy is the decision-making function of the agent: it specifies what
action the agent takes in each of the situations it might encounter. In psychology,
this would correspond to a set of stimulus-response rules or associations. The
policy is the core of a reinforcement learning agent, as suggested by Figure 10.2,
because it alone is sufficient to define a full, behaving agent; the other components
serve only to change and improve the policy. The policy is thus the ultimate
determinant of behavior and performance. In general it may be stochastic,
specifying a probability distribution over actions rather than a single action.
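As a concrete illustration (my own sketch, not an example from the text), a stochastic policy over a small, discrete set of states and actions can be represented as a table mapping each state to a probability distribution over actions, from which an action is sampled:

    import random

    # Hypothetical states and actions, used only for illustration.
    policy = {
        "s0": {"left": 0.8, "right": 0.2},
        "s1": {"left": 0.5, "right": 0.5},
    }

    def select_action(state):
        # Sample an action according to the policy's distribution for this state.
        probs = policy[state]
        return random.choices(list(probs), weights=list(probs.values()), k=1)[0]

A deterministic policy is the special case in which one action per state has probability 1.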
The reward function defines the goal of the RL agent. The agent's objective is
to maximize the reward it receives over the long run, so the reward function
defines which events are good and which are bad for the agent. Rewards are the
immediate and defining features of the problem the agent faces; as such, the
reward function must necessarily be fixed (the agent cannot alter it). It may,
however, be used as a basis for changing the policy. For example, if an action
selected by the policy is followed by low reward, then the policy may be changed
to select some other action in that situation in the future, as the sketch below
illustrates.
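The following Python sketch is an illustration under assumed conventions, not a method given in the text: it nudges a tabular stochastic policy, of the kind sketched above, away from an action whenever that action was followed by a reward below some threshold.

    def adjust_policy(policy, state, action, reward, threshold=0.0, step=0.1):
        # If the observed reward is low, reduce the probability of the action
        # just taken in this state and renormalise the remaining probabilities.
        if reward < threshold:
            probs = policy[state]
            probs[action] = max(probs[action] - step, 0.01)
            total = sum(probs.values())
            for a in probs:
                probs[a] /= total

The threshold and step size here are arbitrary placeholders; any scheme that shifts probability toward better-rewarded actions expresses the same idea.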
Whereas the reward indicates what is good in an immediate sense, the value
function specifies what is good in the long run, because it predicts the reward
that can be accumulated from a state onward. The distinction between value and
reward is critical to RL. For example, when playing chess, checkmating your
opponent is associated with high reward, whereas capturing the opponent's queen
is associated with high value. The former defines the true goal of the task,
winning the game, whereas the latter merely predicts this true goal. Learning the
values of states, or of state-action pairs, is the critical step in most RL methods.
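One widely used way to learn state-action values, shown here only as an illustrative sketch (the text does not prescribe a particular method), is a temporal-difference update in the style of Q-learning:

    from collections import defaultdict

    # Tabular action-value estimates; unseen (state, action) pairs default to 0.
    Q = defaultdict(float)

    def td_update(state, action, reward, next_state, next_actions,
                  alpha=0.1, gamma=0.9):
        # Move Q(s, a) toward the observed reward plus the discounted value
        # of the best action available in the next state.
        best_next = max((Q[(next_state, a)] for a in next_actions), default=0.0)
        target = reward + gamma * best_next
        Q[(state, action)] += alpha * (target - Q[(state, action)])

Under such an update, capturing the queen would come to have a high estimated value because the positions it leads to tend, eventually, to yield the reward for winning the game.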
The fourth and final major component of an RL agent is a model of its
environment or external world: something that mimics the behavior of the
environment in some sense. Not every RL agent uses a model of the environment;
methods that never learn or use one are called model-free RL methods.
Model-free methods are very simple and, perhaps surprisingly, are still generally
able to find optimal behavior; model-based methods just find it faster. The most
interesting case is the one in which the agent does not have a perfect model of
the environment a priori, but must use learning methods to align its model with
reality.
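A learned model can be very simple. For instance (again an illustration under the assumption of small, discrete state and action sets), transition probabilities can be estimated from counts of observed transitions:

    from collections import defaultdict

    # Counts of observed transitions: (state, action) -> {next_state: count}.
    transition_counts = defaultdict(lambda: defaultdict(int))

    def record_transition(state, action, next_state):
        # Update the learned model with one observed transition.
        transition_counts[(state, action)][next_state] += 1

    def estimated_transition_probs(state, action):
        # Estimate P(next_state | state, action) from the counts gathered so far.
        counts = transition_counts[(state, action)]
        total = sum(counts.values())
        return {s: c / total for s, c in counts.items()} if total else {}

As more transitions are recorded, these estimates align the model with the true dynamics of the environment.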
The system's environment is characterized by this model, that is, by the
transition function P and the reward function R. Because P and R are unknown to
the agent, it can rely only on the immediate rewards received through trial and
error to choose its policy. The objective is to find a policy that, roughly
speaking, maximizes the total reward received. In the simplest formulation, the
tradeoff between immediate and delayed reward is handled by a discount factor γ
with 0 ≤ γ ≤ 1. The value of following a policy is then the expected discounted
sum of the rewards received.
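Written out explicitly (a standard formulation consistent with the discount factor introduced above, not an equation given verbatim in the text), the value of following a policy \pi from state s is

    V^{\pi}(s) = \mathbb{E}_{\pi}\!\left[\,\sum_{t=0}^{\infty} \gamma^{t}\, r_{t+1} \;\middle|\; s_0 = s \right], \qquad 0 \le \gamma \le 1,

where r_{t+1} is the reward received after the t-th action and \gamma weights immediate rewards more heavily than delayed ones.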