The policy is the decision-making function of the agent: it specifies what
action the agent takes in each of the situations it might encounter. In psychology,
this would correspond to a set of stimulus-response rules or associations. The
policy is the core of a reinforcement learning agent, as suggested by Figure 10.2,
because it alone is sufficient to define a full, behaving agent; the other components
serve only to change and improve the policy. The policy is thus the ultimate
determinant of behavior and performance. In general it may be stochastic,
specifying a probability distribution over actions rather than a single action.
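As a concrete illustration (my own sketch, not an example from the text), a stochastic policy over a small, discrete set of states and actions can be represented as a table mapping each state to a probability distribution over actions, from which an action is sampled:

    import random

    # Hypothetical states and actions, used only for illustration.
    policy = {
        "s0": {"left": 0.8, "right": 0.2},
        "s1": {"left": 0.5, "right": 0.5},
    }

    def select_action(state):
        # Sample an action according to the policy's distribution for this state.
        probs = policy[state]
        return random.choices(list(probs), weights=list(probs.values()), k=1)[0]

A deterministic policy is the special case in which one action per state has probability 1.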
The reward function defines the goal of the RL agent. The agent's objective is
to maximize the reward it receives over the long run, so the reward function
defines which events are good and which are bad for the agent. Rewards are the
immediate and defining features of the problem the agent faces; as such, the
reward function must necessarily be fixed (the agent cannot alter it). It may,
however, be used as a basis for changing the policy. For example, if an action
selected by the policy is followed by low reward, then the policy may be changed
to select some other action in that situation in the future, as the sketch below
illustrates.
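The following Python sketch is an illustration under assumed conventions, not a method given in the text: it nudges a tabular stochastic policy, of the kind sketched above, away from an action whenever that action was followed by a reward below some threshold.

    def adjust_policy(policy, state, action, reward, threshold=0.0, step=0.1):
        # If the observed reward is low, reduce the probability of the action
        # just taken in this state and renormalise the remaining probabilities.
        if reward < threshold:
            probs = policy[state]
            probs[action] = max(probs[action] - step, 0.01)
            total = sum(probs.values())
            for a in probs:
                probs[a] /= total

The threshold and step size here are arbitrary placeholders; any scheme that shifts probability toward better-rewarded actions expresses the same idea.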
Whereas the reward indicates what is good in an immediate sense, the value
function specifies what is good in the long run, because it predicts the reward
that can be accumulated from a state onward. The distinction between value and
reward is critical to RL. For example, when playing chess, checkmating your
opponent is associated with high reward, whereas capturing the opponent's queen
is associated with high value. The former defines the true goal of the task,
winning the game, whereas the latter merely predicts this true goal. Learning the
values of states, or of state-action pairs, is the critical step in most RL methods.
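One widely used way to learn state-action values, shown here only as an illustrative sketch (the text does not prescribe a particular method), is a temporal-difference update in the style of Q-learning:

    from collections import defaultdict

    # Tabular action-value estimates; unseen (state, action) pairs default to 0.
    Q = defaultdict(float)

    def td_update(state, action, reward, next_state, next_actions,
                  alpha=0.1, gamma=0.9):
        # Move Q(s, a) toward the observed reward plus the discounted value
        # of the best action available in the next state.
        best_next = max((Q[(next_state, a)] for a in next_actions), default=0.0)
        target = reward + gamma * best_next
        Q[(state, action)] += alpha * (target - Q[(state, action)])

Under such an update, capturing the queen would come to have a high estimated value because the positions it leads to tend, eventually, to yield the reward for winning the game.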
The fourth and final major component of an RL agent is a model of its
environment or external world: something that mimics the behavior of the
environment in some sense. Not every RL agent uses a model of the environment;
methods that never learn or use one are called model-free RL methods.
Model-free methods are very simple and, perhaps surprisingly, are still generally
able to find optimal behavior; model-based methods just find it faster. The most
interesting case is the one in which the agent does not have a perfect model of
the environment a priori, but must use learning methods to align its model with
reality.
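A learned model can be very simple. For instance (again an illustration under the assumption of small, discrete state and action sets), transition probabilities can be estimated from counts of observed transitions:

    from collections import defaultdict

    # Counts of observed transitions: (state, action) -> {next_state: count}.
    transition_counts = defaultdict(lambda: defaultdict(int))

    def record_transition(state, action, next_state):
        # Update the learned model with one observed transition.
        transition_counts[(state, action)][next_state] += 1

    def estimated_transition_probs(state, action):
        # Estimate P(next_state | state, action) from the counts gathered so far.
        counts = transition_counts[(state, action)]
        total = sum(counts.values())
        return {s: c / total for s, c in counts.items()} if total else {}

As more transitions are recorded, these estimates align the model with the true dynamics of the environment.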
The system's environment is characterized by this model, that is, by the
transition function P and the reward function R. Because P and R are unknown to
the agent, it can rely only on the immediate rewards received through trial and
error to choose its policy. The objective is to find a policy that, roughly
speaking, maximizes the total reward received. In the simplest formulation, the
tradeoff between immediate and delayed reward is handled by a discount factor γ
with 0 ≤ γ ≤ 1. The value of following a policy is then the expected discounted
sum of the rewards received.
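Written out explicitly (a standard formulation consistent with the discount factor introduced above, not an equation given verbatim in the text), the value of following a policy \pi from state s is

    V^{\pi}(s) = \mathbb{E}_{\pi}\!\left[\,\sum_{t=0}^{\infty} \gamma^{t}\, r_{t+1} \;\middle|\; s_0 = s \right], \qquad 0 \le \gamma \le 1,

where r_{t+1} is the reward received after the t-th action and \gamma weights immediate rewards more heavily than delayed ones.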