the Watkins Q-learning off-policy algorithm, and WatkinsQLambdaAgent for the Watkins Q(λ) off-policy algorithm. Again, the names of all parameters are consistent with [SB98].
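As a brief reminder (not part of the original listing), the one-step Q-learning backup that both agents build on is, in the notation of [SB98]:

Q(s_t, a_t) ← Q(s_t, a_t) + α [ r_{t+1} + γ max_{a'} Q(s_{t+1}, a') − Q(s_t, a_t) ]

Watkins's Q(λ) extends this with eligibility traces decaying by λ per step, where the traces are cut back to zero whenever an exploratory (non-greedy) action is taken; see [SB98] for the full derivation.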
Example 12.22
We give a modified GridWorld example representing an episodic task (in contrast to our previous GridWorld, which was a continuing task). This new GridWorld has a terminal state after which the episode terminates. Here the reward is −1 for all transitions; thus, we want to reach the terminal state as fast as possible. As in the previous example, we omit the implementation of the GridEnvironment and focus on the solution process; an illustrative sketch of its dynamics follows below.
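Since the GridEnvironment implementation is omitted here, the following is only a minimal sketch of its core dynamics: a grid with one terminal cell, a reward of −1 per transition, and a reset at the start of each episode. All class and method names in the sketch (SketchGridEnvironment, step, getReward, isTerminalState, reset) are illustrative assumptions and do not reflect the library's actual Environment interface.

// Illustrative sketch only; names are assumptions, not the library's API.
public class SketchGridEnvironment {
    private final int width = 5, height = 5;
    private final int goalX = 4, goalY = 4;   // terminal cell (assumed position)
    private int x = 0, y = 0;                 // current agent position

    // Apply an action (0=up, 1=down, 2=left, 3=right), staying inside the grid.
    public void step(int action) {
        switch (action) {
            case 0: y = Math.min(y + 1, height - 1); break;
            case 1: y = Math.max(y - 1, 0); break;
            case 2: x = Math.max(x - 1, 0); break;
            case 3: x = Math.min(x + 1, width - 1); break;
        }
    }

    // Reward is -1 for every transition, so shorter episodes are better.
    public double getReward() { return -1.0; }

    // The episode terminates when the goal cell is reached.
    public boolean isTerminalState() { return x == goalX && y == goalY; }

    // Return to the start state at the beginning of each episode.
    public void reset() { x = 0; y = 0; }
}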
// Create agent settings:
TDAgentSettings agentSettings = new TDAgentSettings();
agentSettings.setInputDataSpecification(metaData);
agentSettings.setGamma(1.0);   // discount rate gamma [SB98]
agentSettings.setAlpha(0.01);  // step-size parameter alpha [SB98]
agentSettings.setLambda(0.9);  // trace-decay parameter lambda [SB98]
agentSettings.verifySettings();
// Get default agent specification from 'agents.xml':
AgentSpecification agentSpecification =
    AgentSpecification.getAgentSpecification("SarsaLambdaAgent");
// Create algorithm object with default values:
RLAgent agent = (RLAgent) agentSpecification.createAgentInstance();
// Put it all together:
agent.setAgentSettings(agentSettings);
agent.verify();
// Create environment:
Environment env = new GridEnvironment();
// Create and init simulation object:
Simulation sim = new Simulation(agent, env);
sim.init(null); // assigns environment to agent
// Run simulation:
int numTrials = 10000;
int maxStepsPerTrial = 100;
sim.setTrialDevisor(1000);
sim.trials(numTrials, maxStepsPerTrial);
System.out.println("total time [s]: " + sim.getTimeSpentToRunTrials());
■
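To run the Watkins Q(λ) off-policy variant introduced above instead of Sarsa(λ), it should suffice to request the corresponding agent specification; the rest of the setup stays the same. The specification name below is taken from the agent class name mentioned earlier and is assumed to match the entry in 'agents.xml':

AgentSpecification agentSpecification =
    AgentSpecification.getAgentSpecification("WatkinsQLambdaAgent");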