4.2 Action State
Let A be the set of actions, with A = R, where R is the set of resources in the workflow model. Let a(s) = candidates(s.t) be the set of optional actions in state s, where s.t denotes the task t being allocated in state s.
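As a rough illustration of this definition, the minimal Python sketch below uses hypothetical Task and State containers (the field names are illustrative, not taken from the paper) and returns the candidate resources for the task being allocated in a state:

```python
from dataclasses import dataclass, field
from typing import FrozenSet

@dataclass(frozen=True)
class Task:
    name: str
    candidates: FrozenSet[str] = field(default_factory=frozenset)  # resources qualified for this task

@dataclass(frozen=True)
class State:
    t: Task  # s.t: the task being allocated in this state

def actions(s: State) -> FrozenSet[str]:
    """a(s) = candidates(s.t): the resources that may take task s.t."""
    return s.t.candidates
```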
4.3 Reward Function
In this paper, the objective is to minimize the flow time of the business process cases, so the reward function is defined as:
$$ r = \begin{cases} \dfrac{1}{\sigma.\mathit{fl}}, & (t \text{ is None}) \\[4pt] 0, & (\text{else}) \end{cases} \qquad (3) $$
where t is the task to be allocated, σ is the current determinative case, and σ.fl is the flow time of that case. The shorter the flow time σ.fl, the higher the immediate payoff r. The agent receives the reward when the last work item of the case is completed; otherwise the reward is zero.
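As a rough illustration of Eq. (3), the following sketch computes the immediate reward for one allocation step; the Case class and its fl field are hypothetical names used here only for illustration:

```python
from typing import Optional

class Case:
    """Hypothetical container for a running process case (sigma)."""
    def __init__(self, flow_time: float):
        self.fl = flow_time  # sigma.fl: flow time of the case

def reward(t: Optional[str], sigma: Case) -> float:
    """Eq. (3): pay 1/sigma.fl once the case finishes (no task left), else 0."""
    if t is None:                # last work item completed, no next task to allocate
        return 1.0 / sigma.fl    # shorter flow time -> higher immediate payoff
    return 0.0
```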
5 Q-learning Algorithm for MDPs
Q-learning is well suited to solving problems modeled as MDPs [15]. In this section, the proposed Q-learning algorithm is described in detail. In order to verify the influence of the previous resources on the candidate resources, the Q-learning algorithm without SR is used as the baseline experiment. The first part describes this baseline, and the second part introduces the Q-learning algorithm with SR.
5.1 The Q-learning Algorithm without SR
In order to realize the Q-learning algorithm, the evaluation function Q(s, a) and the estimation function Q̂(s, a) are defined [15]. According to the MDP, the evaluation function Q(s, a) can be written as:
$$ Q(s, a) \equiv r(s, a) + \gamma \max_{a'} Q(\delta(s, a), a') \qquad (4) $$
We can use Q̂ to approximate the Q function; the agent then updates Q̂(s, a) according to the following rule:
$$ \hat{Q}(s, a) \leftarrow r + \gamma \max_{a'} \hat{Q}(s', a') \qquad (5) $$
For the situation of nondeterministic rewards and actions, formula (5) is replaced by the following formula:
$$ \hat{Q}_n(s, a) \leftarrow (1 - \alpha_n)\, \hat{Q}_{n-1}(s, a) + \alpha_n \left[ r + \gamma \max_{a'} \hat{Q}_{n-1}(s', a') \right] \qquad (6) $$

where

$$ \alpha_n = \frac{1}{1 + \mathit{visits}_n(s, a)} $$
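The update in Eq. (6) can be sketched as follows, assuming a tabular Q̂ keyed by (state, action) pairs; the dictionary-based table, the discount value, and the function name are illustrative assumptions rather than the paper's implementation:

```python
from collections import defaultdict

# Tabular estimates and visit counters, keyed by (state, action) pairs.
Q = defaultdict(float)       # Q̂(s, a), initialised to 0
visits = defaultdict(int)    # visits_n(s, a): times the pair (s, a) has been visited
GAMMA = 0.9                  # discount factor gamma (illustrative value)

def q_update(s, a, r, s_next, actions_next):
    """One application of Eq. (6) with alpha_n = 1 / (1 + visits_n(s, a))."""
    visits[(s, a)] += 1
    alpha = 1.0 / (1 + visits[(s, a)])
    # max over the candidate actions (resources) available in the next state;
    # 0 if the case is finished and no further allocation is needed
    best_next = max((Q[(s_next, a2)] for a2 in actions_next), default=0.0)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * (r + GAMMA * best_next)
```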