This is done to limit the maximum number of plants in the colony. Initially, fast reproduction of plants takes place, and all the plants are included in the colony. The fitter plants reproduce more than the less fit ones. The elimination mechanism is activated when the population exceeds a stipulated NP_max. The plants and the produced seeds are ranked together as a colony, and plants with lower fitness values are eliminated to limit the population count to NP_max. This is the selection property of the algorithm. The above steps are repeated until the maximum number of iterations is reached.
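A minimal sketch of this elimination step is given below, assuming each colony member is stored as a (fitness, solution) pair with higher fitness being better; the names colony and np_max are illustrative and not taken from the original text.

```python
def eliminate_weakest(colony, np_max):
    """Competitive exclusion step: keep only the np_max fittest members.

    `colony` is assumed to be a list of (fitness, solution) pairs containing
    both parent plants and their newly produced seeds; higher fitness is better.
    Names and data layout are illustrative, not prescribed by the text.
    """
    if len(colony) <= np_max:
        return colony                      # elimination only triggers above NP_max
    ranked = sorted(colony, key=lambda member: member[0], reverse=True)
    return ranked[:np_max]                 # plants with lower fitness are discarded
```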
8.4 Differential Q-Learning
Let us consider a given agent A. Let S_1, S_2, ..., S_n be the n possible states that agent A can exhibit. Each state admits m possible actions a_1, a_2, ..., a_m. For a particular state-action pair, the specific reward that the agent can achieve is denoted by r(S_i, a_j) and is referred to as the "immediate reward" the agent receives for executing action a_j in state S_i. The basic goal of classical Q-Learning is to choose the next action through a learning policy such that the cumulative reward the agent may acquire during subsequent state transitions from its next state is maximized. The learning policy is realized by updating Q-values at each state-action pair. The higher the Q-value, the higher the probability that a particular action is selected for an agent at a specified state.
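One common way to realize such a selection rule is a softmax (Boltzmann) policy over the Q-values of the current state; the sketch below illustrates that idea under assumed names and is not a policy prescribed by the text.

```python
import numpy as np

def select_action(q_table, state, temperature=1.0):
    """Pick an action for `state` with probability increasing in its Q-value.

    `q_table` is assumed to be an (n_states x m_actions) array; the softmax
    (Boltzmann) rule is one possible realization of the selection policy
    described in the text, with `temperature` controlling exploration.
    """
    q_values = q_table[state]
    preferences = np.exp((q_values - q_values.max()) / temperature)  # numerically stable
    probabilities = preferences / preferences.sum()
    return np.random.choice(len(q_values), p=probabilities)
```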
Let the agent be in state S_i and execute action a_j. Then the Q-value at state S_i due to action a_j is updated by

Q(S_i, a_j) = r(S_i, a_j) + γ max_a Q(δ(S_i, a_j), a)        (8.24)

where 0 < γ < 1 and δ(S_i, a_j) denotes the next state resulting from the selection of action a_j at state S_i. Let the next state selected be S_k.
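A direct transcription of update (8.24) is sketched below, assuming the Q-table is a NumPy array indexed by state and action; the argument next_state stands in for δ(S_i, a_j) and is a hypothetical name.

```python
import numpy as np

def classical_q_update(q_table, state, action, reward, next_state, gamma=0.9):
    """Apply the classical update of Eq. (8.24):
    Q(S_i, a_j) = r(S_i, a_j) + gamma * max_a Q(delta(S_i, a_j), a),
    where `next_state` plays the role of delta(S_i, a_j) and 0 < gamma < 1.
    """
    q_table[state, action] = reward + gamma * np.max(q_table[next_state])
    return q_table
```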
However, several improvements to the classical Q-Learning algorithm have been proposed. Differential Q-learning is a modified version of Q-learning [27, 28]. In this approach, the Q-table update policy retains the effect of the past Q-value of a particular state-action pair while updating the corresponding Q-value. The modified Q update equation is given by
Q(S_i, a_j) = (1 − α) Q(S_i, a_j) + α [ r(S_i, a_j) + γ max_a Q(δ(S_i, a_j), a) ].        (8.25)
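A corresponding sketch of update (8.25) follows; it differs from the classical rule only in blending the old Q-value with the new estimate through the learning rate α. Variable names are assumptions for illustration.

```python
import numpy as np

def differential_q_update(q_table, state, action, reward, next_state,
                          alpha=0.1, gamma=0.9):
    """Apply the modified update of Eq. (8.25):
    Q(S_i, a_j) = (1 - alpha) * Q(S_i, a_j)
                  + alpha * (r(S_i, a_j) + gamma * max_a Q(delta(S_i, a_j), a)).

    The (1 - alpha) term retains the effect of the past Q-value of the
    state-action pair, which is the distinguishing feature of this variant.
    """
    old_value = q_table[state, action]
    target = reward + gamma * np.max(q_table[next_state])
    q_table[state, action] = (1 - alpha) * old_value + alpha * target
    return q_table
```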
In this case, the Q-value Q(S_i, a_j) is incremented when the action a_j leads to a state δ(S_i, a_j) in which there exists an action a such that the best possible Q-value Q(δ(S_i, a_j), a) in the next time step, plus the achieved reward r(S_i, a_j), is greater than the current value of Q(S_i, a_j). α is called the "learning rate". A setting of α = 0 would result in a trivial scenario where no learning behavior is exhibited by the agent, while α = 1 would make the agent extremely greedy in terms of learning behavior, thereby placing emphasis only on the most recent information. The importance of future rewards is determined by the discount factor γ. Smaller values of γ make the agent favor immediate rewards over long-term gains.