This is done to limit the maximum number of plants in the colony. Initially, fast reproduction of plants takes place, and all the plants are included in the colony. The fitter plants reproduce more than the less fit ones. The elimination mechanism is activated when the population exceeds a stipulated NP_max. The plants and the produced seeds are ranked together as a colony, and plants with lower fitness values are eliminated to limit the population count to NP_max. This is the selection property of the algorithm. The above steps are repeated until the maximum number of iterations is reached.
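A minimal sketch of this elimination step is given below, assuming each colony member is stored as a (fitness, solution) pair with higher fitness being better; the names colony and np_max are illustrative and not taken from the original text.

```python
def eliminate_weakest(colony, np_max):
    """Competitive exclusion step: keep only the np_max fittest members.

    `colony` is assumed to be a list of (fitness, solution) pairs containing
    both parent plants and their newly produced seeds; higher fitness is better.
    Names and data layout are illustrative, not prescribed by the text.
    """
    if len(colony) <= np_max:
        return colony                      # elimination only triggers above NP_max
    ranked = sorted(colony, key=lambda member: member[0], reverse=True)
    return ranked[:np_max]                 # plants with lower fitness are discarded
```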
8.4 Differential Q-Learning
Let us consider a given agent A. Let S_1, S_2, ..., S_n be the n possible states that agent A can exhibit. Each state admits m possible actions a_1, a_2, ..., a_m. For a particular state-action pair, the specific reward that the agent can achieve is denoted by r(S_i, a_j) and is referred to as the "immediate reward" the agent receives for executing action a_j in state S_i. The basic goal of classical Q-Learning is to choose the next action through a learning policy such that the cumulative reward the agent may acquire during subsequent state transitions from its next state is maximized. The learning policy is realized by updating Q-values at each state-action pair. The higher the Q-value, the higher the probability that a particular action is selected for an agent at a specified state.
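One common way to realize such a selection rule is a softmax (Boltzmann) policy over the Q-values of the current state; the sketch below illustrates that idea under assumed names and is not a policy prescribed by the text.

```python
import numpy as np

def select_action(q_table, state, temperature=1.0):
    """Pick an action for `state` with probability increasing in its Q-value.

    `q_table` is assumed to be an (n_states x m_actions) array; the softmax
    (Boltzmann) rule is one possible realization of the selection policy
    described in the text, with `temperature` controlling exploration.
    """
    q_values = q_table[state]
    preferences = np.exp((q_values - q_values.max()) / temperature)  # numerically stable
    probabilities = preferences / preferences.sum()
    return np.random.choice(len(q_values), p=probabilities)
```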
Let the agent be in state S_i and execute action a_j. Then the Q-value at state S_i due to action a_j is updated by

Q(S_i, a_j) = r(S_i, a_j) + γ max_a Q(δ(S_i, a_j), a)        (8.24)

where 0 < γ < 1 and δ(S_i, a_j) denotes the next state resulting from the selection of action a_j at state S_i. Let the next state selected be S_k.
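A direct transcription of update (8.24) is sketched below, assuming the Q-table is a NumPy array indexed by state and action; the argument next_state stands in for δ(S_i, a_j) and is a hypothetical name.

```python
import numpy as np

def classical_q_update(q_table, state, action, reward, next_state, gamma=0.9):
    """Apply the classical update of Eq. (8.24):
    Q(S_i, a_j) = r(S_i, a_j) + gamma * max_a Q(delta(S_i, a_j), a),
    where `next_state` plays the role of delta(S_i, a_j) and 0 < gamma < 1.
    """
    q_table[state, action] = reward + gamma * np.max(q_table[next_state])
    return q_table
```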
However, several improvements to the classical Q-Learning algorithm have been proposed. Differential Q-learning is a modified version of Q-learning [27, 28]. In this approach, the Q-table update policy retains the effect of the past Q-value of a particular state-action pair while updating the corresponding Q-value. The modified Q update equation is given by
Q(S_i, a_j) = (1 − α) Q(S_i, a_j) + α [ r(S_i, a_j) + γ max_a Q(δ(S_i, a_j), a) ].        (8.25)
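A corresponding sketch of update (8.25) follows; it differs from the classical rule only in blending the old Q-value with the new estimate through the learning rate α. Variable names are assumptions for illustration.

```python
import numpy as np

def differential_q_update(q_table, state, action, reward, next_state,
                          alpha=0.1, gamma=0.9):
    """Apply the modified update of Eq. (8.25):
    Q(S_i, a_j) = (1 - alpha) * Q(S_i, a_j)
                  + alpha * (r(S_i, a_j) + gamma * max_a Q(delta(S_i, a_j), a)).

    The (1 - alpha) term retains the effect of the past Q-value of the
    state-action pair, which is the distinguishing feature of this variant.
    """
    old_value = q_table[state, action]
    target = reward + gamma * np.max(q_table[next_state])
    q_table[state, action] = (1 - alpha) * old_value + alpha * target
    return q_table
```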
In this case, the Q-value Q(S_i, a_j) is incremented when the action a_j leads to a state δ(S_i, a_j) in which there exists an action a such that the best possible Q-value Q(δ(S_i, a_j), a) in the next time step, plus the achieved reward r(S_i, a_j), is greater than the current value of Q(S_i, a_j). α is called the "learning rate". A setting of α = 0 would result in a trivial scenario where no learning behavior is exhibited by the agent, while α = 1 would make the agent extremely greedy in terms of learning behavior, thereby placing emphasis only on the most recent information. The importance of future rewards is determined by the discount factor γ. Smaller values of γ make the agent favor immediate rewards over long-term gains.