Fig. 7.5 In each state the transmitter can decide (if possible) to stay, increase the rate, decrease T_CS, or decrease the rate. When decreasing the rate, T_CS is reset to T_CS[0]. The states where T_CS is smaller than T_CS[i] have been pruned at DT
7.3.5 The Learning Engine
To find the optimal operating point, we propose to use a learning-based approach to refine the heuristic recommendations. The proposed learning scheme is Q-learning, which does not need a model of the environment and can be used online [67].
The Q-learning algorithm continuously estimates the values of state-action pairs.
In the current problem, an action is the transition from the current state (s_c) to a new state (s_n). The value Q(s_c, s_n) is defined as the expected discounted sum of
future payoffs obtained by going to the new state and following an optimal policy
thereafter. Once these values have been learned, the optimal action from any state
is the one with the highest Q-value. The standard procedure for Q-learning is as
follows. All Q-values are initialized to 0. During the exploration of the state space,
the Q-values are updated as follows:
Q(s_c, s_n) ← (1 − α) Q(s_c, s_n) + α [ (1 − γ) r + γ max_{s ∈ A(s_n)} Q(s_n, s) ],   (7.3)
where α is the forget factor and γ is the learning parameter.
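As a concrete illustration, the update in Eq. (7.3) amounts to a few lines of code. The following is a minimal sketch, assuming a dictionary-based Q-table and a helper neighbours(s) that returns A(s); the names q_table, update_q, alpha and gamma, as well as the numeric values, are illustrative and not part of the proposed scheme.

```python
from collections import defaultdict

alpha = 0.3   # forget factor (weight given to the new observation) -- illustrative value
gamma = 0.7   # learning parameter (weight of future value vs. immediate reward) -- illustrative value

q_table = defaultdict(float)   # Q(s_c, s_n), initialized to 0 as in the standard procedure

def update_q(s_c, s_n, reward, neighbours):
    """Apply Eq. (7.3) after observing `reward` for the transition s_c -> s_n.

    `neighbours(s)` is assumed to return A(s), the set of states reachable from s.
    """
    best_next = max((q_table[(s_n, s)] for s in neighbours(s_n)), default=0.0)
    q_table[(s_c, s_n)] = ((1 - alpha) * q_table[(s_c, s_n)]
                           + alpha * ((1 - gamma) * reward + gamma * best_next))
```

Each time the transmitter tries a transition and measures the resulting throughput, it would call update_q with the reward r defined below in Eq. (7.4).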
To implement Q-learning, we first need to define the actions within the state space. For simplicity, we first discuss the actions without power flexibility. The possible actions for an example with 3 available rates are shown in Fig. 7.5. In each state the transmitter can decide (if possible) to keep the current configuration, to increase the rate, to decrease the carrier sense threshold or to decrease the rate. When decreasing the rate, the transmitter is forced to reset its T_CS to T_CS[0]. The states where T_CS is smaller than T_CS[i] have been pruned, as explained in Sect. 7.3.2.3. The reward r is defined as the throughput increase obtained by going to state s_n:

r = S(s_n) − S(s_c),   (7.4)

where S(s) is the throughput seen in state s.
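To make the state space concrete, the sketch below encodes a state as a pair (rate index, T_CS index) and enumerates the moves of Fig. 7.5 together with the reward of Eq. (7.4). The encoding, the constants NUM_RATES and NUM_TCS, and the helper measure_throughput are assumptions made for illustration; the pruning of Sect. 7.3.2.3 is not reproduced here.

```python
NUM_RATES = 3   # example with 3 available rates (as in Fig. 7.5)
NUM_TCS = 4     # number of carrier sense threshold levels T_CS[0..3] -- illustrative

def neighbours(state):
    """Return A(s): the states reachable from `state`, following Fig. 7.5."""
    rate, tcs = state
    moves = [state]                       # stay in the current configuration
    if rate + 1 < NUM_RATES:
        moves.append((rate + 1, tcs))     # increase the rate
    if tcs + 1 < NUM_TCS:
        moves.append((rate, tcs + 1))     # decrease T_CS by one level (indexing direction is an assumption)
    if rate > 0:
        moves.append((rate - 1, 0))       # decrease the rate and reset T_CS to T_CS[0]
    return moves

def reward(s_c, s_n, measure_throughput):
    """Eq. (7.4): throughput gained by moving from s_c to s_n.

    `measure_throughput(s)` stands for S(s), the throughput seen in state s.
    """
    return measure_throughput(s_n) - measure_throughput(s_c)
```

Together with update_q above, this is enough to drive the learning loop: pick a candidate transition from neighbours(s_c), measure the throughput in the new state, compute the reward, and update the corresponding Q-value.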
The Q-learning algorithm updates the estimate for each action, but in fact does
not specify what actions should be taken. It allows arbitrary experimentation while
at the same time preserving the current best estimate of the states' values. For now,
we will use the following instantiation of simulated annealing. The nodes explore