Fig. 7.5 In each state the transmitter can decide (if possible) to stay, increase the rate, decrease T_CS, or decrease the rate. When decreasing the rate, T_CS is reset to T_CS[0]. The states where T_CS is smaller than T_CS[i] have been pruned at DT
7.3.5 The Learning Engine
To find the optimal operating point, we propose to use a learning-based approach to refine the heuristic recommendations. The proposed learning scheme is Q-learning, which does not need a model of the environment and can be used online [67].
The Q-learning algorithm continuously estimates the values of state-action pairs.
In the current problem, an action is the transition from the current state (s_c) to a new state (s_n). The value Q(s_c, s_n) is defined as the expected discounted sum of
future payoffs obtained by going to the new state and following an optimal policy
thereafter. Once these values have been learned, the optimal action from any state
is the one with the highest Q-value. The standard procedure for Q-learning is as
follows. All Q-values are initialized to 0. During the exploration of the state space,
the Q-values are updated as follows:
Q(s_c, s_n) ← (1 − α) Q(s_c, s_n) + α [ (1 − γ) r + γ max_{s ∈ A(s_n)} Q(s_n, s) ],   (7.3)
where α is the forget factor and γ is the learning parameter.
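As a concrete illustration, the update in Eq. (7.3) amounts to a few lines of code. The following is a minimal sketch, assuming a dictionary-based Q-table and a helper neighbours(s) that returns A(s); the names q_table, update_q, alpha and gamma, as well as the numeric values, are illustrative and not part of the proposed scheme.

```python
from collections import defaultdict

alpha = 0.3   # forget factor (weight given to the new observation) -- illustrative value
gamma = 0.7   # learning parameter (weight of future value vs. immediate reward) -- illustrative value

q_table = defaultdict(float)   # Q(s_c, s_n), initialized to 0 as in the standard procedure

def update_q(s_c, s_n, reward, neighbours):
    """Apply Eq. (7.3) after observing `reward` for the transition s_c -> s_n.

    `neighbours(s)` is assumed to return A(s), the set of states reachable from s.
    """
    best_next = max((q_table[(s_n, s)] for s in neighbours(s_n)), default=0.0)
    q_table[(s_c, s_n)] = ((1 - alpha) * q_table[(s_c, s_n)]
                           + alpha * ((1 - gamma) * reward + gamma * best_next))
```

Each time the transmitter tries a transition and measures the resulting throughput, it would call update_q with the reward r defined below in Eq. (7.4).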
To implement Q-learning, we first need to define the actions within the state space. For simplicity, we first discuss the actions without power flexibility. The possible actions for an example with 3 available rates are shown in Fig. 7.5. In each state the transmitter can decide (if possible) to keep the current configuration, to increase the rate, to decrease the carrier sense threshold or to decrease the rate. When decreasing the rate, the transmitter is forced to reset its T_CS to T_CS[0]. The states where T_CS is smaller than T_CS[i] have been pruned, as explained in Sect. 7.3.2.3. The reward r is defined as the throughput increase obtained by going to state s_n:

r = S(s_n) − S(s_c),   (7.4)

where S(s) is the throughput seen in state s.
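To make the state space concrete, the sketch below encodes a state as a pair (rate index, T_CS index) and enumerates the moves of Fig. 7.5 together with the reward of Eq. (7.4). The encoding, the constants NUM_RATES and NUM_TCS, and the helper measure_throughput are assumptions made for illustration; the pruning of Sect. 7.3.2.3 is not reproduced here.

```python
NUM_RATES = 3   # example with 3 available rates (as in Fig. 7.5)
NUM_TCS = 4     # number of carrier sense threshold levels T_CS[0..3] -- illustrative

def neighbours(state):
    """Return A(s): the states reachable from `state`, following Fig. 7.5."""
    rate, tcs = state
    moves = [state]                       # stay in the current configuration
    if rate + 1 < NUM_RATES:
        moves.append((rate + 1, tcs))     # increase the rate
    if tcs + 1 < NUM_TCS:
        moves.append((rate, tcs + 1))     # decrease T_CS by one level (indexing direction is an assumption)
    if rate > 0:
        moves.append((rate - 1, 0))       # decrease the rate and reset T_CS to T_CS[0]
    return moves

def reward(s_c, s_n, measure_throughput):
    """Eq. (7.4): throughput gained by moving from s_c to s_n.

    `measure_throughput(s)` stands for S(s), the throughput seen in state s.
    """
    return measure_throughput(s_n) - measure_throughput(s_c)
```

Together with update_q above, this is enough to drive the learning loop: pick a candidate transition from neighbours(s_c), measure the throughput in the new state, compute the reward, and update the corresponding Q-value.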
The Q-learning algorithm updates the estimate for each action, but in fact does
not specify what actions should be taken. It allows arbitrary experimentation while
at the same time preserving the current best estimate of the states' values. For now,
we will use the following instantiation of simulated annealing. The nodes explore