using a soft-max policy, where each node selects an action with a probability given
by the Boltzmann distribution [127]:
$$p_Q(s_c, s_n) = \frac{e^{Q(s_c, s_n)/T}}{\sum_{s \in A(s_c)} e^{Q(s_c, s)/T}}, \qquad (7.5)$$
where T is the temperature that controls the amount of exploration. For high values of T the actions are almost equiprobable. By annealing the algorithm (cooling it down), the policy becomes more and more greedy. We use the following annealing scheme, denoting by θ the annealing factor:
$$T_{k+1} = \theta T_k. \qquad (7.6)$$
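As a concrete illustration, the Python sketch below samples an action from the Boltzmann distribution of (7.5) and cools the temperature according to (7.6). The Q-values, starting temperature, and annealing factor are illustrative placeholders rather than values taken from the text.

```python
import numpy as np

def boltzmann_policy(q_row, temperature):
    """Action probabilities for the current state s_c, following Eq. (7.5)."""
    # Subtracting the maximum keeps the exponentials numerically stable
    # without changing the resulting distribution.
    scaled = (q_row - np.max(q_row)) / temperature
    exp_q = np.exp(scaled)
    return exp_q / exp_q.sum()

def select_action(q_row, temperature, rng):
    """Sample the next setting s_n from the Boltzmann distribution."""
    return rng.choice(len(q_row), p=boltzmann_policy(q_row, temperature))

rng = np.random.default_rng(0)
q_row = np.array([0.2, 0.5, 0.1, 0.4])  # illustrative Q-values for one state s_c
T, theta = 10.0, 0.95                   # illustrative starting temperature and annealing factor

for k in range(100):
    action = select_action(q_row, T, rng)
    # ... apply the chosen setting, observe the reward, update the Q-table ...
    T *= theta                          # Eq. (7.6): the policy becomes increasingly greedy
```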
To further improve network-wide throughput and fairness, we allow transmitters
to tune their power. It has been established that the best response is to always send
at a higher power. This will lead to the Nash equilibrium, where all terminals are
using the maximum power. Hence, we need to give nodes a small incentive to scale
down the power. We do this by introducing a cost for using higher powers:
$$r^{(p)}(s_n, s_c) = \rho(i_n) S(s_n) - \rho(i_c) S(s_c), \qquad (7.7)$$
where i is the power index (i = 0 refers to the lowest power). The reward including the power cost is denoted r^(p). The reward factors are defined as follows:
$$\rho(i) = \rho^{\,i}, \quad i \in [0, n_p - 1], \qquad (7.8)$$
where ρ is an element of (0, 1] and n_p is the number of available transmission powers.
With a high ρ, nodes will scale down their power until they see a drop in throughput. This is somewhat similar to the power control mechanism described in [94]. With a lower ρ, they will even accept a throughput reduction.
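To make the power cost concrete, the hypothetical helper below evaluates the power-aware reward of (7.7) with the reward factors of (7.8). It assumes that S(·) is the measured throughput of a configuration; the values chosen for ρ, n_p, and the throughputs are purely illustrative.

```python
def power_reward(throughput_new, throughput_cur, i_new, i_cur, rho=0.99, n_p=8):
    """Power-aware reward r^(p)(s_n, s_c) of Eq. (7.7) with rho(i) = rho**i (Eq. 7.8).

    i_new, i_cur : power indices in [0, n_p - 1], where 0 is the lowest power
    rho          : reward factor in (0, 1]; it discounts throughput obtained
                   at higher transmission powers
    """
    assert 0 <= i_new < n_p and 0 <= i_cur < n_p
    return (rho ** i_new) * throughput_new - (rho ** i_cur) * throughput_cur


# A node that keeps the same throughput while moving to a lower power index
# receives a positive reward -- the small incentive to scale the power down.
print(power_reward(throughput_new=10.0, throughput_cur=10.0, i_new=2, i_cur=5))
```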
Similar to the heuristic recommendation for starvation-free scenarios (see
Sect. 7.3.4 ), we allow links with a good channel to scale down their power without
dropping their throughput. As a result, interference levels drop for the surrounding links. These links may now be able to send at a higher rate, which improves
network-wide throughput and fairness.
7.3.6 Seeding the Learning Engine with the DT Procedures
For each (combination of) scenario(s), we have defined heuristic recommendations
in Sect. 7.3.4 . For instance, when a node is dealing with asymmetric starvation, it
makes sense to either increase the power or decrease the carrier sense threshold in
order to alleviate this situation. At DT, however, we do not know which of the two actions is better.
Hence, we need to incorporate the heuristic recommendations into the Q-learning mechanism described above. The idea is that the heuristic recommendations are
followed during the exploration phase and that when the temperature cools down,