where Q_s ∈ [0, 1] is the quality of slot s within the frame of agent i. Intuitively, a
high Q_s value indicates that it is beneficial for agent i to stay awake during slot s. This
quality value is updated using the previous Q-value (Q_s) for that slot, the learning rate
α ∈ [0, 1], and the newly obtained reward r_{s,e} for the event e that (just) occurred
in slot s. Thus, nodes will update as many Q-values as there are events during their active
period. In other words, agent i will update the value Q_s for each slot s where an event
e occurred. This update scheme differs from that of traditional Q-learning [16],
where only the Q-value of the selected action is updated. The motivation behind this
update scheme is presented in subsection 2.5. In addition, we set the future discount
parameter γ to 0 here, since our agents are stateless (or single-state).
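The described update can be read as a standard stateless Q-learning rule with γ = 0, i.e. Q_s ← (1 − α)·Q_s + α·r_{s,e}. The following is a minimal sketch under that assumption; the exact form of the reward r_{s,e} and all function and variable names are illustrative, not the paper's code:

```python
# Sketch of the per-slot update (assumed form, gamma = 0): each slot s in which
# an event e occurred during the active period has its quality moved towards the
# reward r_{s,e} obtained for that event.
def update_slot(Q, s, reward, alpha=0.1):
    # Q is a list of S slot qualities in [0, 1]; alpha is the learning rate.
    Q[s] = (1 - alpha) * Q[s] + alpha * reward

# Applied once per observed event, e.g. (hypothetical event list):
# for s, reward in events_in_active_period:
#     update_slot(Q, s, reward)
```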
Nodes will stay awake for those consecutive time slots that have the highest sum of
Q-values. Put differently, each agent selects the action a s (i.e., wake up at slot s )that
maximizes the sum of the Q-values for the D consecutive time slots, where D is the
duty cycle, fixed by the user. Formally, agent i will wake up at slot s ,where
s = argmax_{s ∈ S} Σ_{j=0}^{D−1} Q_{s+j}
For example, if the required duty cycle of the nodes is set to 10% ( D =10 for a frame
of S = 100 slots), each node will stay active for those 10 consecutive slots within its
frame that have the highest sum of Q-values. Conversely, for all other slots the agent will
remain asleep, since its Q-values indicate that it is less beneficial to stay active during
that time. Nodes will update the Q-value of each slot for which an event occurs within
their duty cycle. Thus, when forwarding messages to the sink, nodes gradually acquire
sufficient information on “slot quality” to determine the best period within the frame
to stay awake. This behavior makes neighboring nodes (de)synchronize their actions,
resulting in faster message delivery and thus lower end-to-end latency.
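As a concrete reading of the selection rule (a sketch, not the authors' implementation), the agent can scan all S candidate wake-up slots and pick the one whose window of D consecutive Q-values has the largest sum; wrapping the window around the frame boundary is an assumption about a detail the excerpt does not specify:

```python
# Sketch: pick the wake-up slot s maximising the sum of Q-values over the next
# D consecutive slots (the frame is assumed to be circular).
def best_wakeup_slot(Q, D):
    S = len(Q)
    return max(range(S), key=lambda s: sum(Q[(s + j) % S] for j in range(D)))

# Example matching the text: a 10% duty cycle with S = 100 slots and D = 10.
# s_star = best_wakeup_slot(Q, 10)   # node stays awake for slots s_star .. s_star + 9
```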
2.5 Exploration
As explained in the above two subsections, active time slots are updated individually,
regardless of when the node wakes up. The reason for this choice is threefold. Firstly,
this allows each slot to be explored and updated more frequently. For example, slot s
will be updated when the node wakes up anywhere between slots s − D + 1 and s,
i.e. in D out of S possible actions. Secondly, updating individual Q-values makes it possible
to alter the duty cycle of nodes at run time (as some preliminary results, not presented
in this paper, suggest) without invalidating the Q-values of slots. In contrast, if a
Q-value were computed for each start slot s, i.e. if the reward were accumulated over the
wake duration and stored at slot s only, then changing the duty cycle at run time would
render the computed Q-values useless, since the reward would have been accumulated over a
different duration. In addition, slot s would be updated only when the agent wakes up at
that slot, so a separate exploration strategy would be required to ensure that this action is explored
sufficiently. Thirdly, our exploration scheme will continuously explore and update not
only the wake-up slot, but all slots within the awake period. Treating slots individually
results in an implicit exploration scheme that requires no additional tuning.
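To make the "D out of S" count explicit, the small sketch below (with a hypothetical helper name, again assuming a circular frame) lists the wake-up choices whose awake window covers a given slot s and therefore leads to an update of that slot:

```python
# Sketch: with individual slot updates, slot s is updated whenever the node wakes
# up at any of the D slots s - D + 1, ..., s (frame assumed circular).
def covering_wakeup_slots(s, D, S):
    return sorted((s - j) % S for j in range(D))

# Example: S = 100, D = 10, s = 42 -> wake-up slots 33..42 all cover slot 42,
# i.e. 10 of the 100 possible wake-up actions lead to an update of Q[42].
```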
 