where Q_s ∈ [0, 1] is the quality of slot s within the frame of agent i. Intuitively, a
high Q_s value indicates that it is beneficial for agent i to stay awake during slot s. This
quality value is updated using the previous Q-value (Q_s) for that slot, the learning rate
α ∈ [0, 1], and the newly obtained reward r_{s,e} for the event e that (just) occurred
in slot s. Thus, nodes will update as many Q-values as there are events during their active
period. In other words, agent i will update the value Q_s for each slot s where an event
e occurred. This update scheme differs from that of traditional Q-learning [16],
where only the Q-value of the selected action is updated. The motivation behind this
update scheme is presented in subsection 2.5. In addition, we set the future discount
parameter γ to 0 here, since our agents are stateless (or single-state).
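The described update can be read as a standard stateless Q-learning rule with γ = 0, i.e. Q_s ← (1 − α)·Q_s + α·r_{s,e}. The following is a minimal sketch under that assumption; the exact form of the reward r_{s,e} and all function and variable names are illustrative, not the paper's code:

```python
# Sketch of the per-slot update (assumed form, gamma = 0): each slot s in which
# an event e occurred during the active period has its quality moved towards the
# reward r_{s,e} obtained for that event.
def update_slot(Q, s, reward, alpha=0.1):
    # Q is a list of S slot qualities in [0, 1]; alpha is the learning rate.
    Q[s] = (1 - alpha) * Q[s] + alpha * reward

# Applied once per observed event, e.g. (hypothetical event list):
# for s, reward in events_in_active_period:
#     update_slot(Q, s, reward)
```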
Nodes will stay awake for those consecutive time slots that have the highest sum of
Q-values. Put differently, each agent selects the action a s (i.e., wake up at slot s )that
maximizes the sum of the Q-values for the D consecutive time slots, where D is the
duty cycle, fixed by the user. Formally, agent i will wake up at slot s ,where
s = argmax_{s ∈ S} Σ_{j=0}^{D−1} Q_{s+j}
For example, if the required duty cycle of the nodes is set to 10% ( D =10 for a frame
of S = 100 slots), each node will stay active for those 10 consecutive slots within its
frame that have the highest sum of Q-values. Conversely, for all other slots the agent will
remain asleep, since its Q-values indicate that it is less beneficial to stay active during
that time. Nodes will update the Q-value of each slot for which an event occurs within
their duty cycle. Thus, when forwarding messages to the sink, nodes gradually acquire
sufficient information on “slot quality” to determine the best period within the frame
to stay awake. This behavior makes neighboring nodes (de)synchronize their actions,
resulting in faster message delivery and thus lower end-to-end latency.
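As a concrete reading of the selection rule (a sketch, not the authors' implementation), the agent can scan all S candidate wake-up slots and pick the one whose window of D consecutive Q-values has the largest sum; wrapping the window around the frame boundary is an assumption about a detail the excerpt does not specify:

```python
# Sketch: pick the wake-up slot s maximising the sum of Q-values over the next
# D consecutive slots (the frame is assumed to be circular).
def best_wakeup_slot(Q, D):
    S = len(Q)
    return max(range(S), key=lambda s: sum(Q[(s + j) % S] for j in range(D)))

# Example matching the text: a 10% duty cycle with S = 100 slots and D = 10.
# s_star = best_wakeup_slot(Q, 10)   # node stays awake for slots s_star .. s_star + 9
```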
2.5 Exploration
As explained in the above two subsections, active time slots are updated individually,
regardless of when the node wakes up. The reason for this choice is threefold. Firstly,
this allows each slot to be explored and updated more frequently. For example, slot s
will be updated when the node wakes up anywhere between slots s − D + 1 and s,
i.e. in D out of S possible actions. Secondly, updating individual Q-values makes it possible
to alter the duty cycle of nodes at run time (as some preliminary results, not presented
in this paper, suggest) without invalidating the Q-values of slots. In contrast, if a
Q-value were computed for each start slot s, i.e. if the reward were accumulated over the
wake duration and stored at slot s only, then changing the duty cycle at run time would
render the computed Q-values useless, since the reward would have been accumulated over a
different duration. In addition, slot s would be updated only when the agent wakes up at
that slot, so a separate exploration strategy would be required to ensure that this action is explored
sufficiently. Thirdly, our exploration scheme will continuously explore and update not
only the wake-up slot, but all slots within the awake period. Treating slots individually
results in an implicit exploration scheme that requires no additional tuning.
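To make the "D out of S" count explicit, the small sketch below (with a hypothetical helper name, again assuming a circular frame) lists the wake-up choices whose awake window covers a given slot s and therefore leads to an update of that slot:

```python
# Sketch: with individual slot updates, slot s is updated whenever the node wakes
# up at any of the D slots s - D + 1, ..., s (frame assumed circular).
def covering_wakeup_slots(s, D, S):
    return sorted((s - j) % S for j in range(D))

# Example: S = 100, D = 10, s = 42 -> wake-up slots 33..42 all cover slot 42,
# i.e. 10 of the 100 possible wake-up actions lead to an update of Q[42].
```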
 