the Q-values of slots are updated sufficiently to make the policy of the agents
converge. We measured that after 80 iterations (or frames), on average, the actions of
the agents no longer change and thus the state of (de)synchronicity has been reached.
In other words, after 800 seconds each node finds the wake-up schedule that improves
message throughput and minimizes communication interference. For a static WSN, this duration
is sufficiently small compared to the lifetime of the system, which is on the order of
several days up to a couple of years depending on the duty cycle and the hardware
characteristics [4]. However, it is still unclear under which conditions convergence
can be formally guaranteed. Further research is therefore required to better characterize
the convergence criteria.
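To make the learning mechanism concrete, the following Python sketch illustrates the kind of per-slot update and active-period selection discussed in this section. It is an illustration rather than the implementation evaluated here: the frame size, duty-cycle length, learning rate, feedback values and the windowed-sum schedule selection are assumptions based on the description in this paper.

    import random

    NUM_SLOTS = 20      # slots per frame (illustrative value)
    ACTIVE_SLOTS = 5    # length of the contiguous active period (illustrative duty cycle)
    ALPHA = 0.1         # learning rate (illustrative value)

    q = [0.0] * NUM_SLOTS  # one quality value per slot

    def update_slot(slot, feedback):
        # Blend the newly observed feedback signal into the slot's Q-value.
        q[slot] = (1 - ALPHA) * q[slot] + ALPHA * feedback

    def choose_active_period():
        # Pick the start of the contiguous window whose slots have the
        # highest sum of Q-values; the node stays awake for that window.
        best_start, best_sum = 0, float("-inf")
        for start in range(NUM_SLOTS):
            window_sum = sum(q[(start + i) % NUM_SLOTS] for i in range(ACTIVE_SLOTS))
            if window_sum > best_sum:
                best_start, best_sum = start, window_sum
        return best_start

    # One simulated frame: observe a feedback signal for every slot, then re-schedule.
    for slot in range(NUM_SLOTS):
        update_slot(slot, random.uniform(-1.0, 1.0))  # stand-in for the real reward signal
    print("next active period starts at slot", choose_active_period())

In this sketch each node repeatedly blends new feedback into its per-slot values and then shifts its single active period to the part of the frame that has proven most rewarding.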
Despite the improvements that our approach offers over the standard S-MAC proto-
col, we discuss here two shortcomings that need to be addressed. First of all, the duty
cycle set by the user of the system affects all nodes equally. In other words, all nodes
are active for the same amount of time. Depending on their position in the network,
however, nodes require different durations for their active periods. Nodes close to the
sink are subject to a heavier traffic load than leaf nodes, whose active time need
not be as long. The second shortcoming of our technique concerns the coordination of
actions among active agents. Clearly, being awake at the same time is not sufficient for
two nodes to successfully exchange messages. If two agents on the same routing branch
attempt to transmit in the same slot, their messages will collide. Agents therefore need
to learn not only the time of their active period within a frame, but also when to transmit
and when to listen during that active period.
The above two shortcomings are being addressed in an extension of our algorithm,
which we call DESYDE [9]. The three main differences from the proposed approach are
outlined below; a short illustrative sketch follows the list:
1. In DESYDE we let agents learn two quality values for each slot, instead of one.
One quality value indicates how beneficial it is for the node to transmit during that
slot, while the other value indicates how good it is to listen for messages. In slots
where it is neither good to transmit nor to listen, the node will turn off its antenna
and enter sleep mode. Thus, each node learns the quality of three actions: transmit,
listen and sleep, as opposed to only wake-up and sleep.
2. The algorithm in DESYDE also differs from the one proposed in this paper in the
value of the learning rate α. In DESYDE we set this value to 1, which dramatically
alters the learning behavior of nodes. With α = 1, nodes remember only the most
recently observed feedback signal for each slot and discard older observations. In this
way the behavior of nodes resembles a Win-Stay Lose-Shift strategy [13]: in our setting,
at each slot agents repeat the action that was successful in the same slot of the
previous frame and try a different action if it was unsuccessful.
3. The last difference is the action selection method: in DESYDE nodes select at
each slot the action with the highest expected reward, rather than staying awake for
the slots with the highest sum of Q-values. If neither of the two quality values is
above 0 for a given slot, the agent selects sleep in that slot in the next frame. In this
way nodes adapt their duty cycle to the traffic load of the network and may wake
up at different slots within a frame, as opposed to holding a single active period.
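The per-slot rule that results from these three differences can be summarized in a short Python sketch. This is a hedged illustration rather than the DESYDE implementation of [9]: the frame size and feedback values are assumed, while the two quality values per slot, the α = 1 update, the threshold of 0 and the per-slot choice between transmit, listen and sleep follow the description above.

    NUM_SLOTS = 20  # slots per frame (illustrative value)

    # With a learning rate of 1, each entry simply stores the latest feedback
    # observed for that (slot, action) pair: a Win-Stay Lose-Shift style memory.
    q_transmit = [0.0] * NUM_SLOTS
    q_listen = [0.0] * NUM_SLOTS

    def observe(slot, action, feedback):
        # alpha = 1: the new feedback completely replaces the old value.
        if action == "transmit":
            q_transmit[slot] = feedback
        elif action == "listen":
            q_listen[slot] = feedback
        # a sleeping node receives no feedback for the slot

    def select_action(slot):
        # Choose the per-slot action with the highest expected reward;
        # sleep when neither transmitting nor listening looks worthwhile.
        if q_transmit[slot] <= 0 and q_listen[slot] <= 0:
            return "sleep"
        return "transmit" if q_transmit[slot] >= q_listen[slot] else "listen"

    # Example: build the schedule for the next frame from the current values.
    schedule = [select_action(s) for s in range(NUM_SLOTS)]

Because each slot is decided independently, a node near the sink can keep many slots active while a leaf node sleeps through most of the frame, which addresses the first shortcoming discussed above.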