cloud congestion levels are both independent of the actions taken by the devices within their clusters. This makes each device's learning problem independent, since the decisions made by other devices do not affect its reward.
12.5.3 Relation to Prior Work
Each mobile device of Fig. 12.4 seeks to maximize its own expected recognition
rate at the minimum possible cost in terms of utilized wireless resources (i.e., MAC
superframe transmission opportunities used). To this end, several approaches have
been proposed that are based on reinforcement learning [36], such as Q-learning [30].
In these, the goal is to learn the state-value function, which provides a measure of the expected long-term performance (utility). However, such methods incur large memory overheads for storing the state-value function, and they are slow to adapt to new or dynamically changing environments. A better approach is to intermittently explore and exploit, as needed, in order to capture such changes. Index policies for multi-armed bandit (MAB) problems, contextual bandits [22, 33], or epsilon-decreasing algorithms [3] can be used for this task (a minimal sketch of the latter is given below). However, none of the existing bandit frameworks takes the contention and congestion conditions into account as contexts in the application under consideration.
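To make the explore-exploit idea concrete, the following is a minimal sketch of the epsilon-decreasing strategy of the kind cited in [3]. The arm set, the Bernoulli reward model, and all parameter values are illustrative assumptions rather than the chapter's actual system; in the offloading setting, each arm would correspond to a choice of how many transmission opportunities to use.

import random

def epsilon_decreasing_bandit(reward_fns, horizon, eps0=5.0, seed=0):
    # Epsilon-decreasing selection: explore with probability min(1, eps0/t)
    # at round t, otherwise exploit the arm with the best empirical mean.
    rng = random.Random(seed)
    n_arms = len(reward_fns)
    counts = [0] * n_arms        # pulls per arm
    means = [0.0] * n_arms       # empirical mean reward per arm
    total_reward = 0.0
    for t in range(1, horizon + 1):
        eps = min(1.0, eps0 / t)                              # decaying exploration rate
        if rng.random() < eps:
            arm = rng.randrange(n_arms)                       # explore
        else:
            arm = max(range(n_arms), key=lambda a: means[a])  # exploit
        reward = reward_fns[arm](rng)
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]     # running mean
        total_reward += reward
    return total_reward, means

# Two hypothetical offloading choices with Bernoulli recognition rewards.
arms = [lambda rng: 1.0 if rng.random() < 0.7 else 0.0,
        lambda rng: 1.0 if rng.random() < 0.5 else 0.0]
total, means = epsilon_decreasing_bandit(arms, horizon=10000)

Note that only per-arm counts and means are stored, which illustrates the memory advantage over maintaining a full state-value table.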
12.5.4 Learning Based on Multi-user Bandits
Motivated by the lack of efficient methods that fully capture the problems related to
online learning in multi-user wireless networks and cloud computing systems with
uncertain and highly varying resource provisioning, an online systematic learning
theory based on multi-user contextual bandits has been developed. This learning
theory can be viewed as a natural extension of the basic MAB framework. Analytic
estimates have been derived to compare its efficiency against the complete knowledge
(or “oracle”) benchmark in which the expected reward of every choice is known by
the learner. Unlike Q-learning [ 36 ] and other learning-based methods, it is proven
that the regret bound—the loss incurred by the algorithm against the best possible
decision that assumes full knowledge of contention and congestion conditions—is
logarithmic if users do not collaborate and each would like to maximize the user's
own utility. Finally, the contextual bandit framework discussed here is general, and
can be used for learning in various kinds of wireless embedded computer vision
applications that involve offloading of selected processing tasks. Henceforth in this
chapter, we refer to the contextual bandit framework by the abbreviation CBF.
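As an illustration of the CBF's ingredients, the sketch below maintains separate UCB1 statistics for each discretized (contention, congestion) context and tracks the regret against the oracle benchmark described above; per-context index policies of this kind are known to achieve logarithmic regret. The context set, the number of arms, and the reward probabilities are hypothetical values chosen for the example, and the chapter's actual algorithm and regret analysis differ in detail.

import math
import random

class ContextualUCB1:
    # UCB1 statistics kept separately per discrete context, e.g., quantized
    # (contention, congestion) levels observed by the device.
    def __init__(self, n_arms):
        self.n_arms = n_arms
        self.stats = {}  # context -> (pull counts, empirical means, total pulls)

    def select(self, context):
        counts, means, total = self.stats.setdefault(
            context, ([0] * self.n_arms, [0.0] * self.n_arms, 0))
        for arm in range(self.n_arms):  # play every arm in this context once
            if counts[arm] == 0:
                return arm
        # UCB index: empirical mean plus an exploration bonus.
        return max(range(self.n_arms),
                   key=lambda a: means[a]
                   + math.sqrt(2.0 * math.log(total) / counts[a]))

    def update(self, context, arm, reward):
        counts, means, total = self.stats[context]
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]
        self.stats[context] = (counts, means, total + 1)

# Hypothetical mean rewards of three offloading choices per context.
mean_reward = {("low", "low"): [0.9, 0.6, 0.5],
               ("high", "high"): [0.3, 0.4, 0.8]}
rng = random.Random(1)
learner = ContextualUCB1(n_arms=3)
regret = 0.0
for t in range(5000):
    ctx = rng.choice(sorted(mean_reward))  # context arrives exogenously
    arm = learner.select(ctx)
    reward = 1.0 if rng.random() < mean_reward[ctx][arm] else 0.0
    learner.update(ctx, arm, reward)
    # Regret against the oracle that knows every expected reward.
    regret += max(mean_reward[ctx]) - mean_reward[ctx][arm]

Because statistics are kept per observed context rather than per full system state, memory grows with the number of contexts times the number of arms, which is the trade-off against state-value methods noted in Sect. 12.5.3.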