transfers may require both bandwidth and energy consumption, which
are usually limited resources in real scenarios. Furthermore, the ana-
lytics required for such applications is often real-time, and therefore it
requires the design of methods which can provide real-time insights in a
distributed way, with limited communication requirements. Discussions
of such techniques for a wide variety of data mining problems can be
found in the earlier chapters of this book, and also in [5].
In addition to the real-time insights, it is desirable to glean histori-
cal insights from the underlying data. In such cases, the insights may
need to be gleaned from massive amounts of archived sensor data. In
this context, Google's MapReduce framework [33] provides an effective
method for analysis of the sensor data, especially when the computations
involve linearly computable statistical functions over the elements of the
data streams (such as MIN, MAX, SUM, and MEAN). A
primer on the MapReduce framework implementation on Apache Hadoop
may be found in [115]. Google's original MapReduce framework was de-
signed for analyzing large amounts of web logs, and more specifically
deriving such linearly computable statistics from the logs. Sensor data
has a number of conceptual similarities to logs: both are similarly
repetitive, and the typical statistical computations which are often
performed on sensor data for many applications are linear in nature.
Therefore, it is quite natural to use this framework for sensor data ana-
lytics.
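To see why linearly computable statistics are such a natural fit for this framework, note that each of MIN, MAX, SUM, and MEAN can be computed from compact partial summaries that are merged without revisiting the raw data. The following Python sketch is our own illustration of this mergeability (the function names are hypothetical, not part of any framework):

```python
# Each node summarizes its local chunk of readings into a small,
# mergeable tuple; summaries can then be combined centrally.

def partial_stats(values):
    """Summarize one chunk of readings as (min, max, sum, count)."""
    return (min(values), max(values), sum(values), len(values))

def merge_stats(a, b):
    """Combine two partial summaries without touching the raw data."""
    return (min(a[0], b[0]), max(a[1], b[1]), a[2] + b[2], a[3] + b[3])

# Two nodes each summarize their local temperature readings ...
s1 = partial_stats([20.5, 23.1, 19.8])
s2 = partial_stats([25.0, 21.2])

# ... and only the summaries are communicated and merged.
mn, mx, total, count = merge_stats(s1, s2)
mean = total / count
```

The key point is that the merged result is identical to what a single pass over all the data would produce, while only a constant-size summary per node crosses the network.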
In order to understand this framework, let us consider the case where
we are trying to determine the maximum temperature in each year,
from sensor data recorded over a long period of time. The Map and
Reduce functions of MapReduce are defined with respect to data struc-
tured in (key, value) pairs. The Map function takes a list of pairs
(k1, v1) from one domain and returns a list of pairs (k2, v2). This compu-
tation is typically performed in parallel by dividing the key-value pairs
across different distributed computers. In our example, the data is in
the form of (year, value) pairs, where the year is the key. The Map
function then returns a list of (year, local max value) pairs, where the
local max value represents the local maximum in the subset of the data
processed by that node.
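The Map step of this example can be sketched in a few lines of Python. This is a minimal in-memory simulation of the idea, not the Hadoop API: map_fn processes one node's chunk and emits (year, local max) pairs, and a simple shuffle-and-reduce over those pairs recovers the global maximum per year.

```python
from collections import defaultdict

def map_fn(records):
    """records: list of (year, temperature) pairs for one node's chunk.
    Emit one (year, local_max) pair per year seen in the chunk."""
    local_max = {}
    for year, value in records:
        if year not in local_max or value > local_max[year]:
            local_max[year] = value
    return list(local_max.items())

def shuffle(mapped_lists):
    """Group the (key, value) pairs emitted by all mappers by key."""
    groups = defaultdict(list)
    for pairs in mapped_lists:
        for key, value in pairs:
            groups[key].append(value)
    return groups

def reduce_fn(key, values):
    """Reduce each group to the global maximum for that key."""
    return (key, max(values))

# Two mappers, each handling one chunk of the archived readings:
chunk1 = [(2001, 31.2), (2001, 29.5), (2002, 27.8)]
chunk2 = [(2001, 33.0), (2002, 30.1)]
mapped = [map_fn(chunk1), map_fn(chunk2)]
result = dict(reduce_fn(k, v) for k, v in shuffle(mapped).items())
# result: {2001: 33.0, 2002: 30.1}
```

Note that each mapper communicates at most one pair per year, regardless of how many raw readings its chunk contains, which is exactly the compression that makes the subsequent grouping step cheap.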
At this point, the MapReduce framework collects all pairs with the
same key from all lists and groups them together, thus creating one
group for each one of the different generated keys. We note that this
step requires communication between the different nodes, but the cost of
this communication is much lower than moving the original data around,
because the Map step has already created a compact summary from the
data processed within its node. We note that the exact implementation