transfers may require both bandwidth and energy consumption, which
are usually limited resources in real scenarios. Furthermore, the ana-
lytics required for such applications is often real-time, and therefore it
requires the design of methods which can provide real-time insights in a
distributed way, with limited communication requirements. Discussions
of such techniques for a wide variety of data mining problems can be
found in the earlier chapters of this book, and also in [5].
In addition to the real-time insights, it is desirable to glean histori-
cal insights from the underlying data. In such cases, the insights may
need to be gleaned from massive amounts of archived sensor data. In
this context, Google's MapReduce framework [33] provides an effective
method for analysis of the sensor data, especially when the computations
involve linearly computable statistical functions over the elements of the
data streams (such as MIN, MAX, SUM, and MEAN). A
primer on the MapReduce framework implementation on Apache Hadoop
may be found in [115]. Google's original MapReduce framework was de-
signed for analyzing large amounts of web logs, and more specifically
deriving such linearly computable statistics from the logs. Sensor data
has a number of conceptual similarities to logs: both are similarly
repetitive, and the typical statistical computations which are often
performed on sensor data for many applications are linear in nature.
Therefore, it is quite natural to use this framework for sensor data ana-
lytics.
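To see why linearly computable statistics are such a natural fit for this framework, note that each of MIN, MAX, SUM, and MEAN can be computed from compact partial summaries that are merged without revisiting the raw data. The following Python sketch is our own illustration of this mergeability (the function names are hypothetical, not part of any framework):

```python
# Each node summarizes its local chunk of readings into a small,
# mergeable tuple; summaries can then be combined centrally.

def partial_stats(values):
    """Summarize one chunk of readings as (min, max, sum, count)."""
    return (min(values), max(values), sum(values), len(values))

def merge_stats(a, b):
    """Combine two partial summaries without touching the raw data."""
    return (min(a[0], b[0]), max(a[1], b[1]), a[2] + b[2], a[3] + b[3])

# Two nodes each summarize their local temperature readings ...
s1 = partial_stats([20.5, 23.1, 19.8])
s2 = partial_stats([25.0, 21.2])

# ... and only the summaries are communicated and merged.
mn, mx, total, count = merge_stats(s1, s2)
mean = total / count
```

The key point is that the merged result is identical to what a single pass over all the data would produce, while only a constant-size summary per node crosses the network.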
In order to understand this framework, let us consider the case where
we are trying to determine the maximum temperature in each year,
from sensor data recorded over a long period of time. The Map and
Reduce functions of MapReduce are defined with respect to data struc-
tured in (key, value) pairs. The Map function takes a list of pairs
(k1, v1) from one domain and returns a list of pairs (k2, v2). This compu-
tation is typically performed in parallel by dividing the key-value pairs
across different distributed computers. In our example, the data is in
the form of (year, value) pairs, where the year is the key. The Map
function then returns a list of (year, local max value) pairs, where the
local max value represents the local maximum in the subset of the data
processed by that node.
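The Map step of this example can be sketched in a few lines of Python. This is a minimal in-memory simulation of the idea, not the Hadoop API: map_fn processes one node's chunk and emits (year, local max) pairs, and a simple shuffle-and-reduce over those pairs recovers the global maximum per year.

```python
from collections import defaultdict

def map_fn(records):
    """records: list of (year, temperature) pairs for one node's chunk.
    Emit one (year, local_max) pair per year seen in the chunk."""
    local_max = {}
    for year, value in records:
        if year not in local_max or value > local_max[year]:
            local_max[year] = value
    return list(local_max.items())

def shuffle(mapped_lists):
    """Group the (key, value) pairs emitted by all mappers by key."""
    groups = defaultdict(list)
    for pairs in mapped_lists:
        for key, value in pairs:
            groups[key].append(value)
    return groups

def reduce_fn(key, values):
    """Reduce each group to the global maximum for that key."""
    return (key, max(values))

# Two mappers, each handling one chunk of the archived readings:
chunk1 = [(2001, 31.2), (2001, 29.5), (2002, 27.8)]
chunk2 = [(2001, 33.0), (2002, 30.1)]
mapped = [map_fn(chunk1), map_fn(chunk2)]
result = dict(reduce_fn(k, v) for k, v in shuffle(mapped).items())
# result: {2001: 33.0, 2002: 30.1}
```

Note that each mapper communicates at most one pair per year, regardless of how many raw readings its chunk contains, which is exactly the compression that makes the subsequent grouping step cheap.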
At this point, the MapReduce framework collects all pairs with the
same key from all lists and groups them together, thus creating one
group for each one of the different generated keys. We note that this
step requires communication between the different nodes, but the cost of
this communication is much lower than moving the original data around,
because the Map step has already created a compact summary from the
data processed within its node. We note that the exact implementation