Pattern Recognition within Coarse-Grained Networks - Internet-Scale Pattern Recognition


8.3.1 Cloud Data Access Scheme

Data access schemes for a cloud infrastructure perform two important tasks: administering the distribution of data across different networks and providing data services to remote clients. In this section, we discuss a cloud data access scheme that uses Google's MapReduce technique. MapReduce is a programming model intended for large-scale data processing in a massively parallel manner. It was developed to solve issues involving the parallelization of computational processes and the distribution of data across heterogeneous networks. The MapReduce implementation also addresses load balancing, network performance, and fault tolerance [95].

The MapReduce programming model was inspired by the map and reduce primitives found in Lisp and other functional languages. It involves two functions: map and reduce. The map function, written by the user, takes an input pair and produces a set of intermediate key/value pairs. The MapReduce library groups the intermediate values associated with the same intermediate key and passes them to the reduce function. The reduce function, also written by the user, merges the intermediate values for that key to form a possibly smaller set of values. Typically, each invocation of the reduce function produces zero or one output value.
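The flow described above (map, group by intermediate key, reduce) can be sketched in a few lines of Python. This is a minimal single-process illustration, not Google's implementation: the word-count task and the names `map_fn`, `reduce_fn`, and `run_mapreduce` are assumptions chosen for the example.

```python
from collections import defaultdict


def map_fn(key, value):
    """User-written map: emit an intermediate (word, 1) pair per word."""
    for word in value.split():
        yield word.lower(), 1


def reduce_fn(key, values):
    """User-written reduce: merge all counts for one word into one total."""
    return sum(values)


def run_mapreduce(records, map_fn, reduce_fn):
    """Hypothetical stand-in for the library: map, group by key, reduce."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in map_fn(key, value):
            groups[out_key].append(out_value)  # grouping by intermediate key
    return {key: reduce_fn(key, values) for key, values in groups.items()}


# Input pairs are (line number, line text); both lines are made up.
lines = [(0, "the map function"), (1, "the reduce function")]
counts = run_mapreduce(lines, map_fn, reduce_fn)
print(counts)
```

Here the grouping step plays the role that the text assigns to the MapReduce library, and each call to `reduce_fn` produces exactly one output value for its key.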

Consider the following examples of the map and reduce functions. Taking multiplication as the operation in a function f(z), the following expressions illustrate how map and reduce are applied:

f(z) = map(×2, (2, 4, 6)) → ((2×2), (4×2), (6×2)) = (4, 8, 12)

f(z) = reduce(×, (2, 4, 6)) → ((2×4)×6) = 48

Note that the map function can apply the operation to all the inputs in parallel, whereas the reduce function works sequentially from left to right.
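The same two expressions can be reproduced with Python's built-in `map` and `functools.reduce`. This is only a sketch of the functional primitives, run in a single process; the variable names are chosen for illustration.

```python
from functools import reduce
from operator import mul

values = (2, 4, 6)

# map applies the operation to each element independently,
# which is why the work could be done in parallel.
doubled = tuple(map(lambda x: x * 2, values))

# reduce folds the sequence from left to right: ((2 * 4) * 6).
product = reduce(mul, values)

print(doubled)  # (4, 8, 12)
print(product)  # 48
```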

In the data access mechanism, the map and reduce functions are used to retrieve data from a collection of distributed repositories. The map function extracts the desired information based on a condition set by the user (for example, the condition within an SQL query). It works at the atomic level of the data (a tuple or a file). The reduce function performs an operation on the data retrieved by the map function and returns a set of values or a single value, as required by the user.
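As a rough illustration of this retrieval pattern, the sketch below applies a user-set condition to each repository separately and then reduces the matches to a single value. The repositories, the `(sensor_id, reading)` tuples, and the function names are all hypothetical, and a thread pool merely stands in for work that would run at each remote site.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

# Hypothetical repositories, each holding (sensor_id, reading) tuples.
repositories = [
    [("a", 12), ("b", 40)],
    [("c", 55), ("d", 7)],
    [("e", 31)],
]


def above_thirty(record):
    """User-set condition, comparable to a WHERE clause."""
    return record[1] > 30


def map_repository(records):
    """Map step: keep only the tuples that satisfy the condition."""
    return [record for record in records if above_thirty(record)]


# Each repository is processed independently; pool.map keeps input order.
with ThreadPoolExecutor() as pool:
    partial_results = list(pool.map(map_repository, repositories))

matching = [record for part in partial_results for record in part]

# Reduce step: collapse the retrieved tuples to the single highest reading.
highest = reduce(lambda best, rec: rec if rec[1] > best[1] else best, matching)

print(matching)
print(highest)
```

Depending on what the user needs, the result can be the set of matching tuples (`matching`) or a single value derived from them (`highest`).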

An important feature of MapReduce is its ability to parallelize operations by working on each individual data item and performing these tasks on-site. Consider the following example. Suppose there is a set of data on employees' personal details, as shown in Table 8.3. An SQL query is issued to retrieve the average salary per department for executive employees as follows:

With this SQL query, MapReduce conducts the map operation to obtain the name and salary of each employee in a department. The reduce function then calculates the average salary for each department. Figure 8.9 shows these operations.
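The averaging example can be approximated in Python as follows. Table 8.3 and the query are not reproduced here, so the records, field names, and salary figures below are invented for illustration. The sketch also assumes that the map step keys each salary by department so that the grouping can feed the reduce step, which may differ in detail from Figure 8.9.

```python
from collections import defaultdict

# Hypothetical employee records; not the contents of Table 8.3.
employees = [
    {"name": "Ana", "department": "Sales", "position": "Executive", "salary": 5000},
    {"name": "Ben", "department": "Sales", "position": "Executive", "salary": 7000},
    {"name": "Cy", "department": "Sales", "position": "Clerk", "salary": 3000},
    {"name": "Dee", "department": "IT", "position": "Executive", "salary": 6000},
]


def map_employee(record):
    """Map step: emit (department, salary) for executive employees only."""
    if record["position"] == "Executive":
        yield record["department"], record["salary"]


def reduce_average(department, salaries):
    """Reduce step: average all salaries grouped under one department."""
    return sum(salaries) / len(salaries)


def average_salary_by_department(records):
    """Run map on every record, group by department, then reduce."""
    groups = defaultdict(list)
    for record in records:
        for department, salary in map_employee(record):
            groups[department].append(salary)
    return {dept: reduce_average(dept, sal) for dept, sal in groups.items()}


averages = average_salary_by_department(employees)
print(averages)
```

With these made-up records, the non-executive employee is dropped during the map step, and each department yields one average from its reduce call.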

Some problems arise in this type of processing configuration. For example,

the map function conducts its operation assuming that data are distributed
