In general, the growing demand for large-scale data mining and data analysis applications has spurred the development of novel solutions in both industry (e.g., web-data analysis, click-stream analysis, network-monitoring log analysis) and the sciences (e.g., analysis of data produced by massive-scale simulations, sensor deployments, and high-throughput lab equipment). Although parallel database systems [46] serve some of these data analysis applications (e.g., Teradata,* SQL Server PDW,† Vertica,‡ Greenplum,§ ParAccel,¶ Netezza**), they are expensive, difficult to administer, and lack fault tolerance for long-running queries [113]. MapReduce [43] is a framework introduced by Google for programming commodity computer clusters to perform large-scale data processing in a single pass. The framework is designed so that a MapReduce cluster can scale to thousands of nodes in a fault-tolerant manner. One of the main advantages of this framework is its reliance on a simple and powerful programming model. In addition, it isolates the application developer from all the complex details of running a distributed program, such as data distribution, scheduling, and fault tolerance [112].
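To make this programming model concrete, the following sketch shows the canonical word-count job written against Hadoop, the open-source implementation of the MapReduce model: the developer supplies only a map function (emitting a (word, 1) pair for each word) and a reduce function (summing the counts per word). The class names and the two command-line path arguments are illustrative.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: emit (word, 1) for every word in an input record.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: sum the counts collected for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    // The framework handles data distribution, scheduling, and fault
    // tolerance; the programmer only wires the map and reduce classes.
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Running this job on a 10-node or a 1,000-node cluster requires no change to the code; the framework transparently partitions the input, schedules map and reduce tasks across the nodes, and re-executes tasks that fail.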
Recently, there has been a great deal of hype about cloud computing [11]. In principle, cloud computing is associated with a new paradigm for the provisioning of computing infrastructure. This paradigm shifts the location of this infrastructure to more centralized and larger-scale datacenters to reduce the costs associated with the management of hardware and software resources. In particular, cloud computing has promised a number of advantages for hosting the deployments of data-intensive applications, such as:
• Reduced time-to-market, by removing or simplifying the time-consuming hardware provisioning, purchasing, and deployment processes
• Reduced monetary cost, by following a pay-as-you-go business model
• Virtually unlimited throughput, by adding servers as the workload increases
In principle, the success of many enterprises often relies on their ability to analyze expansive volumes of data. In general, cost-effective processing of large data sets is a nontrivial undertaking. Fortunately, MapReduce frameworks and cloud computing have made it easier than ever for everyone to step into the world of Big Data. This technology combination has enabled even small companies to collect and analyze terabytes of data to gain a competitive edge. For example, the Amazon Elastic Compute Cloud (EC2)†† is offered as a commodity that can be purchased and utilized. In addition, Amazon has also provided Amazon Elastic MapReduce‡‡ as an online service to easily and cost-effectively process vast amounts of data without the need to worry about time-consuming setup, management, or tuning of computing clusters.
* http://teradata.com/.
† http://www.microsoft.com/sqlserver/en/us/solutions-technologies/data-warehousing/pdw.aspx.
‡ http://www.vertica.com/.
§ http://www.greenplum.com/.
¶ http://www.paraccel.com/.
** http://www-01.ibm.com/software/data/netezza/.
†† http://aws.amazon.com/ec2/.
‡‡ http://aws.amazon.com/elasticmapreduce/.
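As an illustration of the Elastic MapReduce workflow described above, the following is a minimal sketch that provisions a cluster and submits the word-count jar as a single step, assuming the AWS SDK for Java (v1). The S3 paths, release label, instance types, and IAM role names are illustrative placeholders, not values prescribed by the service.

import com.amazonaws.regions.Regions;
import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduce;
import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClientBuilder;
import com.amazonaws.services.elasticmapreduce.model.Application;
import com.amazonaws.services.elasticmapreduce.model.HadoopJarStepConfig;
import com.amazonaws.services.elasticmapreduce.model.JobFlowInstancesConfig;
import com.amazonaws.services.elasticmapreduce.model.RunJobFlowRequest;
import com.amazonaws.services.elasticmapreduce.model.RunJobFlowResult;
import com.amazonaws.services.elasticmapreduce.model.StepConfig;

public class RunWordCountOnEmr {
  public static void main(String[] args) {
    AmazonElasticMapReduce emr = AmazonElasticMapReduceClientBuilder.standard()
        .withRegion(Regions.US_EAST_1)
        .build();

    // A single step that runs the WordCount jar shown earlier; the S3
    // bucket paths below are hypothetical.
    StepConfig wordCountStep = new StepConfig()
        .withName("word-count")
        .withActionOnFailure("TERMINATE_CLUSTER")
        .withHadoopJarStep(new HadoopJarStepConfig()
            .withJar("s3://my-bucket/jars/wordcount.jar")
            .withArgs("s3://my-bucket/input/", "s3://my-bucket/output/"));

    RunJobFlowRequest request = new RunJobFlowRequest()
        .withName("wordcount-cluster")
        .withReleaseLabel("emr-5.20.0")                 // example EMR release
        .withApplications(new Application().withName("Hadoop"))
        .withSteps(wordCountStep)
        .withServiceRole("EMR_DefaultRole")             // default IAM roles
        .withJobFlowRole("EMR_EC2_DefaultRole")
        .withInstances(new JobFlowInstancesConfig()
            .withInstanceCount(3)                       // 1 master + 2 workers
            .withMasterInstanceType("m4.large")
            .withSlaveInstanceType("m4.large")
            .withKeepJobFlowAliveWhenNoSteps(false));   // terminate when done

    RunJobFlowResult result = emr.runJobFlow(request);
    System.out.println("Started cluster: " + result.getJobFlowId());
  }
}

Note how no cluster setup, management, or tuning appears in the code: the service provisions the nodes, runs the step, and tears the cluster down when the job completes.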