In general, the growing demand for large-scale data mining and data analysis applications has spurred the development of novel solutions in both industry (e.g., web-data analysis, click-stream analysis, network-monitoring log analysis) and the sciences (e.g., analysis of data produced by massive-scale simulations, sensor deployments, and high-throughput lab equipment). Although parallel database systems [46] serve some of these data analysis applications (e.g., Teradata,* SQL Server PDW,† Vertica,‡ Greenplum,§ ParAccel,¶ Netezza**), they are expensive, difficult to administer, and lack fault tolerance for long-running queries [113]. MapReduce [43] is a framework introduced by Google for programming commodity computer clusters to perform large-scale data processing in a single pass. The framework is designed so that a MapReduce cluster can scale to thousands of nodes in a fault-tolerant manner. One of the main advantages of this framework is its reliance on a simple and powerful programming model. In addition, it isolates the application developer from all the complex details of running a distributed program, such as data distribution, scheduling, and fault tolerance [112].
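To make this programming model concrete, the following sketch shows the canonical word-count job written against Hadoop, the open-source implementation of the MapReduce model: the developer supplies only a map function (emitting a (word, 1) pair for each word) and a reduce function (summing the counts per word). The class names and the two command-line path arguments are illustrative.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: emit (word, 1) for every word in an input record.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: sum the counts collected for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    // The framework handles data distribution, scheduling, and fault
    // tolerance; the programmer only wires the map and reduce classes.
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Running this job on a 10-node or a 1,000-node cluster requires no change to the code; the framework transparently partitions the input, schedules map and reduce tasks across the nodes, and re-executes tasks that fail.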
Recently, there has been a great deal of hype about cloud computing [11]. In principle, cloud computing is associated with a new paradigm for the provisioning of computing infrastructure. This paradigm shifts the location of this infrastructure to more centralized and larger-scale datacenters to reduce the costs associated with the management of hardware and software resources. In particular, cloud computing has promised a number of advantages for hosting the deployments of data-intensive applications, such as:
• Reduced time-to-market, by removing or simplifying the time-consuming hardware provisioning, purchasing, and deployment processes
• Reduced monetary cost, by following a pay-as-you-go business model
• Virtually unlimited throughput, by adding servers as the workload increases
In principle, the success of many enterprises often relies on their ability to analyze expansive volumes of data. In general, cost-effective processing of large data sets is a nontrivial undertaking. Fortunately, MapReduce frameworks and cloud computing have made it easier than ever for everyone to step into the world of Big Data. This technology combination has enabled even small companies to collect and analyze terabytes of data to gain a competitive edge. For example, the Amazon Elastic Compute Cloud (EC2)†† is offered as a commodity that can be purchased and utilized. In addition, Amazon has also provided Amazon Elastic MapReduce‡‡ as an online service to easily and cost-effectively process vast amounts of data without the need to worry about time-consuming setup, management, or tuning of computing clusters.
* http://teradata.com/.
† http://www.microsoft.com/sqlserver/en/us/solutions-technologies/data-warehousing/pdw.aspx.
‡ http://www.vertica.com/.
§ http://www.greenplum.com/.
¶ http://www.paraccel.com/.
** http://www-01.ibm.com/software/data/netezza/.
†† http://aws.amazon.com/ec2/.
‡‡ http://aws.amazon.com/elasticmapreduce/.
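As an illustration of the Elastic MapReduce workflow described above, the following is a minimal sketch that provisions a cluster and submits the word-count jar as a single step, assuming the AWS SDK for Java (v1). The S3 paths, release label, instance types, and IAM role names are illustrative placeholders, not values prescribed by the service.

import com.amazonaws.regions.Regions;
import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduce;
import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClientBuilder;
import com.amazonaws.services.elasticmapreduce.model.Application;
import com.amazonaws.services.elasticmapreduce.model.HadoopJarStepConfig;
import com.amazonaws.services.elasticmapreduce.model.JobFlowInstancesConfig;
import com.amazonaws.services.elasticmapreduce.model.RunJobFlowRequest;
import com.amazonaws.services.elasticmapreduce.model.RunJobFlowResult;
import com.amazonaws.services.elasticmapreduce.model.StepConfig;

public class RunWordCountOnEmr {
  public static void main(String[] args) {
    AmazonElasticMapReduce emr = AmazonElasticMapReduceClientBuilder.standard()
        .withRegion(Regions.US_EAST_1)
        .build();

    // A single step that runs the WordCount jar shown earlier; the S3
    // bucket paths below are hypothetical.
    StepConfig wordCountStep = new StepConfig()
        .withName("word-count")
        .withActionOnFailure("TERMINATE_CLUSTER")
        .withHadoopJarStep(new HadoopJarStepConfig()
            .withJar("s3://my-bucket/jars/wordcount.jar")
            .withArgs("s3://my-bucket/input/", "s3://my-bucket/output/"));

    RunJobFlowRequest request = new RunJobFlowRequest()
        .withName("wordcount-cluster")
        .withReleaseLabel("emr-5.20.0")                 // example EMR release
        .withApplications(new Application().withName("Hadoop"))
        .withSteps(wordCountStep)
        .withServiceRole("EMR_DefaultRole")             // default IAM roles
        .withJobFlowRole("EMR_EC2_DefaultRole")
        .withInstances(new JobFlowInstancesConfig()
            .withInstanceCount(3)                       // 1 master + 2 workers
            .withMasterInstanceType("m4.large")
            .withSlaveInstanceType("m4.large")
            .withKeepJobFlowAliveWhenNoSteps(false));   // terminate when done

    RunJobFlowResult result = emr.runJobFlow(request);
    System.out.println("Started cluster: " + result.getJobFlowId());
  }
}

Note how no cluster setup, management, or tuning appears in the code: the service provisions the nodes, runs the step, and tears the cluster down when the job completes.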