large tables is not a great use case for a relational database. Joining the results from two
large, normalized tables may take a great deal of time.
In the classic model of disruption, organizations are finding that they can solve
some of their data challenges without the need for complicated ETL, star schemas, and
other conventions of enterprise data warehousing. Companies that deal with massive
amounts of data, such as Facebook, recognized that the existing data warehouse solutions were not able to cope, either technologically or economically, with the huge amounts of data being generated by Internet applications. With data sizes approaching petabytes, Facebook had to turn to a data processing model that could scale. The Apache Hadoop project was just such a platform. Hadoop provides an
open-source implementation of MapReduce, a data processing framework that can
scale horizontally as data sizes increase. Even better, Hadoop makes it possible to use
clusters of low-cost servers.
Apache Hadoop has long been the media darling of open-source data process-
ing, and for good reason. Hadoop, along with the Hadoop Distributed File System
(HDFS), provides a framework for splitting data processing tasks across a collection
of different machines. With a vanilla Hadoop installation, MapReduce is provided
as an interface. In order to create a MapReduce job to process data, it's possible to
write MapReduce workflows using a scripting language such as Bash or Python (see Chapter 8, “Putting It Together: MapReduce Data Pipelines”), using a library such as Cascading, or using the workflow language Pig (see Chapter 9, “Building Data Transformation Workflows with Pig and Cascading”).
A MapReduce job is a great way to facilitate batch processing of unstructured data.
For example, MapReduce can be used to convert the values in a huge collection of
raw text files from one data type to another.
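As a rough sketch of this kind of batch conversion, the following mapper, written in Python for Hadoop Streaming, reads tab-separated records and converts a temperature field from a Fahrenheit string to a floating-point Celsius value. The input layout (a station identifier followed by a Fahrenheit reading) is purely hypothetical, chosen only to illustrate the idea.

#!/usr/bin/env python
# mapper.py: a minimal Hadoop Streaming mapper sketch.
# Assumes (hypothetically) tab-separated input lines of the form
# "<station_id>\t<temp_fahrenheit>" and emits "<station_id>\t<temp_celsius>".
import sys

for line in sys.stdin:
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 2:
        continue  # skip malformed records
    station_id, raw_temp = fields[0], fields[1]
    try:
        celsius = (float(raw_temp) - 32.0) * 5.0 / 9.0
    except ValueError:
        continue  # skip records whose value cannot be parsed as a number
    sys.stdout.write("%s\t%.2f\n" % (station_id, celsius))

For a quick sanity check, a script like this can be run locally with cat readings.tsv | python mapper.py before it is ever submitted to a cluster through the Hadoop Streaming jar; a simple conversion job like this one may not even need a reducer.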
Relational databases are useful tools for asking questions about structured datasets.
Can you use a MapReduce job to ask a similar question? You definitely can, but writing code for interactive queries is not nearly as easy. Running a query over data is a different process from defining a batch-processing job. With a batch-processing job, there's no expectation of instantaneous
results. Data analysis lends itself to ad hoc exploration: As soon as a query is complete,
you may want to run more follow-up queries immediately. Even experienced develop-
ers would find it cumbersome to write out a new script for every iteration of a query.
Additionally, the act of querying is often done by analysts and decision makers who
may not be specialists at writing code or managing Hadoop instances.
Relational databases also tend to support standard versions of the Structured Query
Language, or SQL, which is well known and easy to learn. SQL supports common
database functions such as selecting the results of mathematical operations, joining
query results from different tables, and grouping results together by a particular value.
SQL is very expressive, but the biggest advantage of providing access to SQL is that
the language is well understood by a large number of people, meaning that analysts
can use it without needing to know how to write code.
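To make those features concrete, here is a minimal sketch using Python's built-in sqlite3 module; the customers and orders tables and their contents are made up for illustration. The query selects the result of a mathematical operation, joins results from two tables, and groups them by a particular value.

import sqlite3

# Hypothetical example: a tiny in-memory database with made-up tables,
# used only to illustrate arithmetic in SELECT, a join, and GROUP BY.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER,
                         quantity INTEGER, unit_price REAL);
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO orders VALUES (1, 1, 2, 9.50), (2, 1, 1, 20.00), (3, 2, 5, 3.25);
""")

# Total spend per customer: arithmetic (quantity * unit_price),
# a join between the two tables, and grouping by customer name.
query = """
    SELECT c.name, SUM(o.quantity * o.unit_price) AS total_spend
    FROM orders o
    JOIN customers c ON c.id = o.customer_id
    GROUP BY c.name
"""
for name, total_spend in conn.execute(query):
    print(name, total_spend)

Essentially the same SELECT statement would run against most relational databases that follow the SQL standard, which is a large part of why giving analysts a SQL interface is so valuable.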
 