large tables is not a great use case for a relational database. Joining the results from two
large, normalized tables may take a great deal of time.
In the classic model of disruption, organizations are finding that they can solve
some of their data challenges without the need for complicated ETL, star schemas, and
other conventions of enterprise data warehousing. Companies that deal with massive
amounts of data, such as Facebook, recognized that the existing data warehouse solutions were not able to cope, either technologically or economically, with the huge amounts of data being generated by Internet applications. With data sizes approaching petabytes, Facebook had to turn to a data processing model that could scale. The Apache Hadoop project was just such a platform. Hadoop provides an
open-source implementation of MapReduce, a data processing framework that can
scale horizontally as data sizes increase. Even better, Hadoop makes it possible to use
clusters of low-cost servers.
Apache Hadoop has long been the media darling of open-source data process-
ing, and for good reason. Hadoop, along with the Hadoop Distributed File System
(HDFS), provides a framework for splitting data processing tasks across a collection
of different machines. With a vanilla Hadoop installation, MapReduce is provided
as an interface. In order to create a MapReduce job to process data, it's possible to
write MapReduce workflows using a scripting language such as Bash or Python (see Chapter 8, “Putting It Together: MapReduce Data Pipelines”), using a library such as Cascading, or using the workflow language Pig (see Chapter 9, “Building Data Transformation Workflows with Pig and Cascading”).
A MapReduce job is a great way to facilitate batch processing of unstructured data.
For example, MapReduce can be used to convert the values in a huge collection of
raw text files from one data type to another.
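As a rough sketch of this kind of batch conversion, the following mapper, written in Python for Hadoop Streaming, reads tab-separated records and converts a temperature field from a Fahrenheit string to a floating-point Celsius value. The input layout (a station identifier followed by a Fahrenheit reading) is purely hypothetical, chosen only to illustrate the idea.

#!/usr/bin/env python
# mapper.py: a minimal Hadoop Streaming mapper sketch.
# Assumes (hypothetically) tab-separated input lines of the form
# "<station_id>\t<temp_fahrenheit>" and emits "<station_id>\t<temp_celsius>".
import sys

for line in sys.stdin:
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 2:
        continue  # skip malformed records
    station_id, raw_temp = fields[0], fields[1]
    try:
        celsius = (float(raw_temp) - 32.0) * 5.0 / 9.0
    except ValueError:
        continue  # skip records whose value cannot be parsed as a number
    sys.stdout.write("%s\t%.2f\n" % (station_id, celsius))

For a quick sanity check, a script like this can be run locally with cat readings.tsv | python mapper.py before it is ever submitted to a cluster through the Hadoop Streaming jar; a simple conversion job like this one may not even need a reducer.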
Relational databases are useful tools for asking questions about structured datasets.
Can you use a MapReduce job to ask a similar question? You definitely can, but writing code for interactive queries is not nearly as easy. Running a query over data is a different process from defining a batch-processing job. With a batch-processing job, there's no expectation of instantaneous
results. Data analysis lends itself to ad hoc exploration: As soon as a query is complete,
you may want to run more follow-up queries immediately. Even experienced develop-
ers would find it cumbersome to write out a new script for every iteration of a query.
Additionally, the act of querying is often done by analysts and decision makers who
may not be specialists at writing code or managing Hadoop instances.
Relational databases also tend to support standard versions of the Structured Query
Language, or SQL, which is well known and easy to learn. SQL supports common
database functions such as selecting the results of mathematical operations, joining
query results from different tables, and grouping results together by a particular value.
SQL is very expressive, but the biggest advantage of providing access to SQL is that
the language is well understood by a large number of people, meaning that analysts
can use it without needing to know how to write code.
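To make those features concrete, here is a minimal sketch using Python's built-in sqlite3 module; the customers and orders tables and their contents are made up for illustration. The query selects the result of a mathematical operation, joins results from two tables, and groups them by a particular value.

import sqlite3

# Hypothetical example: a tiny in-memory database with made-up tables,
# used only to illustrate arithmetic in SELECT, a join, and GROUP BY.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER,
                         quantity INTEGER, unit_price REAL);
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO orders VALUES (1, 1, 2, 9.50), (2, 1, 1, 20.00), (3, 2, 5, 3.25);
""")

# Total spend per customer: arithmetic (quantity * unit_price),
# a join between the two tables, and grouping by customer name.
query = """
    SELECT c.name, SUM(o.quantity * o.unit_price) AS total_spend
    FROM orders o
    JOIN customers c ON c.id = o.customer_id
    GROUP BY c.name
"""
for name, total_spend in conn.execute(query):
    print(name, total_spend)

Essentially the same SELECT statement would run against most relational databases that follow the SQL standard, which is a large part of why giving analysts a SQL interface is so valuable.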
 