Hive makes ad hoc queries on large data sets possible, but these technologies aren't always
fast enough to match the iterative workflows of data analysts.
After a number of years using MapReduce, developers at Google started to rethink
the process of running aggregate queries over large datasets. An iterative query experi-
ence requires the ability to write and run queries quickly. In 2010, Google released a
research paper that illustrated a technology known as Dremel. With Dremel, engineers
could formulate queries using an SQL-like syntax, speeding up the process of iterative
analysis without dealing with the overhead of defining raw MapReduce jobs. More
importantly, Dremel used a novel technical design that could return query results over
terabyte-scale datasets in seconds.
How Dremel and MapReduce Differ
In practice, data processing applications are often built using a collection of comple-
mentary technologies. As data flows through various pipelines, specialized resources
are used for the steps of collection, processing, and analysis. Systems that enable batch
processing of data, such as MapReduce, and those that provide ad hoc analytical pro-
cessing, represented by Dremel, are complementary.
In Chapter 5, we looked at Apache Hive, a project that provides an SQL-like inter-
face for defining MapReduce jobs that return the results of queries. Hive enables users
to concentrate on thinking about data questions rather than the underlying MapReduce
jobs that produce query results. Superficially, Dremel looks a bit like Apache Hive, as
Dremel also features an SQL-like interface for defining queries.
MapReduce is a flexible design. The MapReduce model enables a great variety of
tasks to be implemented, as is evident from the large ecosystem of software
available for frameworks such as Hadoop. Dealing with unstructured data is possible
by defining custom workflows to process that data.
Dremel requires data to conform to a schema. Several data types are available,
including basic strings, numeric formats, and Booleans. Data in Dremel may be stored
as flat records or in a nested and repeated format in which individual fields in a record
can have child records of their own.
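As a rough illustration of this nested, repeated layout (the field names here are invented for the example, not taken from Dremel itself), a single record might look like the following Python sketch:

```python
# A sketch of one nested, repeated record (hypothetical field names).
# Scalar fields use basic types (string, numeric, Boolean), while the
# "addresses" field holds repeated child records with fields of their own.
person = {
    "first_name": "Michael",        # string
    "last_name": "Smith",           # string
    "age": 42,                      # numeric
    "is_subscriber": True,          # Boolean
    "addresses": [                  # repeated child records
        {"street": "1 Main St", "city": "Springfield", "zip_code": "12345"},
        {"street": "9 Elm Ave", "city": "Shelbyville", "zip_code": "67890"},
    ],
}

# A field inside a child record is reached through its parent.
print(person["addresses"][0]["zip_code"])  # prints "12345"
```

In a schema-driven system, each of these fields would be declared with its type up front, rather than inferred from the data as in this loose dictionary.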
Hadoop stores data at rest in a distributed filesystem called the Hadoop Distrib-
uted Filesystem, or HDFS. Dremel stores data on disk in a columnar format. Storing
data in columns instead of rows means that only the minimum data necessary needs
to be read from disk during a query. Imagine a data table that contains information
about a person's address. One way to represent this data would be to store first name,
last name, street, city, zip code, and other elements in individual fields. If your query
requires you to find out how many people named “Michael” live in each zip code,
Dremel only needs to inspect the data in the first name and zip code columns. Use of
columnar datastores is not a technique exclusive to Dremel; we'll take a look at several
other technologies that use a columnar data structure later on.
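The zip-code example above can be sketched in a few lines of Python. The layout below is a toy stand-in for a columnar store, not Dremel's actual on-disk format: each column is held as a separate list, so a query over first names and zip codes never has to touch the other columns at all.

```python
from collections import Counter

# Toy column-oriented layout (hypothetical data): one list per column.
columns = {
    "first_name": ["Michael", "Sarah", "Michael", "Michael"],
    "last_name":  ["Smith", "Jones", "Lee", "Chan"],
    "street":     ["1 Main St", "2 Oak Rd", "3 Elm Ave", "4 Pine Ct"],
    "city":       ["Springfield", "Shelbyville", "Springfield", "Capital City"],
    "zip_code":   ["12345", "67890", "12345", "54321"],
}

def michaels_per_zip(cols):
    # The query reads only the two columns it needs; last_name, street,
    # and city are never scanned.
    names, zips = cols["first_name"], cols["zip_code"]
    return Counter(z for n, z in zip(names, zips) if n == "Michael")

print(michaels_per_zip(columns))  # Counter({'12345': 2, '54321': 1})
```

In a row-oriented layout, every full record would have to be read from disk to answer the same question; the columnar layout is what lets the query skip the irrelevant fields.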
Imagine an SQL-like query that operates over an entire table of data. The query
may do several things, including grouping, joining, and ordering. Queries that may
take many steps with MapReduce frameworks generally store their intermediate results