5
Using Hadoop, Hive, and Shark to Ask Questions about Large Datasets
The concept of data warehousing has long been in the domain of large enterprises. A
huge industry has developed to attack the problem of quickly asking questions about
data from across an entire organization. Data warehouse practices encompass the
design of complicated ETL pipelines and the art of processing data from transactional
databases using OLAP cubes and star schemas. This mature field is being challenged
by new approaches to data warehousing: approaches that, in some cases, are more
scalable and performant as well as cheaper.
The Hadoop project provides an open-source platform for distributing data
processing tasks across clusters of low-cost commodity servers. Hadoop's
implementation of the MapReduce framework provides the ability to run
batch-processing jobs on large datasets. Asking questions about data is often an
exploratory, iterative process in which analysts expect fast results. Writing new
custom MapReduce code for query after query is far too slow and cumbersome to use
in practice.
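To make the MapReduce model concrete, here is a minimal single-process sketch of a word-count job in Python. This is an illustration of the programming model only, not Hadoop's actual Java API; the `map_phase` and `reduce_phase` names and the in-memory sort standing in for Hadoop's shuffle phase are assumptions made for the example.

```python
from itertools import groupby
from operator import itemgetter

def map_phase(lines):
    # Map step: emit a (word, 1) key-value pair for every word in the input.
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def reduce_phase(pairs):
    # Shuffle/sort stand-in: bring identical keys together.
    pairs = sorted(pairs, key=itemgetter(0))
    # Reduce step: sum the counts for each distinct word.
    for word, group in groupby(pairs, key=itemgetter(0)):
        yield (word, sum(count for _, count in group))

lines = ["the quick brown fox", "the lazy dog"]
counts = dict(reduce_phase(map_phase(lines)))
print(counts["the"])  # "the" appears twice across the two input lines
```

Even for a job this simple, the map and reduce functions must be written, packaged, and submitted as a batch job in a real Hadoop deployment, which is why per-query MapReduce development is so cumbersome for interactive analysis.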
In this chapter, we'll take a look at Apache Hive, an open-source project that helps
users explore data managed by Hadoop by writing queries in a familiar, SQL-like
syntax. We'll also discuss some trends in interactive query for large datasets,
including the Spark and Shark projects.
What Is a Data Warehouse?
The world of enterprise data analytics is a spawning ground for poorly defined
jargon, and the kaleidoscope of terms that makes up the world of data warehousing is
among the most confusing. Data warehousing is a problematic term, as it can refer to
a large number of things, including the process of organizing company data or the
physical hardware used for storage. Some data warehouses are used for canonical
storage of organizational data, whereas others are used as temporary databases simply to be used