Because infrastructure costs and capacity can be scaled to need, the economic model for cloud data warehouse applications may be more affordable.
Summary
The relational database architecture of the traditional data warehouse concept is being
challenged by new, disruptive open-source technologies. When data sizes grow very
large, commercial data warehouse solutions can be economically prohibitive for many
organizations. Some data challenges, like those involving completely unstructured
data, do not lend themselves to the use of relational tables, star schemas, and complicated ETL processes inherent in the enterprise database world.
The Apache Hadoop project provides a framework for processing data using clusters of commodity hardware. Hadoop, along with the underlying Hadoop Distributed
File System (HDFS), is able to scale horizontally as more and more data is added to
the system. The processing model that Hadoop provides, MapReduce, is designed to
enable data processing to take place as near as possible to the distributed data storage.
This makes complex batch processing of data across the cluster possible, typically implemented as streaming MapReduce scripts, Apache Pig workflows, or complete applications written in a high-level language such as Java.
Hadoop isn't just for raw data processing; MapReduce can also be used to answer aggregate queries involving sums, grouping, joins, and other functions. However, querying datasets is often an iterative process. A single query can require multiple MapReduce jobs to produce a result, and it is cumbersome to iteratively develop code that defines complex MapReduce workflows.
The open-source Apache Hive project, originally started at Facebook, was created to
provide an SQL-like interface for Hadoop in order to speed up the process of writing
iterative queries over data stored in HDFS.
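As a minimal sketch (the table and column names here are hypothetical), an aggregation that would otherwise require a hand-written MapReduce job can be expressed in a few lines of HiveQL:

    -- Count views per user from a hypothetical page_views table; Hive
    -- compiles this statement into one or more MapReduce jobs automatically.
    SELECT user_id, COUNT(*) AS view_count
    FROM page_views
    GROUP BY user_id
    ORDER BY view_count DESC
    LIMIT 10;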
Hive is a project that applies some of the concepts from data warehousing to the
Hadoop framework. Unlike traditional data warehousing applications built using
relational databases, Hive defines tables and indexes over data stored in the Hadoop Distributed File System (HDFS). This enables users of Hive to interrogate datasets
through an SQL-like query language called HiveQL. Hive's query language does not
support all the functions of standard SQL-92, but it does provide some features specific to the MapReduce paradigm, such as the ability to write the results of a single query to multiple tables. Hive supports native data types based mostly on the types available in Java: various integer formats, floating-point numbers, strings, and
more. Hive also provides support for arrays, maps, and custom structs.
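As an illustration, a hypothetical table definition might combine these primitive and complex types:

    -- Hypothetical table mixing primitive types (BIGINT, INT, FLOAT, STRING)
    -- with Hive's complex types: ARRAY, MAP, and STRUCT.
    CREATE TABLE user_profiles (
      user_id   BIGINT,
      age       INT,
      score     FLOAT,
      name      STRING,
      interests ARRAY<STRING>,
      settings  MAP<STRING, STRING>,
      address   STRUCT<street:STRING, city:STRING, zip:STRING>
    );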
Hive can access a variety of formats natively, including text data, Hadoop Sequence
Files, and the columnar RCFile format. Data files can be managed completely by Hive itself, or Hive can reference data at external locations, making it possible for Hive to coexist with existing MapReduce applications and workflows. It is also possible to improve query performance by applying partition and index information to Hive tables.
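For example, a sketch of an external, partitioned table over data that already lives in HDFS might look like the following (the paths, table, and column names are hypothetical):

    -- Hive reads these files in place rather than taking ownership of them,
    -- so existing MapReduce jobs can keep writing to the same location.
    CREATE EXTERNAL TABLE web_logs (
      ip     STRING,
      url    STRING,
      status INT
    )
    PARTITIONED BY (log_date STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    STORED AS TEXTFILE
    LOCATION '/data/web_logs';

    -- Each partition directory must be registered before it can be queried;
    -- queries that filter on log_date then scan only the matching partitions.
    ALTER TABLE web_logs ADD PARTITION (log_date = '2013-01-15')
      LOCATION '/data/web_logs/2013-01-15';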
Because Hive is an interface to Hadoop, it is possible to create user-defined functions (UDFs) that extend HiveQL with custom processing logic.
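As a sketch, a UDF packaged in a jar (the jar path and class name below are hypothetical) can be registered for a session and then used like any built-in function:

    -- Load a jar containing a custom Java UDF and expose it to HiveQL.
    ADD JAR /tmp/example-udfs.jar;
    CREATE TEMPORARY FUNCTION normalize_url
      AS 'com.example.hive.udf.NormalizeUrl';

    SELECT normalize_url(url), COUNT(*)
    FROM web_logs
    GROUP BY normalize_url(url);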