Because infrastructure costs and capacity can be scaled to need, the economic model for cloud data warehouse applications may be more affordable.
Summary
The relational database architecture of the traditional data warehouse concept is being
challenged by new, disruptive open-source technologies. When data sizes grow very
large, commercial data warehouse solutions can be economically prohibitive for many
organizations. Some data challenges, like those involving completely unstructured
data, do not lend themselves to the use of relational tables, star schemas, and complicated ETL processes inherent in the enterprise database world.
The Apache Hadoop project provides a framework for processing data using clusters of commodity hardware. Hadoop, along with the underlying Hadoop Distributed
File System (HDFS), is able to scale horizontally as more and more data is added to
the system. The processing model that Hadoop provides, MapReduce, is designed to
enable data processing to take place as near as possible to the distributed data storage.
This makes complex batch processing of data across the cluster possible, typically implemented as streaming MapReduce scripts, Apache Pig workflows, or complete applications written in a high-level language such as Java.
Hadoop isn't just for raw data processing; MapReduce can also be used to answer aggregate queries involving sums, grouping, joins, and other functions. However, querying datasets is often an iterative process. A single query can require multiple MapReduce jobs to produce a result, and it is cumbersome to iteratively develop code that defines complex MapReduce workflows.
The open-source Apache Hive project, originally started at Facebook, was created to
provide an SQL-like interface for Hadoop in order to speed up the process of writing
iterative queries over data stored in HDFS.
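As a minimal sketch (the table and column names here are hypothetical), an aggregation that would otherwise require a hand-written MapReduce job can be expressed in a few lines of HiveQL:

    -- Count views per user from a hypothetical page_views table; Hive
    -- compiles this statement into one or more MapReduce jobs automatically.
    SELECT user_id, COUNT(*) AS view_count
    FROM page_views
    GROUP BY user_id
    ORDER BY view_count DESC
    LIMIT 10;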
Hive is a project that applies some of the concepts from data warehousing to the
Hadoop framework. Unlike traditional data warehousing applications built using
relational databases, Hive defines tables and indexes over data stored in the Hadoop Distributed File System (HDFS). This enables users of Hive to interrogate datasets
through an SQL-like query language called HiveQL. Hive's query language does not
support all the functions of standard SQL-92, but it does provide some features specific to the MapReduce paradigm, such as the ability to write the results of a single query to multiple tables. Hive supports native data types based mostly on the types available in Java: various integer formats, floating-point numbers, strings, and
more. Hive also provides support for arrays, maps, and custom structs.
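As an illustration, a hypothetical table definition might combine these primitive and complex types:

    -- Hypothetical table mixing primitive types (BIGINT, INT, FLOAT, STRING)
    -- with Hive's complex types: ARRAY, MAP, and STRUCT.
    CREATE TABLE user_profiles (
      user_id   BIGINT,
      age       INT,
      score     FLOAT,
      name      STRING,
      interests ARRAY<STRING>,
      settings  MAP<STRING, STRING>,
      address   STRUCT<street:STRING, city:STRING, zip:STRING>
    );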
Hive can access a variety of formats natively, including text data, Hadoop Sequence
Files, and the columnar RCFile format. Data files can be managed completely by Hive itself, or Hive can reference data at external locations, making it possible for Hive to coexist with existing MapReduce applications and workflows. It is also possible to improve query performance by applying partition and index information to Hive tables.
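For example, a sketch of an external, partitioned table over data that already lives in HDFS might look like the following (the paths, table, and column names are hypothetical):

    -- Hive reads these files in place rather than taking ownership of them,
    -- so existing MapReduce jobs can keep writing to the same location.
    CREATE EXTERNAL TABLE web_logs (
      ip     STRING,
      url    STRING,
      status INT
    )
    PARTITIONED BY (log_date STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    STORED AS TEXTFILE
    LOCATION '/data/web_logs';

    -- Each partition directory must be registered before it can be queried;
    -- queries that filter on log_date then scan only the matching partitions.
    ALTER TABLE web_logs ADD PARTITION (log_date = '2013-01-15')
      LOCATION '/data/web_logs/2013-01-15';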
Because Hive is an interface to Hadoop, it is possible to create user-defined functions (UDFs) that extend HiveQL with custom processing logic.
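As a sketch, a UDF packaged in a jar (the jar path and class name below are hypothetical) can be registered for a session and then used like any built-in function:

    -- Load a jar containing a custom Java UDF and expose it to HiveQL.
    ADD JAR /tmp/example-udfs.jar;
    CREATE TEMPORARY FUNCTION normalize_url
      AS 'com.example.hive.udf.NormalizeUrl';

    SELECT normalize_url(url), COUNT(*)
    FROM web_logs
    GROUP BY normalize_url(url);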