CREATE TABLE rcfile_table(
name string
) STORED AS RCFILE;
Using Additional Data Sources with Hive
Hive doesn't just open up a world of relational queries to users. An attractive feature of
Hive is its ability to access and even write to other types of data sources.
There are many key-value data stores, but one that is well known to Hadoop users is HBase. HBase is a great solution when you have lots of data coming into the system and need to retrieve records quickly by key. Hive can be configured to use HBase as a data source. For more information on when to use nonrelational and key-value data stores, see Chapter 3, “Building a NoSQL-Based Web App to Collect Crowd-Sourced Data.”
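As a sketch of how this integration looks in practice, the following HiveQL statement maps an existing HBase table into Hive using the HBase storage handler that ships with Hive; the table and column names here are illustrative, not from a real deployment.
CREATE EXTERNAL TABLE hbase_events(rowkey string, value string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf:value")
TBLPROPERTIES ("hbase.table.name" = "events");
Once the table is defined, it can be queried with ordinary HiveQL, and reads are passed through to HBase.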
Another useful way to interact with Hive is through drivers. Hive is an open-source project, and JDBC and ODBC drivers are available that allow external programs to query Hive as they would a conventional database. Finally, Hive can interact directly with data coming from Thrift servers (for more about Thrift, see Chapter 2, “Hosting and Sharing Terabytes of Raw Data”).
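As an illustration, the short Java program below uses the HiveServer2 JDBC driver to run a query against the rcfile_table defined earlier. This is a minimal sketch; the host, port, and credentials are placeholder values for a hypothetical local HiveServer2 instance.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {
    public static void main(String[] args) throws Exception {
        // Register the HiveServer2 JDBC driver, which ships with the Hive distribution.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        // Placeholder host, port, and credentials; adjust for your cluster.
        Connection conn = DriverManager.getConnection(
            "jdbc:hive2://localhost:10000/default", "hive", "");
        Statement stmt = conn.createStatement();
        ResultSet rs = stmt.executeQuery("SELECT name FROM rcfile_table LIMIT 10");
        while (rs.next()) {
            System.out.println(rs.getString(1));
        }
        rs.close();
        stmt.close();
        conn.close();
    }
}
ODBC drivers serve the same role for tools such as spreadsheets and business-intelligence applications, which can connect to Hive as if it were a conventional SQL database.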
Shark: Queries at the Speed of RAM
Apache Hive is essentially a system to translate SQL-like queries into Hadoop
MapReduce jobs. MapReduce is meant to be used as a batch process, intended more
for f lexibility than raw speed. This isn't the optimal underlying design for interactive
queries, in which user results are iterative. One of the design goals of Hadoop—that
data is sharded across disks on many machines and processing happens on the same
nodes—means that data tends to be read and written to disk quite often. In terms of
performance, disk I/O is often one of the main bottlenecks for data processing tasks.
Hive queries can often result in a multistage MapReduce process, meaning that in the
course of a single query there can be plenty of disk reads and writes.
Recognizing that Hadoop isn't the optimal tool for every data use case, some developers have been rethinking the underlying technology used for distributed processing. An exciting new development in the open-source data world is Spark, a project created at the UC Berkeley AMPLab. Spark is a distributed processing framework like Hadoop, but it attempts to use system memory to improve performance.
Spark's core data model is based on objects called Resilient Distributed Datasets, or
RDDs. An RDD can be held in system memory, making it available without the need for disk
access. Spark is currently on the list of Apache Incubator projects, which is a step
toward it eventually becoming an officially supported Apache project like Hadoop
and Hive, and it is mature enough that it is being used in production by several well-
known technology companies.
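To make the RDD concept concrete, here is a minimal sketch using Spark's Java API. It loads a text file into an RDD, caches it in memory, and runs two passes over the data, with the second pass served from RAM rather than disk; the input path and application name are illustrative.
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;

public class RDDExample {
    public static void main(String[] args) {
        // Run against a local, in-process Spark master for demonstration.
        JavaSparkContext sc = new JavaSparkContext("local", "RDDExample");

        // Load a text file into an RDD and ask Spark to cache it in memory.
        JavaRDD<String> lines = sc.textFile("hdfs:///data/events.log").cache();

        // The first action reads from disk and populates the in-memory cache.
        long total = lines.count();

        // The second action is served from the cached, in-memory copy.
        long errors = lines.filter(new Function<String, Boolean>() {
            public Boolean call(String line) {
                return line.contains("ERROR");
            }
        }).count();

        System.out.println(total + " lines, " + errors + " errors");
        sc.stop();
    }
}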