CREATE TABLE rcfile_table(
name string
) STORED AS RCFILE;
Using Additional Data Sources with Hive
Hive doesn't just open up a world of relational queries to users. An attractive feature of
Hive is its ability to access and even write to other types of data sources.
There are many key-value data stores, but one that is well known to Hadoop users is HBase. HBase is a great solution when you have lots of data coming into the system and need to retrieve records quickly by key. Hive can be configured to use HBase as a data source. For more information on when to use nonrelational and key-value data stores, see Chapter 3, “Building a NoSQL-Based Web App to Collect Crowd-Sourced Data.”
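As a sketch of how this integration looks in practice, the following HiveQL statement maps an existing HBase table into Hive using the HBase storage handler that ships with Hive; the table and column names here are illustrative, not from a real deployment.
CREATE EXTERNAL TABLE hbase_events(rowkey string, value string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf:value")
TBLPROPERTIES ("hbase.table.name" = "events");
Once the table is defined, it can be queried with ordinary HiveQL, and reads are passed through to HBase.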
Another useful way to interact with Hive is through drivers. Hive is an open-source project, and JDBC and ODBC drivers are available that allow external programs to query Hive as they would a conventional database. Finally, Hive can interact directly with data coming from Thrift servers (for more about Thrift, see Chapter 2, “Hosting and Sharing Terabytes of Raw Data”).
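As an illustration, the short Java program below uses the HiveServer2 JDBC driver to run a query against the rcfile_table defined earlier. This is a minimal sketch; the host, port, and credentials are placeholder values for a hypothetical local HiveServer2 instance.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {
    public static void main(String[] args) throws Exception {
        // Register the HiveServer2 JDBC driver, which ships with the Hive distribution.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        // Placeholder host, port, and credentials; adjust for your cluster.
        Connection conn = DriverManager.getConnection(
            "jdbc:hive2://localhost:10000/default", "hive", "");
        Statement stmt = conn.createStatement();
        ResultSet rs = stmt.executeQuery("SELECT name FROM rcfile_table LIMIT 10");
        while (rs.next()) {
            System.out.println(rs.getString(1));
        }
        rs.close();
        stmt.close();
        conn.close();
    }
}
ODBC drivers serve the same role for tools such as spreadsheets and business-intelligence applications, which can connect to Hive as if it were a conventional SQL database.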
Shark: Queries at the Speed of RAM
Apache Hive is essentially a system to translate SQL-like queries into Hadoop
MapReduce jobs. MapReduce is meant to be used as a batch process, intended more
for f lexibility than raw speed. This isn't the optimal underlying design for interactive
queries, in which user results are iterative. One of the design goals of Hadoop—that
data is sharded across disks on many machines and processing happens on the same
nodes—means that data tends to be read and written to disk quite often. In terms of
performance, disk I/O is often one of the main bottlenecks for data processing tasks.
Hive queries can often result in a multistage MapReduce process, meaning that in the
course of a single query there can be plenty of disk reads and writes.
Recognizing that Hadoop isn't the optimal tool for every data use case, some developers have been rethinking the underlying technology used for distributed processing. An exciting new development in the open-source data world is Spark, a project created at the UC Berkeley AMPLab. Spark is a distributed processing framework like Hadoop, but it attempts to use system memory to improve performance.
Spark's core data model is based on objects called Resilient Distributed Datasets, or
RDDs. An RDD can be held in system memory, making it available without the need for disk
access. Spark is currently on the list of Apache Incubator projects, which is a step
toward it eventually becoming an officially supported Apache project like Hadoop
and Hive, and it is mature enough that it is being used in production by several well-
known technology companies.
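To make the RDD concept concrete, here is a minimal sketch using Spark's Java API. It loads a text file into an RDD, caches it in memory, and runs two passes over the data, with the second pass served from RAM rather than disk; the input path and application name are illustrative.
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;

public class RDDExample {
    public static void main(String[] args) {
        // Run against a local, in-process Spark master for demonstration.
        JavaSparkContext sc = new JavaSparkContext("local", "RDDExample");

        // Load a text file into an RDD and ask Spark to cache it in memory.
        JavaRDD<String> lines = sc.textFile("hdfs:///data/events.log").cache();

        // The first action reads from disk and populates the in-memory cache.
        long total = lines.count();

        // The second action is served from the cached, in-memory copy.
        long errors = lines.filter(new Function<String, Boolean>() {
            public Boolean call(String line) {
                return line.contains("ERROR");
            }
        }).count();

        System.out.println(total + " lines, " + errors + " errors");
        sc.stop();
    }
}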