Hive - Hadoop: The Definitive Guide

Database Reference

In-Depth Information

sible, so you can call INSERT INTO TABLE...VALUES to insert small batches of

values computed in SQL. In addition, it is possible to UPDATE and DELETE rows in a

table.

HDFS does not provide in-place file updates, so changes resulting from inserts, updates,

and deletes are stored in small delta files. Delta files are periodically merged into the base

table files by MapReduce jobs that are run in the background by the metastore. These fea-

tures only work in the context of transactions (introduced in Hive 0.13.0), so the table

they are being used on needs to have transactions enabled on it. Queries reading the table

are guaranteed to see a consistent snapshot of the table.

Hive also has support for table- and partition-level locking. Locks prevent, for example,

one process from dropping a table while another is reading from it. Locks are managed

transparently using ZooKeeper, so the user doesn't have to acquire or release them, al-

though it is possible to get information about which locks are being held via the SHOW

LOCKS statement. By default, locks are not enabled.

Hive indexes can speed up queries in certain cases. A query such as SELECT * from

t WHERE x = a , for example, can take advantage of an index on column x , since only

a small portion of the table's files need to be scanned. There are currently two index types:

compact and bitmap . (The index implementation was designed to be pluggable, so it's ex-

pected that a variety of implementations will emerge for different use cases.)

Compact indexes store the HDFS block numbers of each value, rather than each file off-

set, so they don't take up much disk space but are still effective for the case where values

are clustered together in nearby rows. Bitmap indexes use compressed bitsets to effi-

ciently store the rows that a particular value appears in, and they are usually appropriate

for low-cardinality columns (such as gender or country).

SQL-on-Hadoop Alternatives

In the years since Hive was created, many other SQL-on-Hadoop engines have emerged to

address some of Hive's limitations. Cloudera Impala , an open source interactive SQL en-

gine, was one of the first, giving an order of magnitude performance boost compared to

Hive running on MapReduce. Impala uses a dedicated daemon that runs on each datanode

in the cluster. When a client runs a query it contacts an arbitrary node running an Impala

daemon, which acts as a coordinator node for the query. The coordinator sends work to

other Impala daemons in the cluster and combines their results into the full result set for

the query. Impala uses the Hive metastore and supports Hive formats and most HiveQL

constructs (plus SQL-92), so in practice it is straightforward to migrate between the two

systems, or to run both on the same cluster.

Search WWH ::

Custom Search

Home