Database Reference
In-Depth Information
sible, so you can call INSERT INTO TABLE...VALUES to insert small batches of
values computed in SQL. In addition, it is possible to UPDATE and DELETE rows in a
table.
HDFS does not provide in-place file updates, so changes resulting from inserts, updates,
and deletes are stored in small delta files. Delta files are periodically merged into the base
table files by MapReduce jobs that are run in the background by the metastore. These fea-
tures only work in the context of transactions (introduced in Hive 0.13.0), so the table
they are being used on needs to have transactions enabled on it. Queries reading the table
are guaranteed to see a consistent snapshot of the table.
Hive also has support for table- and partition-level locking. Locks prevent, for example,
one process from dropping a table while another is reading from it. Locks are managed
transparently using ZooKeeper, so the user doesn't have to acquire or release them, al-
though it is possible to get information about which locks are being held via the SHOW
LOCKS statement. By default, locks are not enabled.
Hive indexes can speed up queries in certain cases. A query such as SELECT * from
t WHERE x = a , for example, can take advantage of an index on column x , since only
a small portion of the table's files need to be scanned. There are currently two index types:
compact and bitmap . (The index implementation was designed to be pluggable, so it's ex-
pected that a variety of implementations will emerge for different use cases.)
Compact indexes store the HDFS block numbers of each value, rather than each file off-
set, so they don't take up much disk space but are still effective for the case where values
are clustered together in nearby rows. Bitmap indexes use compressed bitsets to effi-
ciently store the rows that a particular value appears in, and they are usually appropriate
for low-cardinality columns (such as gender or country).
SQL-on-Hadoop Alternatives
In the years since Hive was created, many other SQL-on-Hadoop engines have emerged to
address some of Hive's limitations. Cloudera Impala , an open source interactive SQL en-
gine, was one of the first, giving an order of magnitude performance boost compared to
Hive running on MapReduce. Impala uses a dedicated daemon that runs on each datanode
in the cluster. When a client runs a query it contacts an arbitrary node running an Impala
daemon, which acts as a coordinator node for the query. The coordinator sends work to
other Impala daemons in the cluster and combines their results into the full result set for
the query. Impala uses the Hive metastore and supports Hive formats and most HiveQL
constructs (plus SQL-92), so in practice it is straightforward to migrate between the two
systems, or to run both on the same cluster.
Search WWH ::




Custom Search