Hive - Hadoop: The Definitive Guide

Database Reference

In-Depth Information

NOTE

The load operation is very fast because it is just a move or rename within a filesystem. However, bear in

mind that Hive does not check that the files in the table directory conform to the schema declared for the

table, even for managed tables. If there is a mismatch, this will become apparent at query time, often by

the query returning NULL for a missing field. You can check that the data is being parsed correctly by is-

suing a simple SELECT statement to retrieve a few rows directly from the table.

If the table is later dropped, using:

DROP TABLE managed_table;

the table, including its metadata and its data, is deleted. It bears repeating that since the

initial LOAD performed a move operation, and the DROP performed a delete operation, the

data no longer exists anywhere. This is what it means for Hive to manage the data.

An external table behaves differently. You control the creation and deletion of the data.

The location of the external data is specified at table creation time:

CREATE EXTERNAL TABLE external_table (dummy STRING)

LOCATION '/user/tom/external_table';

LOAD DATA INPATH '/user/tom/data.txt' INTO TABLE external_table;

With the EXTERNAL keyword, Hive knows that it is not managing the data, so it doesn't

move it to its warehouse directory. Indeed, it doesn't even check whether the external loc-

ation exists at the time it is defined. This is a useful feature because it means you can cre-

ate the data lazily after creating the table.

When you drop an external table, Hive will leave the data untouched and only delete the

metadata.

So how do you choose which type of table to use? In most cases, there is not much differ-

ence between the two (except of course for the difference in DROP semantics), so it is a

just a matter of preference. As a rule of thumb, if you are doing all your processing with

Hive, then use managed tables, but if you wish to use Hive and other tools on the same

dataset, then use external tables. A common pattern is to use an external table to access an

initial dataset stored in HDFS (created by another process), then use a Hive transform to

move the data into a managed Hive table. This works the other way around, too; an ex-

ternal table (not necessarily on HDFS) can be used to export data from Hive for other ap-

plications to use. [ 111 ]

Another reason for using external tables is when you wish to associate multiple schemas

with the same dataset.

Search WWH ::

Custom Search

Home