Database Reference
In-Depth Information
NOTE
The load operation is very fast because it is just a move or rename within a filesystem. However, bear in
mind that Hive does not check that the files in the table directory conform to the schema declared for the
table, even for managed tables. If there is a mismatch, this will become apparent at query time, often by
the query returning NULL for a missing field. You can check that the data is being parsed correctly by is-
suing a simple SELECT statement to retrieve a few rows directly from the table.
If the table is later dropped, using:
DROP TABLE managed_table;
the table, including its metadata and its data, is deleted. It bears repeating that since the
initial LOAD performed a move operation, and the DROP performed a delete operation, the
data no longer exists anywhere. This is what it means for Hive to manage the data.
An external table behaves differently. You control the creation and deletion of the data.
The location of the external data is specified at table creation time:
CREATE EXTERNAL TABLE external_table (dummy STRING)
LOCATION '/user/tom/external_table';
LOAD DATA INPATH '/user/tom/data.txt' INTO TABLE external_table;
With the EXTERNAL keyword, Hive knows that it is not managing the data, so it doesn't
move it to its warehouse directory. Indeed, it doesn't even check whether the external loc-
ation exists at the time it is defined. This is a useful feature because it means you can cre-
ate the data lazily after creating the table.
When you drop an external table, Hive will leave the data untouched and only delete the
metadata.
So how do you choose which type of table to use? In most cases, there is not much differ-
ence between the two (except of course for the difference in DROP semantics), so it is a
just a matter of preference. As a rule of thumb, if you are doing all your processing with
Hive, then use managed tables, but if you wish to use Hive and other tools on the same
dataset, then use external tables. A common pattern is to use an external table to access an
initial dataset stored in HDFS (created by another process), then use a Hive transform to
move the data into a managed Hive table. This works the other way around, too; an ex-
ternal table (not necessarily on HDFS) can be used to export data from Hive for other ap-
plications to use. [ 111 ]
Another reason for using external tables is when you wish to associate multiple schemas
with the same dataset.
Search WWH ::




Custom Search