Comparison with Traditional Databases
Although Hive resembles a traditional database in many ways (such as supporting a SQL
interface), its original HDFS and MapReduce underpinnings mean that there are a number
of architectural differences that have directly influenced the features that Hive supports.
Over time, however, these limitations have been (and continue to be) removed, with the
result that Hive looks and feels more like a traditional database with every year that passes.
Schema on Read Versus Schema on Write
In a traditional database, a table's schema is enforced at data load time. If the data being
loaded doesn't conform to the schema, then it is rejected. This design is sometimes called
schema on write because the data is checked against the schema when it is written into the
database.
Hive, on the other hand, doesn't verify the data when it is loaded, but rather when a query
is issued. This is called schema on read.
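As a minimal sketch of what this means in practice (the table name, file path, and columns here are invented for illustration), a Hive load is just a file move, and parsing is deferred until the data is read:

    -- Hypothetical table over tab-delimited text
    CREATE TABLE logs (host STRING, bytes INT)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

    -- The load simply moves the file into Hive's warehouse directory;
    -- nothing is parsed or validated at this point
    LOAD DATA LOCAL INPATH '/tmp/logs.tsv' INTO TABLE logs;

    -- Parsing happens here, at query time; with the default SerDe, a row
    -- whose second field is not a valid integer yields NULL for bytes
    -- rather than being rejected
    SELECT host, bytes FROM logs;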
There are trade-offs between the two approaches. Schema on read makes for a very fast initial load, since the data does not have to be read, parsed, and serialized to disk in the database's internal format. The load operation is just a file copy or move. It is more flexible, too: consider having two schemas for the same underlying data, depending on the analysis being performed. (This is possible in Hive using external tables; see Managed Tables and External Tables.)
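As a sketch of that flexibility (the location and column names below are assumptions, not from the text), two external tables can impose different schemas on the same files, and neither definition touches the data:

    -- One schema: each line as a single string
    CREATE EXTERNAL TABLE raw_events (line STRING)
    LOCATION '/data/events';

    -- A second schema over the very same files, parsed as delimited fields
    CREATE EXTERNAL TABLE parsed_events (ts STRING, user_id STRING, action STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    LOCATION '/data/events';

Dropping either external table removes only its metadata; the underlying files in /data/events are left untouched.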
Schema on write makes for faster query-time performance, because the database can index columns and perform compression on the data. The trade-off, however, is that it takes longer to load data into the database. Furthermore, there are many scenarios where the schema is not known at load time, so there are no indexes to apply, since the queries have not been formulated yet. These scenarios are where Hive shines.
Updates, Transactions, and Indexes
Updates, transactions, and indexes are mainstays of traditional databases. Yet, until recently, these features have not been considered a part of Hive's feature set. This is because Hive was built to operate over HDFS data using MapReduce, where full-table scans are the norm and a table update is achieved by transforming the data into a new table. For a data warehousing application that runs over large portions of the dataset, this works well.
Hive has long supported adding new rows in bulk to an existing table by using INSERT INTO to add new data files to a table. From release 0.14.0, finer-grained changes are possible: small batches of rows can be inserted with INSERT INTO TABLE ... VALUES, and existing rows can be modified or removed with UPDATE and DELETE on tables configured for transactions.
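The sketch below contrasts the two styles; the table names are invented, and the fine-grained statements assume a table that meets Hive's ACID requirements (stored as ORC, bucketed, and marked transactional):

    -- Bulk append: each statement adds new data files to the table
    INSERT INTO TABLE page_views
    SELECT * FROM staging_page_views;

    -- Fine-grained changes (release 0.14.0 and later) need a transactional table
    CREATE TABLE accounts (id INT, balance DECIMAL(10,2))
    CLUSTERED BY (id) INTO 4 BUCKETS
    STORED AS ORC
    TBLPROPERTIES ('transactional' = 'true');

    INSERT INTO TABLE accounts VALUES (1, 100.00), (2, 250.50);
    UPDATE accounts SET balance = 125.00 WHERE id = 1;
    DELETE FROM accounts WHERE id = 2;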