Comparison with Traditional Databases
Although Hive resembles a traditional database in many ways (such as supporting a SQL
interface), its original HDFS and MapReduce underpinnings mean that there are a number
of architectural differences that have directly influenced the features that Hive supports.
Over time, however, these limitations have been (and continue to be) removed, with the
result that Hive looks and feels more like a traditional database with every year that passes.
Schema on Read Versus Schema on Write
In a traditional database, a table's schema is enforced at data load time. If the data being
loaded doesn't conform to the schema, then it is rejected. This design is sometimes called
schema on write because the data is checked against the schema when it is written into the
database.
Hive, on the other hand, doesn't verify the data when it is loaded, but rather when a query
is issued. This is called schema on read.
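As a minimal sketch of what this means in practice (the table name, file path, and columns here are invented for illustration), a Hive load is just a file move, and parsing is deferred until the data is read:

    -- Hypothetical table over tab-delimited text
    CREATE TABLE logs (host STRING, bytes INT)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

    -- The load simply moves the file into Hive's warehouse directory;
    -- nothing is parsed or validated at this point
    LOAD DATA LOCAL INPATH '/tmp/logs.tsv' INTO TABLE logs;

    -- Parsing happens here, at query time; with the default SerDe, a row
    -- whose second field is not a valid integer yields NULL for bytes
    -- rather than being rejected
    SELECT host, bytes FROM logs;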
There are trade-offs between the two approaches. Schema on read makes for a very fast initial load, since the data does not have to be read, parsed, and serialized to disk in the database's internal format. The load operation is just a file copy or move. It is more flexible, too: consider having two schemas for the same underlying data, depending on the analysis being performed. (This is possible in Hive using external tables; see Managed Tables and External Tables.)
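As a sketch of that flexibility (the location and column names below are assumptions, not from the text), two external tables can impose different schemas on the same files, and neither definition touches the data:

    -- One schema: each line as a single string
    CREATE EXTERNAL TABLE raw_events (line STRING)
    LOCATION '/data/events';

    -- A second schema over the very same files, parsed as delimited fields
    CREATE EXTERNAL TABLE parsed_events (ts STRING, user_id STRING, action STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    LOCATION '/data/events';

Dropping either external table removes only its metadata; the underlying files in /data/events are left untouched.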
Schema on write makes for faster query-time performance, because the database can index columns and perform compression on the data. The trade-off, however, is that it takes longer to load data into the database. Furthermore, there are many scenarios where the schema is not known at load time, so there are no indexes to apply, since the queries have not been formulated yet. These scenarios are where Hive shines.
Updates, Transactions, and Indexes
Updates, transactions, and indexes are mainstays of traditional databases. Yet, until recently, these features have not been considered a part of Hive's feature set. This is because Hive was built to operate over HDFS data using MapReduce, where full-table scans are the norm and a table update is achieved by transforming the data into a new table. For a data warehousing application that runs over large portions of the dataset, this works well.
Hive has long supported adding new rows in bulk to an existing table by using INSERT INTO to add new data files to a table. From release 0.14.0, finer-grained changes are possible: small batches of rows can be inserted with INSERT INTO TABLE ... VALUES, and existing rows can be modified or removed with UPDATE and DELETE on tables configured for transactions.
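The sketch below contrasts the two styles; the table names are invented, and the fine-grained statements assume a table that meets Hive's ACID requirements (stored as ORC, bucketed, and marked transactional):

    -- Bulk append: each statement adds new data files to the table
    INSERT INTO TABLE page_views
    SELECT * FROM staging_page_views;

    -- Fine-grained changes (release 0.14.0 and later) need a transactional table
    CREATE TABLE accounts (id INT, balance DECIMAL(10,2))
    CLUSTERED BY (id) INTO 4 BUCKETS
    STORED AS ORC
    TBLPROPERTIES ('transactional' = 'true');

    INSERT INTO TABLE accounts VALUES (1, 100.00), (2, 250.50);
    UPDATE accounts SET balance = 125.00 WHERE id = 1;
    DELETE FROM accounts WHERE id = 2;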