Hive, however, uses Hadoop as its data storage system. Therefore, the data
sits in HDFS and is accessible to anyone with access to the file system. This
does make it easier to manage the data and add new information, but you
must be aware that other processes can manipulate the data.
One of the primary differences between Hive and most relational systems
is that data in Hive can only be selected, inserted, or deleted; there is no
update capability. This is because Hive stores its data in Hadoop's file
system. As noted in Chapter 8, “Effective Big Data ETL with SSIS, Pig, and
Sqoop,” HDFS is a write-once, read-many file system. If you need to
change something in a file, you delete the original and write a new version
of the file. Because Hive manages table data using Hadoop, the same
constraints apply to Hive. There are also no row-based operations. Instead,
everything is done in bulk mode.
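The delete-and-rewrite pattern described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses the local file system as a stand-in for HDFS, and `rewrite_file` and `set_price` are hypothetical helpers, not part of Hive or Hadoop.

```python
import os
import tempfile


def rewrite_file(path, transform):
    """Apply `transform` to every line by writing a new file and swapping it in.

    The original file is never modified in place; it is replaced wholesale.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, new_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as new_file, open(path) as old_file:
        for line in old_file:
            new_file.write(transform(line))
    os.replace(new_path, path)  # drop the original, keep the new version


def set_price(line, item="widget", price="12"):
    """Hypothetical transform: change one item's price, pass other rows through."""
    name, _old_price = line.rstrip("\n").split(",")
    return f"{name},{price}\n" if name == item else line
```

Calling `rewrite_file("prices.csv", set_price)` on a two-column file of item names and prices would "update" one row, but only by producing a complete new copy of the file. Even a single-row change is a bulk operation over the whole file.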
Another key difference is that the data structure is defined up front in
traditional relational databases. The columns of a table, their data types,
and any constraints on what the column can hold are set when the table
is created. The database server enforces that any data written to the table
conforms to the rules set up when the table was created. This is referred to as
schema on write; the relational database server enforces the schema of the
data when it is written to the table. If the data does not match the defined
schema, it will not be inserted into the table.
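As a rough illustration of schema on write, the following Python sketch rejects any row that does not match a table definition at insert time. The `SCHEMA` list, the `SchemaError` class, and `insert_row` are hypothetical names chosen for this example; a real relational server does far more (constraints, type coercion, transactions).

```python
SCHEMA = [("id", int), ("name", str), ("price", float)]  # hypothetical table definition


class SchemaError(ValueError):
    """Raised when a row does not conform to the table definition."""


def insert_row(table, row, schema=SCHEMA):
    """Append `row` to `table` only if it matches the schema; otherwise reject it."""
    if len(row) != len(schema):
        raise SchemaError(f"expected {len(schema)} columns, got {len(row)}")
    for value, (column, expected_type) in zip(row, schema):
        if not isinstance(value, expected_type):
            raise SchemaError(f"column {column!r} expects {expected_type.__name__}")
    table.append(tuple(row))
```

With this sketch, `insert_row(table, (1, "widget", 9.5))` succeeds, while `insert_row(table, (2, "gadget", "cheap"))` raises `SchemaError` and leaves the table unchanged. The bad row never reaches storage.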
Because Hive doesn't control the data and can't enforce that it is written in
a specific format, it uses a different approach. It applies the schema when
the data is read out of the data storage: schema on read. As mentioned, if
the number of columns in the file is fewer than what is defined in Hive, null
values are returned for the missing columns. If the data types don't match,
null values are returned for those columns as well. The benefit of this is that
Hive queries rarely fail due to bad data in the files. However, you do have to
ensure that the data coming back is still meaningful and doesn't contain so
many null values that it isn't useful.
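The schema-on-read behavior described above can be mimicked with a small Python sketch. It is not how Hive is implemented; it simply assumes comma-delimited text and a hypothetical three-column `SCHEMA`, and it returns `None` (standing in for null) whenever a column is missing or a value does not parse as the declared type. The `null_fraction` helper is likewise an assumption, shown as one simple way to check whether the results are still meaningful.

```python
SCHEMA = [("id", int), ("name", str), ("price", float)]  # hypothetical table definition


def read_row(line, schema=SCHEMA, delimiter=","):
    """Project one raw text line onto the schema, using None where it does not fit."""
    fields = line.rstrip("\n").split(delimiter)
    row = []
    for index, (_column, convert) in enumerate(schema):
        if index >= len(fields):
            row.append(None)  # column missing from the file
            continue
        try:
            row.append(convert(fields[index]))
        except ValueError:
            row.append(None)  # value does not match the declared type
    return tuple(row)


def null_fraction(rows):
    """Share of values that came back as None across all rows."""
    values = [value for row in rows for value in row]
    return sum(value is None for value in values) / len(values) if values else 0.0
```

Here a short line such as `"2,gadget"` yields `(2, "gadget", None)` instead of an error, and a line such as `"x,thing,cheap"` yields `(None, "thing", None)`. The read never fails, which is why a check like `null_fraction` is worth running before trusting the output.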
Working with Hive
Like many Hadoop tools, Hive leverages a command-line interface (CLI)
for interaction with the service. Other tools are available, such as the Hive
Web Interface (HWI) and Beeswax, a user interface that is part of the Hue
UI for working with Hadoop. For the examples in this chapter, though, the
command line is used.