Adding Structure with Hive - Microsoft Big Data Solutions

Hive, however, uses Hadoop as its data storage system. Therefore, the data

sits in HDFS and is accessible to anyone with access to the file system. This

does make it easier to manage the data and add new information, but you

must be aware that other processes can manipulate the data.

One of the primary differences between Hive and most relational systems

is that data in Hive can only be selected, inserted, or deleted; there is no

update capability. This is due to Hive using Hadoop file storage for its

data. As noted in Chapter 8, “Effective Big Data ETL with SSIS, Pig, and

Sqoop,” the Hadoop file system (HDFS) is write-once, read-many. If you need to

change something in a file, you delete the original and write a new version

of the file. Because Hive manages table data using Hadoop, the same

constraints apply to Hive. There are also no row-based operations. Instead,

everything is done in bulk mode.
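
The delete-and-rewrite pattern described above can be sketched in Python. This is only an illustration against a local file, not HDFS or Hive code; the helper name rewrite_file and the per-row transform argument are assumptions made for the example.

```python
import os
import tempfile


def rewrite_file(path, transform):
    """Change a file's contents by writing a complete new version of it.

    No row is modified in place: every row is read, passed through
    `transform`, and written to a new file that then replaces the original.
    """
    directory = os.path.dirname(os.path.abspath(path))
    with open(path, "r", encoding="utf-8") as src, \
            tempfile.NamedTemporaryFile(
                "w", dir=directory, delete=False, encoding="utf-8"
            ) as dst:
        for line in src:
            dst.write(transform(line.rstrip("\n")) + "\n")
        new_path = dst.name
    # Swap the new version in for the original (the "delete and rewrite" step).
    os.replace(new_path, path)
```

Even a change to a single value goes through the whole file, which is why the work is naturally done in bulk rather than row by row.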

Another key difference is that the data structure is defined up-front in

traditional relational databases. The columns of a table, their data types,

and any constraints on what the column can hold are set when the table

is created. The database server enforces that any data written to the table

conforms to the rules set up when the table was created. This is referred to as

schema on write; the relational database server enforces the schema of the

data when it is written to the table. If the data does not match the defined

schema, it will not be inserted into the table.
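
As a rough illustration of schema on write, the following Python sketch rejects any row that does not match the table definition. It is not how any particular database server is implemented; the names TABLE_SCHEMA and insert_row, and the three sample columns, are hypothetical.

```python
# Hypothetical table definition: column name and the type it must hold.
TABLE_SCHEMA = [("id", int), ("name", str), ("price", float)]


def insert_row(table, raw_values):
    """Validate a row against TABLE_SCHEMA before storing it (schema on write)."""
    if len(raw_values) != len(TABLE_SCHEMA):
        raise ValueError(
            f"expected {len(TABLE_SCHEMA)} columns, got {len(raw_values)}"
        )
    row = {}
    for (column, cast), raw in zip(TABLE_SCHEMA, raw_values):
        try:
            row[column] = cast(raw)
        except ValueError as exc:
            # The whole row is refused; nothing is written to the table.
            raise ValueError(f"column {column!r} rejects value {raw!r}") from exc
    table.append(row)
```

A call such as insert_row(table, ["abc", "widget", "9.5"]) raises an error and leaves the table unchanged, because "abc" is not a valid integer for the id column.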

Because Hive doesn't control the data and can't enforce that it is written in

a specific format, it uses a different approach. It applies the schema when

the data is read out of the data storage: schema on read. As mentioned, if

the number of columns in the file is less than what is defined in Hive, null

values are returned for the missing columns. If the data types don't match,

null values are returned for those columns as well. The benefit of this is that

Hive queries rarely fail due to bad data in the files. However, you do have to

ensure that the data coming back is still meaningful and doesn't contain so

many null values that it isn't useful.
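
The schema-on-read behavior described above can be imitated in a few lines of Python. This is a simplified sketch, not Hive's actual implementation: it assumes comma-delimited text, and the names READ_SCHEMA, read_row, and null_fraction are hypothetical.

```python
# Hypothetical schema applied only when the data is read.
READ_SCHEMA = [("id", int), ("name", str), ("price", float)]


def read_row(line, schema=READ_SCHEMA, delimiter=","):
    """Apply the schema to one line of text, returning None instead of failing."""
    fields = line.split(delimiter)
    row = {}
    for index, (column, cast) in enumerate(schema):
        if index >= len(fields):
            # Fewer columns in the file than in the schema: missing ones are null.
            row[column] = None
            continue
        try:
            row[column] = cast(fields[index])
        except ValueError:
            # Value does not match the declared type: that column is null.
            row[column] = None
    return row


def null_fraction(rows):
    """Share of values that came back as None, as a quick usefulness check."""
    values = [value for row in rows for value in row.values()]
    if not values:
        return 0.0
    return sum(value is None for value in values) / len(values)
```

Reading the line "1,widget" yields a row whose price is None, and reading "abc,widget,9.5" yields a row whose id is None. Neither read fails, so a check such as null_fraction is one way to notice when too much of the result is null to be meaningful.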

Working with Hive

Like many Hadoop tools, Hive leverages a command-line interface (CLI)

for interaction with the service. Other tools are available, such as the Hive

Web Interface (HWI) and Beeswax, a user interface that is part of the Hue

UI for working with Hadoop. For the examples in this chapter, though, the

command line is used.
