Adding Structure with Hive - Microsoft Big Data Solutions


NOTE

The default delimiters can be overridden when the table is created. This is
useful when you are dealing with text files that use different delimiters, but
are still formatted in a very similar way. The options for that are shown in
the section “Creating Tables” in this chapter.

Table 6.2 Hive Default Delimiters for Text Files

Delimiter   Octal Code   Description
\n          \012         New line character; this delimits rows in a text file.
^A          \001         Separates columns in each row.
^B          \002         Separates elements in an ARRAY, STRUCT, and key/value pairs in a MAP.
^C          \003         Separates the key from the value in a MAP column.
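To make the roles of these delimiters concrete, here is a minimal Python sketch that splits one row by hand using the default characters from Table 6.2. It only imitates the layout of a delimited text row and is not Hive's own parser; the sample row and the helper names (split_row, split_array, split_map) are hypothetical.

```python
# Hive's default text-file delimiters, as listed in Table 6.2.
ROW_DELIM = "\n"      # octal \012, ends each row
FIELD_DELIM = "\x01"  # ^A, octal \001, separates columns
ITEM_DELIM = "\x02"   # ^B, octal \002, separates collection elements
KEY_DELIM = "\x03"    # ^C, octal \003, separates a MAP key from its value


def split_row(line):
    """Split one row into its raw column strings."""
    return line.split(FIELD_DELIM)


def split_array(raw):
    """Split a raw ARRAY column into its elements."""
    return raw.split(ITEM_DELIM)


def split_map(raw):
    """Split a raw MAP column into a dict of key/value strings."""
    pairs = (item.split(KEY_DELIM, 1) for item in raw.split(ITEM_DELIM))
    return {key: value for key, value in pairs}


# Hypothetical row: a name, an ARRAY of phone numbers, a MAP of attributes.
line = "Ann\x01555-0100\x02555-0101\x01city\x03Seattle\x02state\x03WA"
name, phones, attrs = split_row(line)
print(name)                 # Ann
print(split_array(phones))  # ['555-0100', '555-0101']
print(split_map(attrs))     # {'city': 'Seattle', 'state': 'WA'}
```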

What if one of the many text files that is accessed through a Hive table uses
a different value as a column delimiter? In that case, Hive won't be able to
parse the file accurately. The exact results will vary depending on exactly how
the text file is formatted, and how the Hive table was configured. However,
it's likely that Hive will find fewer than the expected number of columns in
the text file. In this case, it will fill in the columns it finds values for,
and then output null values for any “missing” columns.
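The effect of a mismatched column delimiter can be imitated in a few lines of Python. This is only a sketch of the behavior described above, not Hive's implementation; read_columns is a hypothetical helper, and it assumes a table with three columns that expects the default ^A delimiter.

```python
def read_columns(line, delimiter, column_count):
    """Split a line, then pad with None for any columns that were not found."""
    values = line.split(delimiter)
    return values + [None] * (column_count - len(values))


# The table expects ^A (\x01) between columns, but this file uses commas,
# so the whole line lands in the first column and the rest come back as None.
print(read_columns("1,Ann,Seattle", "\x01", 3))
# ['1,Ann,Seattle', None, None]

# A correctly delimited line fills every column.
print(read_columns("1\x01Ann\x01Seattle", "\x01", 3))
# ['1', 'Ann', 'Seattle']
```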

The same thing will happen if the data values in the files don't match the
data type defined on the Hive table. If a file contains alphanumeric characters
where Hive is expecting only numeric values, it will return null values. This
enables Hive to be resilient to data quality issues with the files stored in
Hadoop.
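The same null-on-mismatch idea can be sketched in Python for a numeric column. The helper to_int_or_null is hypothetical and only mirrors the behavior described above; it does not reproduce Hive's actual conversion rules.

```python
def to_int_or_null(raw):
    """Return the value as an int, or None when it is not a valid integer."""
    try:
        return int(raw)
    except ValueError:
        return None


# Raw text values for a column that the table defines as an integer type.
raw_values = ["42", "4x2", "", "7"]
print([to_int_or_null(v) for v in raw_values])
# [42, None, None, 7]
```

The query still returns a row for every line in the file; only the values that cannot be converted are replaced.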

Some data, however, isn't stored as text. Binary file formats can be faster
and more efficient than text formats, as the data takes less space in the
files. If the data is stored in a smaller number of bytes, more of it can be
read from the disk in a single read operation, and more of it can fit in
memory. This can improve performance, particularly in a big data system.
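As a rough illustration of the size difference, the following Python sketch stores the same three integers as delimited text and as fixed-width binary. The values are made up, and the 4-byte little-endian layout is an assumption chosen for the example, not the layout of any particular Hive file format.

```python
import struct

# Three hypothetical 10-digit values that each fit in a signed 32-bit integer.
values = [1234567890, 2000000000, 1999999999]

# Text: one value per line, 10 digit characters plus a newline each.
as_text = "\n".join(str(v) for v in values).encode("ascii") + b"\n"

# Binary: three 4-byte little-endian signed integers, no delimiters needed.
as_binary = struct.pack("<3i", *values)

print(len(as_text))    # 33
print(len(as_binary))  # 12
```

The saving depends on the data: small numbers such as 7 take fewer bytes as text than as a fixed 4-byte integer, so this shows the general idea rather than a guaranteed ratio.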
