Databases Reference
In-Depth Information
6.8. Using the Correct NULL String in Hive
Problem
You've imported your data into Hive using Sqoop and now you're trying to query it.
However, you can see that some columns contain the correct
NULL
value but some con‐
tain the string literal
null
and are not selected using the expression
column IS NULL
.
Solution
Due to differences in the default
NULL
substitution string between Sqoop and Hive, you
have to override the Sqoop default substitution strings to be compatible with Hive. For
example:
sqoop import
\
--connect jdbc:mysql://mysql.example.com/sqoop
\
--username sqoop
\
--password sqoop
\
--table cities
\
--hive-import
\
--null-string
'\\N'
\
--null-non-string
'\\N'
Discussion
Hive, by default, expects that the
NULL
value will be encoded using the string constant
\N
. Sqoop, by default, encodes it using the string constant
null
. To rectify the mismatch,
you'll need to override Sqoop's default behavior with Hive's.
This issue with different
NULL
substitution strings is very tricky to debug. Depending
on the way Hive parses data, it seems to be working sometimes and at other times does
not. When Hive is not able to parse certain columns, it will return
NULL
instead of
throwing an exception and failing the entire query execution. Let's investigate this a bit
further with an example. Consider the following table:
CREATE
TABLE
tbl
(
id
int
,
txt
varchar
(
50
));
INSERT
INTO
tbl
VALUES
(
NULL
,
NULL
);
The table
tbl
has two columns, one numeric and one text, with a row that has
NULL
stored in both columns. Without any special parameters, Sqoop will import this table
as a file with a single row
null,null
.
When Hive reads this line, it will first separate each column using the comma as a
separator. Subsequently, it will start processing the value for each column. The first
column is of type
int
and contains the string constant
null
, which is not a valid number
value. Instead of throwing a parsing exception at this point, Hive will substitute
NULL
for this cell. The second column is of type
string
and contains the string constant