2.10. Encoding NULL Values
Problem
Sqoop encodes database NULL values using the null string constant. Your downstream
processing (Hive queries, a custom MapReduce job, or a Pig script) uses a different constant
for encoding missing values. You would like to override the default one.
Solution
You can override the NULL substitution string with the --null-string and --null-non-string
parameters to any arbitrary value. For example, use the following command to override it
to \N:
sqoop import \
  --connect jdbc:mysql://mysql.example.com/sqoop \
  --username sqoop \
  --password sqoop \
  --table cities \
  --null-string '\\N' \
  --null-non-string '\\N'
Discussion
Almost all relational databases allow you to define columns as optional, allowing the
columns to be associated with no value. This missing information is usually referred to
as the NULL value. For example, a BOOL column generally contains only two distinct
values: true and false. Additionally, the column can contain the value NULL if during
table creation you've explicitly allowed that column to be optional. Databases usually
store the NULL value as an extra bit in addition to the column's usual data. Because Sqoop
supports importing data into formats that don't natively support the NULL value (e.g., a
CSV file), it needs to encode the missing value in the data itself. By default,
Sqoop uses the string constant null (lowercased) to represent the missing value.
This default constant might not suit your needs if your data can contain the same string
constant as a regular value or if your downstream processing expects a different
substitution string constant.
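To make the ambiguity concrete, here is a minimal shell sketch. The rows are invented for
illustration and are not real Sqoop output; the id,country,city layout and the helper name
count_missing are assumptions. It counts how many fields a consumer would treat as missing
under the default encoding and under \N.

```shell
# Hypothetical rows in the shape id,country,city (invented for illustration).
# Row 2 holds a city whose name is literally "null"; row 3 has a missing country.
default_encoding='1,USA,Palo Alto
2,USA,null
3,null,Brno'

# The same rows, assuming the NULL substitution string was overridden to \N.
custom_encoding='1,USA,Palo Alto
2,USA,null
3,\N,Brno'

# Count the fields that are exactly equal to the given substitution string.
# $1 = data, $2 = substitution string (matched literally, not as a regex)
count_missing() {
  printf '%s\n' "$1" | tr ',' '\n' | grep -F -x -c -- "$2"
}

default_count=$(count_missing "$default_encoding" 'null')
custom_count=$(count_missing "$custom_encoding" '\N')

echo "fields read as missing with null: $default_count"
echo "fields read as missing with \\N:   $custom_count"
```

With the default constant, the real city name and the missing country are counted alike,
so the consumer sees one more missing field than the table actually has. With \N, only the
missing country is counted.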
To allow easier integration with additional Hadoop ecosystem components, Sqoop
distinguishes between two different cases when dealing with missing values. For text-based
columns that are defined with type VARCHAR, CHAR, NCHAR, TEXT, and a few others, you
can override the default substitution string using the parameter --null-string. For all
other column types, you can override the substitution string with the --null-non-string
parameter. Some of the connectors might not support different substitution
strings for different column types and thus might require you to specify the same value
in both parameters.
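As a rough sketch of the two-parameter case, the following shell snippet assembles an import
command that uses one substitution string for text-based columns and another for all other
columns. The values UNKNOWN and \N are hypothetical choices made only for illustration, and
the connection details are reused from the Solution. The snippet only prints the command so
it can be reviewed; it does not run Sqoop.

```shell
# Hypothetical substitution strings, chosen only for illustration.
null_string='UNKNOWN'    # for text-based columns (VARCHAR, CHAR, NCHAR, TEXT, ...)
null_non_string='\\N'    # for all other column types

# Assemble the command as text so it can be reviewed before it is run.
import_command="sqoop import \
--connect jdbc:mysql://mysql.example.com/sqoop \
--username sqoop \
--password sqoop \
--table cities \
--null-string '$null_string' \
--null-non-string '$null_non_string'"

printf '%s\n' "$import_command"
```

If the connector in use does not support different substitution strings per column type,
set both variables to the same value.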