Database Reference
In-Depth Information
After generating the deserialization code and configuring the InputFormat , Sqoop
sends the job to the MapReduce cluster. Map tasks execute the queries and deserialize
rows from the ResultSet into instances of the generated class, which are either stored
directly in SequenceFile s or transformed into delimited text before being written to
HDFS.
Controlling the Import
Sqoop does not need to import an entire table at a time. For example, a subset of the
table's columns can be specified for import. Users can also specify a WHERE clause to in-
clude in queries via the --where argument, which bounds the rows of the table to im-
port. For example, if widgets 0 through 99,999 were imported last month, but this month
our vendor catalog included 1,000 new types of widget, an import could be configured
with the clause WHERE id >= 100000 ; this will start an import job to retrieve all the
new rows added to the source database since the previous import run. User-supplied
WHERE clauses are applied before task splitting is performed, and are pushed down into
the queries executed by each task.
For more control — to perform column transformations, for example — users can specify
a --query argument.
Imports and Consistency
When importing data to HDFS, it is important that you ensure access to a consistent snap-
shot of the source data. (Map tasks reading from a database in parallel are running in sep-
arate processes. Thus, they cannot share a single database transaction.) The best way to do
this is to ensure that any processes that update existing rows of a table are disabled during
the import.
Incremental Imports
It's common to run imports on a periodic basis so that the data in HDFS is kept synchron-
ized with the data stored in the database. To do this, there needs to be some way of identi-
fying the new data. Sqoop will import rows that have a column value (for the column spe-
cified with --check-column ) that is greater than some specified value (set via --
last-value ).
The value specified as --last-value can be a row ID that is strictly increasing, such
as an AUTO_INCREMENT primary key in MySQL. This is suitable for the case where
new rows are added to the database table, but existing rows are not updated. This mode is
called append mode, and is activated via --incremental append . Another option is
Search WWH ::




Custom Search