time-based incremental imports (specified by --incremental lastmodified),
which is appropriate when existing rows may be updated, and there is a column (the check
column) that records the last modified time of the update.
At the end of an incremental import, Sqoop will print out the value to be specified as
--last-value on the next import. This is useful when running incremental imports
manually, but for running periodic imports it is better to use Sqoop's saved job facility,
which automatically stores the last value and uses it on the next job run. Type sqoop
job --help for usage instructions for saved jobs.
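The two ways of running an incremental import described above might look roughly like the following sketch. The connection string, table name (widgets), check column (last_update), job name (widgets_incr), and timestamp are hypothetical placeholders, and other options a real import would need (credentials, target directory) are left out. By default the script only prints the commands; setting SQOOP=sqoop would run them against a real installation.

```shell
# Dry run by default: each command is printed rather than executed.
# Set SQOOP=sqoop in the environment to run against a real installation.
SQOOP="${SQOOP:-echo sqoop}"

# Hypothetical connection details, for illustration only.
CONNECT="jdbc:mysql://localhost/hadoopguide"
TABLE="widgets"
CHECK_COLUMN="last_update"

# Manual incremental import: the caller passes the value Sqoop printed
# at the end of the previous run as --last-value.
incremental_import() {
  $SQOOP import --connect "$CONNECT" --table "$TABLE" \
    --incremental lastmodified --check-column "$CHECK_COLUMN" \
    --last-value "$1"
}

# Saved job: the import definition is stored once, and Sqoop keeps track
# of the last value between runs of the job.
create_saved_job() {
  $SQOOP job --create widgets_incr -- import --connect "$CONNECT" \
    --table "$TABLE" --incremental lastmodified --check-column "$CHECK_COLUMN"
}

run_saved_job() {
  $SQOOP job --exec widgets_incr
}

incremental_import "2015-01-01 00:00:00"   # placeholder timestamp
create_saved_job
run_saved_job
```

For periodic imports, only run_saved_job would need to be scheduled, since the saved job carries the last value forward itself.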
Direct-Mode Imports
Sqoop's architecture allows it to choose from multiple available strategies for performing
an import. Most databases will use the DataDrivenDBInputFormat-based approach
described earlier. Some databases, however, offer specific tools designed to extract data
quickly. For example, MySQL's mysqldump application can read from a table with
greater throughput than a JDBC channel. The use of these external tools is referred to as
direct mode in Sqoop's documentation. Direct mode must be specifically enabled by the
user (via the --direct argument), as it is not as general purpose as the JDBC approach.
(For example, MySQL's direct mode cannot handle large objects, such as CLOB or BLOB
columns, and that's why Sqoop needs to use a JDBC-specific API to load these columns
into HDFS.)
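As a small illustration of opting in to direct mode, the sketch below adds --direct only for tables that are known to have no large-object columns. The has_lobs argument is a hypothetical yes/no flag supplied by the caller (Sqoop is not queried for it), and the connection string and table names are placeholders. The commands are printed rather than executed unless SQOOP=sqoop is set.

```shell
# Dry run by default: each command is printed rather than executed.
SQOOP="${SQOOP:-echo sqoop}"

# import_table TABLE HAS_LOBS
# HAS_LOBS is a hypothetical yes/no flag provided by the caller.
import_table() {
  table="$1"
  has_lobs="$2"
  if [ "$has_lobs" = "yes" ]; then
    # CLOB or BLOB columns present: stay on the general-purpose JDBC path.
    $SQOOP import --connect "jdbc:mysql://localhost/hadoopguide" --table "$table"
  else
    # No large objects: opt in to direct mode explicitly.
    $SQOOP import --connect "jdbc:mysql://localhost/hadoopguide" --table "$table" --direct
  fi
}

import_table widgets no
import_table documents yes
```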
For databases that provide such tools, Sqoop can use these to great effect. A direct-mode
import from MySQL is usually much more efficient (in terms of map tasks and time
required) than a comparable JDBC-based import. Sqoop will still launch multiple map tasks
in parallel. These tasks will then spawn instances of the mysqldump program and read
its output. Sqoop can also perform direct-mode imports from PostgreSQL, Oracle, and
Netezza.
Even when direct mode is used to access the contents of a database, the metadata is still
queried through JDBC.
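A direct-mode import that still runs several map tasks might be invoked as in this sketch. The table name and the choice of four mappers are arbitrary placeholders; note that the connection is still given as a JDBC URL, in line with metadata being read through JDBC. As written, the command is only printed unless SQOOP=sqoop is set.

```shell
# Dry run by default: the command is printed rather than executed.
SQOOP="${SQOOP:-echo sqoop}"

# direct_import TABLE MAPPERS
# Requests direct mode with the given number of parallel map tasks.
direct_import() {
  $SQOOP import --connect "jdbc:mysql://localhost/hadoopguide" \
    --table "$1" --direct --num-mappers "$2"
}

direct_import widgets 4
```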