    public void readFields(ResultSet __dbResults) throws SQLException;
    public void write(PreparedStatement __dbStmt) throws SQLException;
JDBC's ResultSet interface provides a cursor that retrieves records from a query; the readFields() method here will populate the fields of the Widget object with the columns from one row of the ResultSet's data. The write() method shown here allows Sqoop to insert new Widget rows into a table, a process called exporting. Exports are discussed in Performing an Export.
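To make the contract concrete, here is a minimal hand-written sketch of what a generated readFields() implementation does: copy each column of the current ResultSet row into a field of the object. The field names, the ReadFieldsDemo class, and the proxy-backed stand-in ResultSet are illustrative assumptions made so the sketch runs without a database; in practice Sqoop generates this code for you and the ResultSet comes from a real query.

```java
import java.lang.reflect.Proxy;
import java.sql.ResultSet;
import java.sql.SQLException;

// Hypothetical hand-written Widget showing the shape of a generated
// readFields(): each column of the current row becomes a field value.
class Widget {
    int id;
    String widgetName;

    public void readFields(ResultSet resultSet) throws SQLException {
        id = resultSet.getInt("id");
        widgetName = resultSet.getString("widget_name");
    }
}

public class ReadFieldsDemo {
    // Stand-in ResultSet built with a dynamic proxy, purely so this
    // sketch runs without a database connection.
    static ResultSet fakeRow(int id, String name) {
        return (ResultSet) Proxy.newProxyInstance(
                ResultSet.class.getClassLoader(),
                new Class<?>[]{ResultSet.class},
                (proxy, method, args) -> {
                    if (method.getName().equals("getInt")) return id;
                    if (method.getName().equals("getString")) return name;
                    return null;
                });
    }

    // Convenience wrapper that hides the checked SQLException.
    static Widget readDemoRow(int id, String name) {
        Widget w = new Widget();
        try {
            w.readFields(fakeRow(id, name));
        } catch (SQLException e) {
            throw new RuntimeException(e);
        }
        return w;
    }

    public static void main(String[] args) {
        Widget w = readDemoRow(7, "sprocket");
        System.out.println(w.id + " " + w.widgetName);
    }
}
```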
The MapReduce job launched by Sqoop uses an InputFormat that can read sections of a table from a database via JDBC. The DataDrivenDBInputFormat provided with Hadoop partitions a query's results over several map tasks.
Reading a table is typically done with a simple query such as:

    SELECT col1, col2, col3, ... FROM tableName
But often, better import performance can be gained by dividing this query across multiple nodes. This is done using a splitting column. Using metadata about the table, Sqoop will guess a good column to use for splitting the table (typically the primary key for the table, if one exists). The minimum and maximum values for the primary key column are retrieved, and then these are used in conjunction with a target number of tasks to determine the queries that each map task should issue.
For example, suppose the widgets table had 100,000 entries, with the id column containing values 0 through 99,999. When importing this table, Sqoop would determine that id is the primary key column for the table. When starting the MapReduce job, the DataDrivenDBInputFormat used to perform the import would issue a statement such as SELECT MIN(id), MAX(id) FROM widgets. These values would then be used to interpolate over the entire range of data. Assuming we specified that five map tasks should run in parallel (with -m 5), this would result in each map task executing queries such as SELECT id, widget_name, ... FROM widgets WHERE id >= 0 AND id < 20000, SELECT id, widget_name, ... FROM widgets WHERE id >= 20000 AND id < 40000, and so on.
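The arithmetic behind those per-task queries can be sketched as follows. SplitCalculator and splitQueries are hypothetical names invented for this sketch; the real DataDrivenDBInputFormat logic is more involved (it supports non-integer splitting columns, text splits, and so on), but the core idea of dividing [MIN, MAX] into numTasks contiguous ranges is the same.

```java
import java.util.ArrayList;
import java.util.List;

public class SplitCalculator {
    // Divide the inclusive key range [min, max] into roughly equal slices,
    // one per map task, and build the bounded query each task would issue.
    public static List<String> splitQueries(long min, long max, int numTasks) {
        List<String> queries = new ArrayList<>();
        long range = max - min + 1;                       // number of key values
        long step = (long) Math.ceil((double) range / numTasks);
        for (long lo = min; lo <= max; lo += step) {
            long hi = Math.min(lo + step, max + 1);       // exclusive upper bound
            queries.add("SELECT id, widget_name FROM widgets WHERE id >= "
                    + lo + " AND id < " + hi);
        }
        return queries;
    }

    public static void main(String[] args) {
        // MIN(id) = 0, MAX(id) = 99999, five map tasks, as in the example above.
        for (String q : splitQueries(0, 99999, 5)) {
            System.out.println(q);
        }
    }
}
```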
The choice of splitting column is essential to parallelizing work efficiently. If the id column were not uniformly distributed (perhaps there are no widgets with IDs between 50,000 and 75,000), then some map tasks might have little or no work to perform, whereas others would have a great deal. Users can specify a particular splitting column when running an import job (via the --split-by argument), to tune the job to the data's actual distribution. If an import job is run as a single (sequential) task with -m 1, this split process is not performed.
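As a usage sketch, an import that pins both the splitting column and the parallelism explicitly might look like the following; --connect, --table, --split-by, and -m are standard Sqoop import options, while the connection URL, host, and database name here are placeholders:

```
sqoop import \
    --connect jdbc:mysql://dbserver/corp \
    --table widgets \
    --split-by id \
    -m 5
```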