public void readFields(ResultSet __dbResults) throws SQLException;
public void write(PreparedStatement __dbStmt) throws SQLException;
JDBC's ResultSet interface provides a cursor that retrieves records from a query; the
readFields() method here will populate the fields of the Widget object with the columns
from one row of the ResultSet's data. The write() method shown here allows Sqoop to
insert new Widget rows into a table, a process called exporting. Exports are discussed in
Performing an Export.
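These two method signatures come from Hadoop's DBWritable interface. As a rough illustration,
the generated bodies look something like the sketch below, which assumes a widgets table whose
first three columns are id, widget_name, and price; the column list and Java types are assumed
here for illustration, and the real generated class covers every column in the table and also
implements Writable.

import java.math.BigDecimal;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class Widget {
    private Integer id;
    private String widget_name;
    private BigDecimal price;

    // Populate this object's fields from the current row of the ResultSet cursor.
    public void readFields(ResultSet __dbResults) throws SQLException {
        this.id = __dbResults.getInt(1);
        this.widget_name = __dbResults.getString(2);
        this.price = __dbResults.getBigDecimal(3);
    }

    // Bind this object's fields to the parameters of an INSERT statement,
    // which is how Sqoop writes rows back to the database during an export.
    public void write(PreparedStatement __dbStmt) throws SQLException {
        __dbStmt.setInt(1, this.id);
        __dbStmt.setString(2, this.widget_name);
        __dbStmt.setBigDecimal(3, this.price);
    }
}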
The MapReduce job launched by Sqoop uses an InputFormat that can read sections of
a table from a database via JDBC. The DataDrivenDBInputFormat provided with
Hadoop partitions a query's results over several map tasks.
Reading a table is typically done with a simple query such as:
SELECT col1,col2,col3,... FROM tableName
But often, better import performance can be gained by dividing this query across multiple
nodes. This is done using a splitting column. Using metadata about the table, Sqoop will
guess a good column to use for splitting the table (typically the primary key for the table,
if one exists). The minimum and maximum values for the primary key column are retrieved,
and then these are used in conjunction with a target number of tasks to determine the
queries that each map task should issue.
For example, suppose the widgets table had 100,000 entries, with the id column containing
values 0 through 99,999. When importing this table, Sqoop would determine that id is the
primary key column for the table. When starting the MapReduce job, the
DataDrivenDBInputFormat used to perform the import would issue a statement such as
SELECT MIN(id), MAX(id) FROM widgets. These values would then be used to interpolate over
the entire range of data. Assuming we specified that five map tasks should run in parallel
(with -m 5), this would result in each map task executing queries such as
SELECT id, widget_name, ... FROM widgets WHERE id >= 0 AND id < 20000,
SELECT id, widget_name, ... FROM widgets WHERE id >= 20000 AND id < 40000, and so on.
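The arithmetic behind those range queries can be sketched in a few lines of standalone Java.
This is not Sqoop's actual splitter (the real DataDrivenDBInputFormat also handles
non-numeric split columns and other edge cases); it simply shows how a minimum, a maximum,
and a target task count yield the WHERE clauses above.

public class SplitSketch {
    public static void main(String[] args) {
        long min = 0;        // result of SELECT MIN(id) FROM widgets
        long max = 99999;    // result of SELECT MAX(id) FROM widgets
        int numSplits = 5;   // target number of map tasks (-m 5)

        // Divide [min, max] into numSplits contiguous, roughly equal ranges.
        long splitSize = (max - min + 1) / numSplits;   // 20,000 in this example
        for (int i = 0; i < numSplits; i++) {
            long lo = min + i * splitSize;
            // The last split absorbs any remainder and covers max itself.
            long hi = (i == numSplits - 1) ? max + 1 : lo + splitSize;
            System.out.printf(
                "SELECT id, widget_name, ... FROM widgets WHERE id >= %d AND id < %d%n",
                lo, hi);
        }
    }
}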
The choice of splitting column is essential to parallelizing work efficiently. If the id
column were not uniformly distributed (perhaps there are no widgets with IDs between
50,000 and 75,000), then some map tasks might have little or no work to perform, whereas
others would have a great deal. Users can specify a particular splitting column when
running an import job (via the --split-by argument), to tune the job to the data's actual
distribution. If an import job is run as a single (sequential) task with -m 1, this split
process is not performed.
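As a concrete illustration, a hypothetical invocation that names the splitting column
explicitly and asks for eight parallel tasks might look like the following (the JDBC
connect string and database name are placeholders):
% sqoop import --connect jdbc:mysql://localhost/hadoopguide \
    --table widgets --split-by id -m 8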