on the local filesystem. Data is then streamed into mysqlimport via the FIFO channel,
and from there into the database.
Whereas most MapReduce jobs reading from HDFS pick the degree of parallelism (number of map tasks) based on the number and size of the files to process, Sqoop's export system allows users explicit control over the number of tasks. The performance of the export can be affected by the number of parallel writers to the database, so Sqoop uses the CombineFileInputFormat class to group the input files into a smaller number of map tasks.
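As a sketch, the degree of parallelism can be set on the command line with Sqoop's -m (--num-mappers) option. The connection string, table, and paths below are illustrative, not from the example in the text:

```shell
# Hypothetical export; connection details, table name, and export
# directory are illustrative. -m sets the number of parallel export
# tasks explicitly, rather than deriving it from the input files.
sqoop export \
  --connect jdbc:mysql://localhost/hadoopguide \
  --table sales \
  --export-dir /user/hive/warehouse/sales \
  -m 4
```

Here four map tasks write to the database in parallel, regardless of how many input files the export directory contains.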
Exports and Transactionality
Due to the parallel nature of the process, an export is often not an atomic operation. Sqoop
spawns multiple tasks to export slices of the data in parallel. These tasks can complete
at different times, meaning that even though transactions are used inside tasks, results
from one task may be visible before the results of another task. Moreover, databases often
use fixed-size buffers to store transactions. As a result, one transaction cannot necessarily
contain the entire set of operations performed by a task. Sqoop commits results every few
thousand rows, to ensure that it does not run out of memory. These intermediate results
are visible while the export continues. Applications that will use the results of an export
should not be started until the export process is complete, or they may see partial results.
To solve this problem, Sqoop can export to a temporary staging table and then, at the end
of the job — if the export has succeeded — move the staged data into the destination table
in a single transaction. You can specify a staging table with the --staging-table option. The staging table must already exist and have the same schema as the destination. It must also be empty, unless the --clear-staging-table option is also supplied.
NOTE
Using a staging table is slower, since the data must be written twice: first to the staging table, then to the
destination table. The export process also uses more space while it is running, since there are two copies
of the data while the staged data is being copied to the destination.
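A staged export can be sketched as follows; the staging table name and other details are illustrative:

```shell
# Hypothetical staged export; table names and paths are illustrative.
# Rows are first written to sales_stage by the parallel tasks, then
# moved into sales in a single transaction once all tasks succeed.
# --clear-staging-table empties the staging table before the export.
sqoop export \
  --connect jdbc:mysql://localhost/hadoopguide \
  --table sales \
  --staging-table sales_stage \
  --clear-staging-table \
  --export-dir /user/hive/warehouse/sales
```

If any task fails, the destination table is left untouched, since the staged data is never moved.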
Exports and SequenceFiles
The example export reads source data from a Hive table, which is stored in HDFS as a delimited text file. Sqoop can also export delimited text files that were not Hive tables. For example, it can export text files that are the output of a MapReduce job.
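Exporting such plain MapReduce output might look like the following sketch; the paths, table name, and delimiter are assumptions for illustration:

```shell
# Hypothetical export of tab-delimited MapReduce output; names are
# illustrative. --input-fields-terminated-by tells Sqoop how the
# fields in the source text files are delimited.
sqoop export \
  --connect jdbc:mysql://localhost/hadoopguide \
  --table logcounts \
  --export-dir /output/logcounts \
  --input-fields-terminated-by '\t'
```

The delimiter options must match how the files were written, since Sqoop parses each line into columns before generating the database writes.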