on the local filesystem. Data is then streamed into mysqlimport via the FIFO channel,
and from there into the database.
Whereas most MapReduce jobs reading from HDFS pick the degree of parallelism (number of map tasks) based on the number and size of the files to process, Sqoop's export system allows users explicit control over the number of tasks. The performance of the export can be affected by the number of parallel writers to the database, so Sqoop uses the CombineFileInputFormat class to group the input files into a smaller number of map tasks.
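As a sketch, the degree of parallelism can be set on the command line with Sqoop's -m (--num-mappers) option. The connection string, table, and paths below are illustrative, not from the example in the text:

```shell
# Hypothetical export; connection details, table name, and export
# directory are illustrative. -m sets the number of parallel export
# tasks explicitly, rather than deriving it from the input files.
sqoop export \
  --connect jdbc:mysql://localhost/hadoopguide \
  --table sales \
  --export-dir /user/hive/warehouse/sales \
  -m 4
```

Here four map tasks write to the database in parallel, regardless of how many input files the export directory contains.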
Exports and Transactionality
Due to the parallel nature of the process, an export is often not an atomic operation. Sqoop
spawns multiple tasks to export slices of the data in parallel. These tasks can complete
at different times, meaning that even though transactions are used inside tasks, results
from one task may be visible before the results of another task. Moreover, databases often
use fixed-size buffers to store transactions. As a result, one transaction cannot necessarily
contain the entire set of operations performed by a task. Sqoop commits results every few
thousand rows, to ensure that it does not run out of memory. These intermediate results
are visible while the export continues. Applications that will use the results of an export
should not be started until the export process is complete, or they may see partial results.
To solve this problem, Sqoop can export to a temporary staging table and then, at the end
of the job — if the export has succeeded — move the staged data into the destination table
in a single transaction. You can specify a staging table with the --staging-table option. The staging table must already exist and have the same schema as the destination. It must also be empty, unless the --clear-staging-table option is also supplied.
NOTE
Using a staging table is slower, since the data must be written twice: first to the staging table, then to the
destination table. The export process also uses more space while it is running, since there are two copies
of the data while the staged data is being copied to the destination.
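A staged export can be sketched as follows; the staging table name and other details are illustrative:

```shell
# Hypothetical staged export; table names and paths are illustrative.
# Rows are first written to sales_stage by the parallel tasks, then
# moved into sales in a single transaction once all tasks succeed.
# --clear-staging-table empties the staging table before the export.
sqoop export \
  --connect jdbc:mysql://localhost/hadoopguide \
  --table sales \
  --staging-table sales_stage \
  --clear-staging-table \
  --export-dir /user/hive/warehouse/sales
```

If any task fails, the destination table is left untouched, since the staged data is never moved.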
Exports and SequenceFiles
The example export reads source data from a Hive table, which is stored in HDFS as a delimited text file. Sqoop can also export delimited text files that were not Hive tables. For example, it can export text files that are the output of a MapReduce job.
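Exporting such plain MapReduce output might look like the following sketch; the paths, table name, and delimiter are assumptions for illustration:

```shell
# Hypothetical export of tab-delimited MapReduce output; names are
# illustrative. --input-fields-terminated-by tells Sqoop how the
# fields in the source text files are delimited.
sqoop export \
  --connect jdbc:mysql://localhost/hadoopguide \
  --table logcounts \
  --export-dir /output/logcounts \
  --input-fields-terminated-by '\t'
```

The delimiter options must match how the files were written, since Sqoop parses each line into columns before generating the database writes.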