Effective Big Data ETL with SSIS, Pig, and Sqoop - Microsoft Big Data Solutions


The --table argument indicates the relational table that will be populated from Hadoop.

Alternatively, you can use the --call argument to indicate that a stored

procedure should be called for each row of information found in the Hadoop

system.

If you do not specify the --call argument, by default Sqoop generates

an INSERT statement for each record found in the Hadoop directory. By

specifying the --update-key argument and indicating a key column or

columns, you can modify this behavior to generate UPDATE statements

rather than INSERTs. You can set the --update-mode argument to allowinsert to

indicate that rows that don't already exist in the target table should be inserted,

and rows that do exist should be updated:

sqoop export --connect "jdbc:sqlserver://Your_SqlServer;database=MsBigData;Username=demo;Password=your_password;" \
  --table Customers \
  --export-dir /MsBigData/Customers \
  --update-key ID --update-mode allowinsert
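The difference between the three export behaviors described above can be sketched outside Sqoop. The following is a minimal Python model using an in-memory SQLite table, not Sqoop's own implementation: the function name export_rows, the helper make_demo_db, and the Name column are hypothetical, and the mode names mirror the values accepted by --update-mode (updateonly and allowinsert), with "insert" standing in for an export that has no --update-key.

```python
import sqlite3


def make_demo_db():
    """Create an in-memory Customers table holding one existing row (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Customers (ID INTEGER PRIMARY KEY, Name TEXT)")
    conn.execute("INSERT INTO Customers (ID, Name) VALUES (1, 'Old name')")
    conn.commit()
    return conn


def export_rows(conn, rows, mode="insert"):
    """Apply rows to Customers, keyed on ID, in one of three modes.

    insert      - every record becomes an INSERT (no update key given)
    updateonly  - every record becomes an UPDATE; unmatched rows are skipped
    allowinsert - UPDATE when the key exists, INSERT when it does not
    """
    if mode not in ("insert", "updateonly", "allowinsert"):
        raise ValueError("unknown mode: " + mode)
    cur = conn.cursor()
    for row in rows:
        if mode == "insert":
            cur.execute(
                "INSERT INTO Customers (ID, Name) VALUES (?, ?)",
                (row["ID"], row["Name"]),
            )
            continue
        # The key column goes in the WHERE clause of the UPDATE.
        cur.execute(
            "UPDATE Customers SET Name = ? WHERE ID = ?",
            (row["Name"], row["ID"]),
        )
        if cur.rowcount == 0 and mode == "allowinsert":
            # No existing row matched the key, so insert it instead.
            cur.execute(
                "INSERT INTO Customers (ID, Name) VALUES (?, ?)",
                (row["ID"], row["Name"]),
            )
    conn.commit()
```

With the demo table, exporting records for IDs 1 and 2 in updateonly mode changes only the existing row for ID 1, while allowinsert also adds the row for ID 2.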

Exports done using Sqoop commit to the target database every 10,000 rows.

This prevents excessive resources from being tied up on the database server

managing large transactions. However, it does mean that the exports are not

atomic and that a failure during execution may leave a partial set of rows in

the target database.
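To illustrate why periodic commits make an export non-atomic, here is a small Python sketch against an in-memory SQLite table. It is a single-connection model under stated assumptions, not Sqoop's code: the function name export_in_batches and the Name column are hypothetical, and the batch size is a parameter so the effect is visible with a handful of rows.

```python
import sqlite3


def export_in_batches(conn, rows, batch_size):
    """Insert (ID, Name) rows, committing after every batch_size rows.

    If a row fails, only the uncommitted batch is rolled back; batches
    that were already committed stay in the table.
    """
    cur = conn.cursor()
    pending = 0
    try:
        for row in rows:
            cur.execute("INSERT INTO Customers (ID, Name) VALUES (?, ?)", row)
            pending += 1
            if pending == batch_size:
                conn.commit()  # earlier rows are now permanent
                pending = 0
        conn.commit()  # commit the final, possibly short, batch
    except sqlite3.Error:
        conn.rollback()  # undoes only the open batch
        raise
```

For example, with a batch size of 3 and a duplicate key in the eighth record, the first six rows remain in the table after the failure, so the caller has to clean up or make the export safe to re-run.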

The -m argument controls the amount of parallel activity, just as it does

with the import. The same warnings and caveats apply to its use with export.

Particularly in the case of exports, because Sqoop does its operations on a

row-by-row basis, running a large number of parallel nodes can have a very

negative impact on the target database.
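One way to keep the degree of parallelism explicit is to assemble the export command in a script rather than typing it by hand. The sketch below is a hypothetical Python helper (build_export_command is not part of Sqoop) that builds an argument list from the options discussed in this section and defaults to a single mapper; the connection string in the usage note is a placeholder.

```python
import shlex


def build_export_command(connect, table, export_dir, num_mappers=1,
                         update_key=None, update_mode=None):
    """Return the argv list for a sqoop export with an explicit mapper count."""
    if num_mappers < 1:
        raise ValueError("num_mappers must be at least 1")
    args = [
        "sqoop", "export",
        "--connect", connect,
        "--table", table,
        "--export-dir", export_dir,
        "-m", str(num_mappers),  # number of parallel map tasks
    ]
    if update_key:
        args += ["--update-key", update_key]
    if update_mode:
        args += ["--update-mode", update_mode]
    return args


if __name__ == "__main__":
    cmd = build_export_command(
        "jdbc:sqlserver://Your_SqlServer;database=MsBigData;Username=demo;Password=your_password;",
        "Customers",
        "/MsBigData/Customers",
        num_mappers=2,
        update_key="ID",
        update_mode="allowinsert",
    )
    # shlex.join quotes the connection string so the line can be pasted into a shell.
    print(shlex.join(cmd))
```

Starting from a low mapper count and raising it only while the target database keeps up is a cautious default, given the row-by-row nature of the export.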

Sqoop is a useful tool for quickly moving data in and out of Hadoop,

especially if it is a one-time operation or the performance is not

particularly important.
