to split up the data to process it in parallel, you must also provide some
additional information:
--query 'SELECT customer.*, sales.*
FROM customer
JOIN sales ON (customer.id = sales.customerId)
WHERE $CONDITIONS'
--split-by customer.id
The WHERE $CONDITIONS portion of the query provides a placeholder for
the criteria Sqoop uses to split up processing. The --split-by argument
tells Sqoop which column to use when determining how to split up the data
from the input query. By default, if the import is referencing a table instead
of a query, the table's primary key is used as the split column.
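Put together, a complete free-form query import might look like the following sketch. The connection string matches the SQL Server example shown later in this section, while the /data/customer_sales target directory is an illustrative placeholder; note that --target-dir is required whenever --query is used:
# Free-form query import; $CONDITIONS is single-quoted so the shell passes it through to Sqoop
sqoop import \
  --connect "jdbc:sqlserver://Your_SqlServer;database=MsBigData;Username=demo;Password=your_password;" \
  --query 'SELECT customer.*, sales.*
           FROM customer
           JOIN sales ON (customer.id = sales.customerId)
           WHERE $CONDITIONS' \
  --split-by customer.id \
  --target-dir /data/customer_sales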
The -m (or --num-mappers) argument controls how many parallel map tasks Sqoop creates. The default value is 4. Setting it to 1, as in this example, means the import runs as a single map task.
WARNING
Although increasing the number of parallel map tasks can improve performance, you must be careful not to increase it too far. Raising -m past the number of nodes in your cluster will adversely affect performance, and the more parallel tasks you run, the greater the load on the source database server.
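As a sketch, a plain table import spread across eight map tasks (an illustrative figure, not a recommendation) could look like this; because the import references a table rather than a query, the table's primary key is used as the split column by default:
# Table import using eight parallel map tasks
sqoop import \
  --connect "jdbc:sqlserver://Your_SqlServer;database=MsBigData;Username=demo;Password=your_password;" \
  --table Customers \
  -m 8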
Finally, the --target-dir argument determines which directory the data is written to on the Hadoop system. You can control whether new data is appended to an existing directory by using the --append argument. You can also import into Hive rather than a plain directory by specifying the --hive-import and --hive-table arguments:
sqoop import \
  --connect "jdbc:sqlserver://Your_SqlServer;database=MsBigData;Username=demo;Password=your_password;" \
  --table Customers
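Put together, a complete Hive import might look like the following sketch; the Hive table name Customers is an assumption here (if --hive-table is omitted, Sqoop defaults to the source table name):
# Import the Customers table directly into a Hive table
sqoop import \
  --connect "jdbc:sqlserver://Your_SqlServer;database=MsBigData;Username=demo;Password=your_password;" \
  --table Customers \
  --hive-import \
  --hive-table Customers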