to split up the data to process it in parallel, you must also provide some
additional information:
--query 'SELECT customer.*, sales.*
FROM customer
JOIN sales ON (customer.id = sales.customerId)
WHERE $CONDITIONS'
--split-by customer.id
The WHERE $CONDITIONS portion of the query provides a placeholder for
the criteria Sqoop uses to split up processing. The --split-by argument
tells Sqoop which column to use when determining how to split up the data
from the input query. By default, if the import is referencing a table instead
of a query, the table's primary key is used as the split column.
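Put together, a complete free-form query import might look like the following sketch. The connection string matches the SQL Server example shown later in this section, while the /data/customer_sales target directory is an illustrative placeholder; note that --target-dir is required whenever --query is used:
# Free-form query import; $CONDITIONS is single-quoted so the shell passes it through to Sqoop
sqoop import \
  --connect "jdbc:sqlserver://Your_SqlServer;database=MsBigData;Username=demo;Password=your_password;" \
  --query 'SELECT customer.*, sales.*
           FROM customer
           JOIN sales ON (customer.id = sales.customerId)
           WHERE $CONDITIONS' \
  --split-by customer.id \
  --target-dir /data/customer_sales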
The -m (or --num-mappers) argument controls how many parallel map tasks Sqoop creates. The default value is 4. Setting it to 1, as in this example, means the import runs as a single map task.
WARNING
Although increasing the number of parallel map tasks can improve performance, you must be careful not to increase it too far. Raising -m past the number of nodes in your cluster will adversely affect performance, and the more parallel tasks you run, the greater the load on the source database server.
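As a sketch, a plain table import spread across eight map tasks (an illustrative figure, not a recommendation) could look like this; because the import references a table rather than a query, the table's primary key is used as the split column by default:
# Table import using eight parallel map tasks
sqoop import \
  --connect "jdbc:sqlserver://Your_SqlServer;database=MsBigData;Username=demo;Password=your_password;" \
  --table Customers \
  -m 8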
Finally, the --target-dir argument determines which directory the data is written to on the Hadoop system. You can control whether new data is appended to an existing directory by using the --append argument. You can also import into Hive rather than a plain directory by specifying the --hive-import and --hive-table arguments:
sqoop import \
  --connect "jdbc:sqlserver://Your_SqlServer;database=MsBigData;Username=demo;Password=your_password;" \
  --table Customers
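Put together, a complete Hive import might look like the following sketch; the Hive table name Customers is an assumption here (if --hive-table is omitted, Sqoop defaults to the source table name):
# Import the Customers table directly into a Hive table
sqoop import \
  --connect "jdbc:sqlserver://Your_SqlServer;database=MsBigData;Username=demo;Password=your_password;" \
  --table Customers \
  --hive-import \
  --hive-table Customers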