• Executable: C:\Sysinternals\PsExec.exe
• Arguments:
\\Your_Hadoop_Server C:\hdp\hadoop\hadoop-1.2.0.1.3.0.0-0380\bin\hadoop.cmd
dfs -put
\\CommonNetworkLocation\LandingZone\Customer1.txt
/user/MsBigData/Customer1.txt
The Execute Process task can be configured to use expressions to make this
process more dynamic. In addition, if you are moving multiple files, you can
place the task inside a Foreach Loop container in SSIS to repeat the process
once for each file.
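To illustrate what a dynamic Arguments value might look like when several files are moved, here is a minimal Python sketch that builds one PsExec argument string per file. It is not SSIS expression syntax. The function name build_put_arguments and the second file, Customer2.txt, are hypothetical and used only for illustration.

```python
def build_put_arguments(server, hadoop_cmd, source_path, hdfs_dir):
    """Return the PsExec argument string that copies one file into HDFS."""
    # The source is a UNC path, so the file name follows the last backslash.
    file_name = source_path.rsplit("\\", 1)[-1]
    target = hdfs_dir.rstrip("/") + "/" + file_name
    parts = ["\\\\" + server, hadoop_cmd, "dfs", "-put", source_path, target]
    return " ".join(parts)


HADOOP_CMD = r"C:\hdp\hadoop\hadoop-1.2.0.1.3.0.0-0380\bin\hadoop.cmd"

# Customer2.txt is a hypothetical second file, added to show the loop.
files = [
    r"\\CommonNetworkLocation\LandingZone\Customer1.txt",
    r"\\CommonNetworkLocation\LandingZone\Customer2.txt",
]

# One argument string per file, as a per-file loop iteration would produce.
all_arguments = [
    build_put_arguments("Your_Hadoop_Server", HADOOP_CMD, path, "/user/MsBigData")
    for path in files
]

for arguments in all_arguments:
    print(arguments)
```

In a package, the same idea would be expressed by mapping the current file name to a variable and using it in an expression on the task's Arguments property.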
Getting the Best Performance from SSIS
As touched on earlier, one way to improve SSIS performance with big data
is to minimize the amount of data that SSIS actually has to process. When
querying from Hive, retrieve only the rows and columns you actually need,
and let Hive do the filtering before the data reaches SSIS.
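As a small illustration of this advice, the Python sketch below assembles a query that names its columns explicitly and applies a filter, rather than selecting everything. The helper build_hive_query and the table and column names (customer, name, city, state) are hypothetical. The resulting text is the kind of statement you might supply as the query for an ODBC source.

```python
def build_hive_query(table, columns, where=None):
    """Build a SELECT that names only the needed columns and rows."""
    if not columns:
        # Refuse to fall back to SELECT *, which pulls every column.
        raise ValueError("list the columns you need instead of selecting all")
    query = "SELECT " + ", ".join(columns) + " FROM " + table
    if where:
        # The filter runs in Hive, so fewer rows travel to SSIS.
        query += " WHERE " + where
    return query


# Hypothetical table and columns, for illustration only.
narrow_query = build_hive_query("customer", ["name", "city"], "state = 'WA'")
print(narrow_query)
```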
Another way to improve performance in SSIS is to increase parallelism.
This has the most benefit when you are writing to Hadoop. If you
set up multiple, parallel data flows, all producing data files, you can invoke
multiple dfs -put commands simultaneously to move the data files into
the Hadoop file system. This takes advantage of the Hadoop capability to
scale out across multiple nodes.
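The idea of issuing several dfs -put commands at once can be sketched outside SSIS as well. The following Python example is only an illustration under stated assumptions: it assumes a hadoop launcher is available on the PATH of the machine running the script, and the function names put_file and put_files_in_parallel are hypothetical. In SSIS itself, the equivalent is several Execute Process tasks that are not connected by precedence constraints.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

# Assumption: a `hadoop` launcher is on PATH; adjust for your installation.
HADOOP = "hadoop"


def put_file(local_path, hdfs_path, run=subprocess.run):
    """Copy one file into HDFS and return the command's exit code."""
    command = [HADOOP, "dfs", "-put", local_path, hdfs_path]
    return run(command).returncode


def put_files_in_parallel(pairs, max_workers=4, run=subprocess.run):
    """Run one dfs -put per (local, hdfs) pair, several at a time."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [
            pool.submit(put_file, local, hdfs, run) for local, hdfs in pairs
        ]
        # Results come back in the same order as the input pairs.
        return [future.result() for future in futures]
```

A call such as put_files_in_parallel([(local_path, hdfs_path), ...]) returns one exit code per file, so a nonzero entry identifies which upload failed. The run parameter exists so the command runner can be replaced when testing without a cluster.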
Increasing parallelism for packages reading from Hive can have mixed
results. You get a certain amount of parallelism when you query from Hive
in the first place because it spreads the processing out across the cluster.
You can attempt to run multiple queries using different ODBC source
components in SSIS simultaneously, but generally it works better to issue a
single query and let Hive determine how much parallelism to use.
SSIS is a good way to interact with Hadoop, particularly for querying
information. It's also a familiar tool to those in the SQL Server space. Thanks
to the number of sources and destinations it supports, it can prove very
useful when integrating your big data with the rest of your organization.