If your SSIS environment has the Hadoop client tools installed, loading data
into Hadoop can be as simple as calling the dfs -put command from an
Execute Process task:
hadoop dfs -put
\\HDP1-3\LandingZone\MsBigData\Customer1.txt
/user/MsBigData/Customer1.txt
This moves the file from the local file system to the distributed file system.
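In the Execute Process task, this command is typically split between the Executable and Arguments properties. The following is only a rough sketch; the install path of hadoop.cmd is an assumption and will vary by environment:
Executable: C:\hadoop\bin\hadoop.cmd
Arguments:  dfs -put \\HDP1-3\LandingZone\MsBigData\Customer1.txt /user/MsBigData/Customer1.txt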
However, it can be a little more complex if you do not have a Hadoop
installation on your SSIS server. In this case, you need a way to execute the
dfs -put command on the remote server.
Fortunately, several tools enable you to execute remote processes. The
appropriate tool depends on what operating system is running on your
Hadoop cluster. If the cluster is running Linux, you can use SSH (Secure Shell)
to execute the remote process. To run this from your SSIS
package, you can install a tool called PuTTY on your SSIS server. This tool
enables you to run SSH commands on the remote computer from an Execute
Process task.
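The PuTTY suite includes a command-line client, plink.exe, that you can call from the Execute Process task. The example below is only a sketch: the host name, user account, and Linux-side file path are placeholders, and the file must already be staged on a path the Linux host can read:
plink.exe -ssh hadoopuser@hdp-namenode "hadoop dfs -put /mnt/landingzone/Customer1.txt /user/MsBigData/Customer1.txt"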
If your Hadoop environment is hosted on a Windows platform using the
Hortonworks distribution, you can use PsExec, a tool from Microsoft that
enables you to execute remote processes on other servers. To use this in
SSIS, you call it from an Execute Process task.
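As an illustrative sketch, a PsExec call from the Execute Process task might look like the following; the server name, account, password, and local landing-zone path are placeholders for your own values:
psexec.exe \\HDP1-3 -u CONTOSO\hadoopsvc -p <password> cmd /c "hadoop dfs -put C:\LandingZone\MsBigData\Customer1.txt /user/MsBigData/Customer1.txt"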
NOTE
Security issues are one of the more common challenges when using
PsExec. Make sure that the command line you are sending to
PsExec is valid by testing it on the target computer first. Then ensure
the user account you are running the PsExec command under has
permissions to run the executable on the remote computer. One easy
way to do this is to log in to the target computer as the specified user
and run the executable. Finally, ensure that the account running the
package matches the account you tested with.
Setting up a package to implement this process is relatively straightforward.
You set up a data flow task as normal, with a source component retrieving