data from your choice of sources. Any transformations the data requires can
be applied along the way. As the last step of the data flow, the data needs
to be written to a file. The file's format is determined by what the Hive
system expects. The easiest format to work with from SSIS is a delimited
format, with carriage return/line feed pairs delimiting rows and a column
delimiter such as a comma (,) or vertical bar (|) separating column values.
The SSIS Flat File Destination is designed to write these types of files.
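For example, a three-column export delimited with vertical bars might look
like the following (the table layout and the values are purely illustrative):

1001|Contoso|2012-03-15
1002|Fabrikam|2012-03-16

The matching Hive table would then be declared with FIELDS TERMINATED BY '|'
so that the columns line up when the file is loaded.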
NOTE
The default Hive column delimiter for flat files is Ctrl-A (0x01).
Unfortunately, SSIS doesn't support this delimiter. If at all possible,
use a column delimiter that SSIS supports. If you must use a
non-standard column delimiter, you will need to add a post-processing
step to your package to translate the column delimiters after the file is
produced.
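As a sketch of what that post-processing step might look like, the following
small console utility, which could be invoked from an Execute Process task,
rewrites a pipe-delimited file using Ctrl-A delimiters. The file paths are
passed as arguments and are entirely hypothetical:

using System;
using System.IO;

// Swaps an SSIS-friendly column delimiter (|) for Hive's default
// Ctrl-A (\u0001). Usage: DelimiterFixup.exe <input> <output>
class DelimiterFixup
{
    static void Main(string[] args)
    {
        string inputPath  = args[0];   // e.g. C:\exports\orders.txt
        string outputPath = args[1];   // e.g. C:\exports\orders_hive.txt

        using (var reader = new StreamReader(inputPath))
        using (var writer = new StreamWriter(outputPath))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                // Replace the column delimiter; row delimiters are untouched.
                writer.WriteLine(line.Replace('|', '\u0001'));
            }
        }
    }
}

Reading and writing line by line keeps memory use flat even when the export
file is large.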
NOTE
If Hive is expecting another format (see Chapter 6 for some of the
possibilities), you might need to implement a custom destination using
a script component. Although a full description of this is beyond the
scope of this chapter, a custom destination lets you fully control the
format of the file produced, so you can match anything that Hive is
expecting.
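As a rough illustration, the skeleton of such a script component destination
might look like the following. The UserComponent base class and the
Input0Buffer type are generated for you by the SSIS designer; the column
names (OrderId, Amount) and the output path are assumptions for this sketch:

using System.IO;

public class ScriptMain : UserComponent
{
    private StreamWriter writer;

    public override void PreExecute()
    {
        base.PreExecute();
        // Hypothetical output location; in a real package you would
        // typically read this from a variable or connection manager.
        writer = new StreamWriter(@"\\fileshare\exports\orders.txt");
        writer.NewLine = "\n";   // match Hive's default row terminator
    }

    public override void Input0_ProcessInputRow(Input0Buffer Row)
    {
        // Emit each row in exactly the layout Hive expects --
        // here, Ctrl-A between the columns.
        writer.Write(Row.OrderId);
        writer.Write('\u0001');
        writer.WriteLine(Row.Amount);
    }

    public override void PostExecute()
    {
        writer.Close();
        base.PostExecute();
    }
}

Because you control every byte written, the same pattern extends to any
delimiter or record layout Hive can be configured to read.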
Once the file is produced, you can use a File System task to copy it to
a network location that is accessible to both your SSIS server and your
Hadoop cluster. The next step is to call the process that copies the file
into HDFS. This is done through an Execute Process task. Assuming that you
are executing the Hadoop copy on a remote system using PsExec, you configure
the task with the following property settings. (You might need to adjust your
file locations):
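A configuration along these lines, in which the server name, credentials,
and paths are all placeholders, might look like this:

Executable:        C:\Tools\PsExec.exe
Arguments:         \\hadoop01 -u HADOOPDOM\loader -p ********
                   hadoop fs -put \\fileshare\exports\orders.txt
                   /data/orders/orders.txt
WorkingDirectory:  C:\Tools

Here PsExec runs the hadoop fs -put command on the remote node hadoop01,
which copies the staged file from the network share into the HDFS directory
/data/orders.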