Avoid writing single rows or small numbers of rows to Hadoop each time. Remember that, by
default, Hadoop uses a 64MB block size for files, and it functions best when
the file size exceeds the block size. If you need to process smaller numbers
of rows, consider storing them in a temporary table in SQL Server or a
temporary file, and write them to Hadoop only when the data size is large
enough to make the operation efficient.
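As a minimal sketch of this batching pattern (the table name dbo.HadoopStaging and the size check are hypothetical, not from the text), you might accumulate rows in a SQL Server staging table and export them only once the batch approaches the block size:

    -- Hypothetical staging table that accumulates rows until a batch is
    -- large enough to be worth writing to Hadoop as a single file.
    CREATE TABLE dbo.HadoopStaging (
        RowId     BIGINT IDENTITY(1,1) PRIMARY KEY,
        EventTime DATETIME2 NOT NULL,
        Payload   NVARCHAR(MAX) NOT NULL
    );

    -- Periodically check whether the staged batch is big enough; the
    -- threshold mirrors the default block size mentioned above.
    SELECT SUM(DATALENGTH(Payload)) AS StagedBytes
    FROM dbo.HadoopStaging;
    -- When StagedBytes reaches roughly 64 * 1024 * 1024, export the batch
    -- to Hadoop (for example, through an SSIS data flow) and then run:
    -- TRUNCATE TABLE dbo.HadoopStaging;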
When writing data back to SQL Server, you generally want to make sure that
the data is aggregated. This will let you write a smaller amount of data to the
SQL Server environment. Another concern when writing data to SQL Server
is how much parallel write activity you want to allow. Depending on your
method of doing the transfer, you could enable writing to the SQL Server
from a large number of Hadoop nodes. SQL Server can handle parallel
clients inserting data, but having too many parallel streams of insert activity
can actually slow down the overall process. Finding the right amount of
parallelism can involve some tuning, and requires you to understand the
other workloads running on your SQL Server at the same time. Fortunately,
you can control the amount of parallelism when moving data to SQL Server
in a number of ways, which we cover for each technology.
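To make the aggregation advice concrete, here is a small HiveQL sketch (the table and column names, such as clicks and clicks_daily, are assumptions for illustration) that collapses detail rows into daily summaries inside Hadoop, so only the much smaller result set travels to SQL Server:

    -- Aggregate raw click rows into one row per page per day before the
    -- transfer, shrinking the data written to SQL Server.
    INSERT OVERWRITE TABLE clicks_daily
    SELECT
        to_date(click_time) AS click_date,
        page_id,
        COUNT(*)            AS click_count
    FROM clicks
    GROUP BY to_date(click_time), page_id;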
Working with SSIS and Hive
SSIS doesn't currently support direct connectivity to Hadoop. However,
using Hive and Open Database Connectivity (ODBC), you can leverage data
in your Hadoop system from SSIS. This involves a few steps:
1. Making sure that Hive is configured properly
2. Verifying that you can access Hive from the computer running SSIS
3. Verifying that the data you want to access in Hadoop has a table defined
for it in Hive (an example definition follows this list)
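As an illustration of step 3 (the file layout, location, and names here are assumptions rather than anything from the text), a Hive table can be laid over files that already exist in HDFS by using an external table, so no data is copied:

    -- Hypothetical external table that maps tab-delimited files already
    -- sitting in HDFS, making them queryable through Hive and ODBC.
    CREATE EXTERNAL TABLE clicks (
        click_time STRING,
        page_id    INT,
        user_id    INT
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    STORED AS TEXTFILE
    LOCATION '/data/clicks';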
After going through this setup, you gain the ability to query your Hadoop
data in SSIS (and other tools) as if it resides in a relational database. This
offers the lowest-friction approach to using your Hadoop data in SSIS.
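Once that is in place, the statement an SSIS ODBC source sends to Hive is ordinary SQL; for example, against the hypothetical clicks table defined above:

    -- A query an SSIS ODBC Source component might issue against Hive.
    SELECT page_id, COUNT(*) AS click_count
    FROM clicks
    GROUP BY page_id;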
Writing data to Hadoop from SSIS is a little more challenging. In Chapter
6, “Adding Structure with Hive,” we discussed that Hive only supports bulk
insert operations, in keeping with the Hadoop approach of “write-once”
large files. Unfortunately, Hive uses some nonstandard SQL to handle these
bulk inserts, and the available ODBC drivers don't fully support it.
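For reference, the bulk-load statements in question look like the following HiveQL (the path and table names are illustrative); it is syntax such as this that the ODBC drivers do not fully pass through:

    -- Move an existing HDFS file into a Hive table in one bulk operation.
    LOAD DATA INPATH '/staging/clicks_batch.tsv' INTO TABLE clicks;

    -- Hive's other bulk form populates a table from a query rather than
    -- inserting individual rows:
    INSERT OVERWRITE TABLE clicks SELECT * FROM clicks_staging;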