Avoid writing single rows or small numbers of rows to Hadoop each time. Remember that, by
default, Hadoop uses a 64MB block size for files, and it functions best when
the file size exceeds the block size. If you need to process smaller numbers
of rows, consider storing them in a temporary table in SQL Server or a
temporary file, and write them to Hadoop only when the data size is large
enough to make the operation efficient.
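As a minimal sketch of this batching pattern (the table name dbo.HadoopStaging and the size check are hypothetical, not from the text), you might accumulate rows in a SQL Server staging table and export them only once the batch approaches the block size:

    -- Hypothetical staging table that accumulates rows until a batch is
    -- large enough to be worth writing to Hadoop as a single file.
    CREATE TABLE dbo.HadoopStaging (
        RowId     BIGINT IDENTITY(1,1) PRIMARY KEY,
        EventTime DATETIME2 NOT NULL,
        Payload   NVARCHAR(MAX) NOT NULL
    );

    -- Periodically check whether the staged batch is big enough; the
    -- threshold mirrors the default block size mentioned above.
    SELECT SUM(DATALENGTH(Payload)) AS StagedBytes
    FROM dbo.HadoopStaging;
    -- When StagedBytes reaches roughly 64 * 1024 * 1024, export the batch
    -- to Hadoop (for example, through an SSIS data flow) and then run:
    -- TRUNCATE TABLE dbo.HadoopStaging;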
When writing data back to SQL Server, you generally want to make sure that
the data is aggregated. This will let you write a smaller amount of data to the
SQL Server environment. Another concern when writing data to SQL Server
is how much parallel write activity you want to allow. Depending on your
method of doing the transfer, you could enable writing to the SQL Server
from a large number of Hadoop nodes. SQL Server can handle parallel
clients inserting data, but having too many parallel streams of insert activity
can actually slow down the overall process. Finding the right amount of
parallelism can involve some tuning, and requires you to understand the
other workloads running on your SQL Server at the same time. Fortunately,
you can control the amount of parallelism when moving data to SQL Server
in a number of ways, which we cover for each technology.
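To make the aggregation advice concrete, here is a small HiveQL sketch (the table and column names, such as clicks and clicks_daily, are assumptions for illustration) that collapses detail rows into daily summaries inside Hadoop, so only the much smaller result set travels to SQL Server:

    -- Aggregate raw click rows into one row per page per day before the
    -- transfer, shrinking the data written to SQL Server.
    INSERT OVERWRITE TABLE clicks_daily
    SELECT
        to_date(click_time) AS click_date,
        page_id,
        COUNT(*)            AS click_count
    FROM clicks
    GROUP BY to_date(click_time), page_id;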
Working with SSIS and Hive
SSIS doesn't currently support direct connectivity to Hadoop. However,
using Hive and Open Database Connectivity (ODBC), you can leverage data
in your Hadoop system from SSIS. This involves a few steps:
1. Making sure that Hive is configured properly
2. Verifying that you can access Hive from the computer running SSIS
3. Verifying that the data you want to access in Hadoop has a table defined
for it in Hive (an example definition follows this list)
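As an illustration of step 3 (the file layout, location, and names here are assumptions rather than anything from the text), a Hive table can be laid over files that already exist in HDFS by using an external table, so no data is copied:

    -- Hypothetical external table that maps tab-delimited files already
    -- sitting in HDFS, making them queryable through Hive and ODBC.
    CREATE EXTERNAL TABLE clicks (
        click_time STRING,
        page_id    INT,
        user_id    INT
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    STORED AS TEXTFILE
    LOCATION '/data/clicks';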
After going through this setup, you gain the ability to query your Hadoop
data in SSIS (and other tools) as if it resides in a relational database. This
offers the lowest-friction approach to using your Hadoop data in SSIS.
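Once that is in place, the statement an SSIS ODBC source sends to Hive is ordinary SQL; for example, against the hypothetical clicks table defined above:

    -- A query an SSIS ODBC Source component might issue against Hive.
    SELECT page_id, COUNT(*) AS click_count
    FROM clicks
    GROUP BY page_id;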
Writing data to Hadoop from SSIS is a little more challenging. In Chapter
6, “Adding Structure with Hive,” we discussed that Hive only supports bulk
insert operations, in keeping with the Hadoop approach of “write-once”
large files. Unfortunately, Hive uses some nonstandard SQL to handle these
bulk inserts, and the available ODBC drivers don't fully support it.
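For reference, the bulk-load statements in question look like the following HiveQL (the path and table names are illustrative); it is syntax such as this that the ODBC drivers do not fully pass through:

    -- Move an existing HDFS file into a Hive table in one bulk operation.
    LOAD DATA INPATH '/staging/clicks_batch.tsv' INTO TABLE clicks;

    -- Hive's other bulk form populates a table from a query rather than
    -- inserting individual rows:
    INSERT OVERWRITE TABLE clicks SELECT * FROM clicks_staging;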