Database Reference
In-Depth Information
Chapter 10
Integrating HDInsight with SQL Server
Integration Services
Microsoft SQL Server is a complete suite of tools that include a relational database management system (RDBMS),
multidimensional online analytical processing (OLAP) and tabular database engines, a brokering service, a
scheduling service (SQL Agent), and many other features. As discussed in Chapter 1, it has become extremely
important these days to integrate data between different sources. The advantage that SQL Server brings is that it offers
a powerful Business Intelligence (BI) stack, which provides rich features for data mining and interactive reporting.
One of these BI components is an Extract, Transform, and Load (ETL) tool called SQL Server Integration Services
(SSIS) . ETL is a process to extract data, mostly from different types of systems, transform it into a structure that's more
appropriate for reporting and analysis and finally load it into the database. SSIS, as an ETL tool offers the ability to
merge structured and unstructured data by importing Hive data into SQL Server and apply powerful analytics on the
integrated data. Throughout the rest of this chapter, you will get a basic lesson on how SSIS works and create a simple
SSIS package to import data from Hive to SQL Server.
SSIS as an ETL Tool
The primary objective of an ETL tool is to be able to import and export data to and from heterogeneous data sources.
This includes the ability to connect to external systems, as well as to transform or clean the data while moving the
data between the external systems and the databases. SSIS can be used to import data to and from SQL Server. It can
even be used to move data between external non-SQL systems without requiring SQL Server to be the source or the
destination. For instance, SSIS can be used to move data from an FTP server to a local flat file.
SSIS also provides a workflow engine to automate various tasks (data flows, task executions, and so forth) that
are executed in an ETL job. An SSIS package execution can itself be one step that is part of an SQL Agent job, and SQL
Agent can run multiple jobs independent of each other.
An SSIS solution consists of one package or more, each containing a control flow to perform a sequence of
tasks. Tasks in a control flow can include calls to web services, FTP operations, file-system tasks, the automation of
command-line commands, and others. In particular, a control flow usually includes one or more data-flow tasks,
which encapsulate an in-memory, buffer-based pipeline of data from a source to a destination, with transformations
applied to the data as it flows through the pipeline. An SSIS package has one control flow and as many data flows as
necessary. Data-flow execution is dictated by the content of the control flow.
Detailed discussion on SSIS and its components are outside the scope of this topic. In this chapter, I assume you
are familiar with basic SSIS package development using Business Intelligence Development Studio (BIDS, in SQL
Server 2005/2008/2008 R2) or SQL Server Data Tools (in SQL Server 2012). If you are a beginner in SSIS, I recommend
that you read one of the many good, introductory SSIS topics available as a pre-requisite. In the rest of this chapter, we
will focus on how to consume Hive data from SSIS using the Hive Open Database Connectivity (ODBC) driver.
 
Search WWH ::




Custom Search