Industry Needs and Solutions - Microsoft Big Data Solutions

Pig Latin scripts define the flow of data through transformations and,

although simple to write, can result in complex and sophisticated

manipulation of data. So, even though Pig Latin is SQL-like syntactically,

it is more like a SQL Server Integration Services (SSIS) Data Flow task

in spirit. Pig Latin scripts can have multiple inputs, transformations, and

outputs. Pig has a large number of its own built-in functions, but you can

always either create your own or just “raid the piggybank” for

community-provided functions.

As previously mentioned, Pig provides its scalability by operating in a

distributed mode on a Hadoop cluster. However, Pig Latin programs can

also be run in a local mode. This does not use a Hadoop cluster; instead, the

processing takes place in a single local Java Virtual Machine (JVM). This is

certainly advantageous for iterative development and initial prototyping.
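
As a minimal sketch of the two execution modes described above, the following Python helper builds the command line for the `pig` launcher. The `-x local` and `-x mapreduce` options are Pig's execution-type switches; the helper name `pig_command` and the script name `wordcount.pig` are hypothetical and used only for illustration.

```python
import shlex


def pig_command(script_path, local=True):
    """Build the argument list for launching a Pig Latin script.

    local=True  -> "-x local": runs in a single local JVM, no cluster needed.
    local=False -> "-x mapreduce": submits the work to a Hadoop cluster.
    """
    exec_type = "local" if local else "mapreduce"
    return ["pig", "-x", exec_type, script_path]


if __name__ == "__main__":
    # "wordcount.pig" is a hypothetical script name.
    print(shlex.join(pig_command("wordcount.pig")))
    print(shlex.join(pig_command("wordcount.pig", local=False)))
```

The returned list could be passed to `subprocess.run` on a machine where Pig is installed. Keeping the mode as a single flag makes it easy to prototype locally and then rerun the same script unchanged on the cluster.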

SQOOP

SQOOP is a top-level Apache project. However, I like to think of Apache

SQOOP as a glue project. It provides the vehicle to transfer data from the

relational, tabular world of structured data stores to Apache Hadoop (and

vice versa).
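
To make the two directions of transfer concrete, here is a rough Python sketch that assembles SQOOP 1 command lines for an import (relational table into Hadoop) and an export (Hadoop directory back into a table). The options shown (`--connect`, `--table`, `--target-dir`, `--export-dir`, `--num-mappers`) are standard SQOOP 1 arguments; the function names, server name, database, table, and paths are assumptions made up for the example.

```python
def sqoop_import(jdbc_url, table, target_dir, mappers=1):
    """Argument list for copying a relational table into HDFS."""
    return [
        "sqoop", "import",
        "--connect", jdbc_url,          # JDBC connection string of the source database
        "--table", table,               # source table to read
        "--target-dir", target_dir,     # HDFS directory that receives the data
        "--num-mappers", str(mappers),  # number of parallel map tasks
    ]


def sqoop_export(jdbc_url, table, export_dir):
    """Argument list for copying an HDFS directory into a relational table."""
    return [
        "sqoop", "export",
        "--connect", jdbc_url,       # JDBC connection string of the target database
        "--table", table,            # existing target table to populate
        "--export-dir", export_dir,  # HDFS directory holding the data to send
    ]


if __name__ == "__main__":
    # Hypothetical SQL Server instance, database, table, and HDFS path.
    url = "jdbc:sqlserver://sqlhost:1433;databaseName=Sales"
    print(" ".join(sqoop_import(url, "Orders", "/data/orders", mappers=4)))
    print(" ".join(sqoop_export(url, "OrderSummary", "/data/order_summary")))
```

Credentials are deliberately left out of the sketch; in practice a user name and password option would be added to each command.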

SQOOP is extensible to allow developers to create new connectors using

the SQOOP application programming interface (API). This is a core part

of SQOOP's architecture, enabling a plug-and-play framework for new

connectors.

SQOOP is currently going through something of a re-imagining process. As a

result, there are now two versions of SQOOP. SQOOP 1 is a client application

architecture that interacts directly with the Hadoop configurations and

databases. SQOOP 1 also experienced a number of challenges in its

development. SQOOP 2 aims to address the original design issues and starts

from a server-based architecture. These are discussed in more detail later in

this topic.

Historically, SQL Server had SQOOP connectors that were separate

downloads available from Microsoft. These have now been rolled into

SQOOP 1.4 and are also included in the HDInsight Service. SQL Server

Parallel Data Warehouse (PDW) has an alternative technology, PolyBase,

which we discuss in more detail in Chapter 10, “Data Warehouses and

Hadoop Integration.”
