Pig Latin scripts define the flow of data through transformations and,
although simple to write, can result in complex and sophisticated
manipulation of data. So, even though Pig Latin is SQL-like syntactically,
it is more like a SQL Server Integration Services (SSIS) Data Flow task
in spirit. Pig Latin scripts can have multiple inputs, transformations, and
outputs. Pig has a large number of its own built-in functions, but you can
always either create your own or just “raid the piggybank”
(https://cwiki.apache.org/confluence/display/PIG/PiggyBank) for
community-provided functions.
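To make the data-flow idea concrete, here is a minimal sketch in plain Python rather than Pig Latin. It only mimics the shape of a flow with two inputs, a few transformations, and two outputs. The sample records, the run_flow function, and the min_amount threshold are all invented for illustration and are not part of Pig.

```python
from collections import defaultdict

# Hypothetical in-memory inputs standing in for two LOAD statements.
orders = [
    {"customer_id": 1, "amount": 20.0},
    {"customer_id": 2, "amount": 5.0},
    {"customer_id": 1, "amount": 35.0},
]
customers = [
    {"customer_id": 1, "name": "Ada"},
    {"customer_id": 2, "name": "Bob"},
]


def run_flow(orders, customers, min_amount=10.0):
    """Two inputs -> filter, join, group and sum -> two outputs."""
    # FILTER-like step: split orders around the threshold.
    large = [o for o in orders if o["amount"] >= min_amount]
    rejected = [o for o in orders if o["amount"] < min_amount]

    # JOIN-like step: look up each customer's name by key.
    names = {c["customer_id"]: c["name"] for c in customers}

    # GROUP and SUM-like step: total amount per customer name.
    totals = defaultdict(float)
    for o in large:
        totals[names[o["customer_id"]]] += o["amount"]

    # Two outputs, analogous to two STORE statements.
    return dict(totals), rejected


totals, rejected = run_flow(orders, customers)
```

Each step takes the output of the previous one as its input, which is the sense in which such a script resembles a data flow task more than a single SQL query.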
As previously mentioned, Pig provides its scalability by operating in a
distributed mode on a Hadoop cluster. However, Pig Latin programs can
also be run in a local mode. This does not use a Hadoop cluster; instead, the
processing takes place in a single local Java Virtual Machine (JVM). This is
certainly advantageous for iterative development and initial prototyping.
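As a rough illustration of switching between the two modes, the sketch below builds the command line for the pig launcher, which accepts -x local or -x mapreduce to select the execution type. The pig_command helper and the script name wordcount.pig are hypothetical, and the sketch assumes the pig executable is on the PATH.

```python
def pig_command(script_path, local=True):
    """Build the argv list for running a Pig Latin script.

    "-x local" runs in a single local JVM; "-x mapreduce" targets
    a Hadoop cluster.
    """
    mode = "local" if local else "mapreduce"
    return ["pig", "-x", mode, script_path]


# Hypothetical script name, used only for illustration.
dev_cmd = pig_command("wordcount.pig")
cluster_cmd = pig_command("wordcount.pig", local=False)
```

The resulting list could be passed to subprocess.run once a script has been prototyped locally and is ready to run against the cluster.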
SQOOP
SQOOP is a top-level Apache project. However, I like to think of Apache
SQOOP as a glue project. It provides the vehicle to transfer data from the
relational, tabular world of structured data stores to Apache Hadoop (and
vice versa).
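The two directions of transfer can be sketched with the SQOOP 1 command-line tool, whose import and export commands take a JDBC connection string, a table name, and an HDFS directory. The helper below only assembles the arguments. The sqoop_command function, the server name, the database, the table, and the HDFS path are all made up for illustration, and credentials are deliberately left out.

```python
def sqoop_command(direction, jdbc_url, table, hdfs_dir):
    """Build the argv list for a SQOOP 1 import or export.

    import: relational table -> HDFS directory (--target-dir)
    export: HDFS directory (--export-dir) -> relational table
    """
    if direction not in ("import", "export"):
        raise ValueError("direction must be 'import' or 'export'")
    dir_flag = "--target-dir" if direction == "import" else "--export-dir"
    return [
        "sqoop", direction,
        "--connect", jdbc_url,
        "--table", table,
        dir_flag, hdfs_dir,
    ]


# Hypothetical connection details, for illustration only.
url = "jdbc:sqlserver://dbhost:1433;databaseName=Sales"
to_hadoop = sqoop_command("import", url, "Orders", "/data/orders")
to_sql = sqoop_command("export", url, "Orders", "/data/orders")
```

The same connection string and table serve both directions; only the command and the directory option change.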
SQOOP is extensible to allow developers to create new connectors using
the SQOOP application programming interface (API). This is a core part
of SQOOP's architecture, enabling a plug-and-play framework for new
connectors.
SQOOP is currently going through something of a re-imagining process. As a
result, there are now two versions of SQOOP. SQOOP 1 is a client application
architecture that interacts directly with the Hadoop configurations and
databases. SQOOP 1 also experienced a number of challenges in its
development. SQOOP 2 aims to address the original design issues and starts
from a server-based architecture. These are discussed in more detail later in
this topic.
Historically, SQL Server had SQOOP connectors that were separate
downloads available from Microsoft. These have now been rolled into
SQOOP 1.4 and are also included in the HDInsight Service. SQL Server
Parallel Data Warehouse (PDW) has an alternative technology, Polybase,
which we discuss in more detail in Chapter 10, “Data Warehouses and
Hadoop Integration.”