Pig - Hadoop: The Definitive Guide

Comparison with Databases

Having seen Pig in action, it might seem that Pig Latin is similar to SQL. The presence of
such operators as GROUP BY and DESCRIBE reinforces this impression. However, there
are several differences between the two languages, and between Pig and relational database
management systems (RDBMSs) in general.

The most significant difference is that Pig Latin is a data flow programming language,
whereas SQL is a declarative programming language. In other words, a Pig Latin program
is a step-by-step set of operations on an input relation, in which each step is a single
transformation. By contrast, SQL statements are a set of constraints that, taken together,
define the output. In many ways, programming in Pig Latin is like working at the level of an
RDBMS query planner, which figures out how to turn a declarative statement into a system
of steps.
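The contrast can be sketched in Python rather than in Pig Latin itself. The sample rows, the use of 9999 as a missing-value marker, and both function names are assumptions made for illustration; the standard-library `sqlite3` module stands in for an RDBMS. The first function states the result in one declarative statement and leaves the plan to the engine, while the second spells out the steps one transformation at a time, roughly as a data flow program would.

```python
import sqlite3

# Hypothetical sample data: (year, temperature, quality) readings.
ROWS = [
    ("1949", 111, 1),
    ("1949", 78, 1),
    ("1950", 0, 1),
    ("1950", 22, 1),
    ("1950", -11, 1),
    ("1950", 9999, 1),  # assumption: 9999 marks a missing reading
]


def max_temp_declarative(rows):
    """Declarative style: one statement describes the output; the engine plans the steps."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE records (year TEXT, temperature INTEGER, quality INTEGER)"
    )
    conn.executemany("INSERT INTO records VALUES (?, ?, ?)", rows)
    result = conn.execute(
        "SELECT year, MAX(temperature) FROM records "
        "WHERE temperature != 9999 GROUP BY year ORDER BY year"
    ).fetchall()
    conn.close()
    return result


def max_temp_dataflow(rows):
    """Data flow style: each step is a single transformation of the previous relation."""
    # Step 1: filter out missing readings.
    filtered = [r for r in rows if r[1] != 9999]
    # Step 2: group the remaining temperatures by year.
    grouped = {}
    for year, temperature, _quality in filtered:
        grouped.setdefault(year, []).append(temperature)
    # Step 3: produce one (year, maximum) tuple per group.
    return sorted((year, max(temps)) for year, temps in grouped.items())


print(max_temp_declarative(ROWS))
print(max_temp_dataflow(ROWS))
```

Both functions return the same tuples. The difference is who decides the order of the filter, group, and aggregate steps: the SQL engine in the first case, the programmer in the second.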

RDBMSs store data in tables, with tightly predefined schemas. Pig is more relaxed about
the data that it processes: you can define a schema at runtime, but it's optional. Essentially,
it will operate on any source of tuples (although the source should support being read in
parallel, by being in multiple files, for example), where a UDF is used to read the tuples
from their raw representation. [97] The most common representation is a text file with
tab-separated fields, and Pig provides a built-in load function for this format. Unlike with a
traditional database, there is no data import process to load the data into the RDBMS. The
data is loaded from the filesystem (usually HDFS) as the first step in the processing.
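The idea of applying an optional schema while reading, instead of importing data into predefined tables, can be illustrated with a small Python sketch. This is not Pig's loader: `load_tsv`, the `schema` parameter, and the sample text are hypothetical names chosen for the example, and leaving fields as raw strings only loosely mirrors what happens when no schema is declared.

```python
import io


def load_tsv(stream, schema=None):
    """Yield one tuple per line of tab-separated text.

    schema is an optional list of (name, converter) pairs. Without it,
    every field is left as the raw string read from the source.
    """
    for line in stream:
        fields = line.rstrip("\n").split("\t")
        if schema is None:
            yield tuple(fields)
        else:
            # Apply each field's converter as the line is read.
            yield tuple(conv(f) for (_name, conv), f in zip(schema, fields))


# Hypothetical raw input, standing in for a file on the filesystem.
RAW = "1950\t0\t1\n1950\t22\t1\n1949\t111\t1\n"

untyped = list(load_tsv(io.StringIO(RAW)))
typed = list(
    load_tsv(
        io.StringIO(RAW),
        schema=[("year", str), ("temperature", int), ("quality", int)],
    )
)

print(untyped[0])
print(typed[0])
```

The same raw text serves both calls; the schema is supplied at read time and nothing is imported or converted ahead of processing.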

Pig's support for complex, nested data structures further differentiates it from SQL, which
operates on flatter data structures. Also, Pig's ability to use UDFs and streaming operators
that are tightly integrated with the language and Pig's nested data structures makes Pig
Latin more customizable than most SQL dialects.
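As a rough picture of what a nested structure means here, the following Python sketch groups flat tuples so that each group key is paired with a collection of whole tuples, instead of being collapsed to one flat row per group. The function name `group_by_year` and the sample records are assumptions for illustration, and a Python list is only an approximation of a nested collection of tuples.

```python
from collections import defaultdict


def group_by_year(records):
    """Return (key, bag) pairs, where each bag is a list of the original tuples."""
    bags = defaultdict(list)
    for record in records:
        bags[record[0]].append(record)  # keep the whole tuple inside the group
    return sorted(bags.items())


# Hypothetical flat input: (year, temperature) tuples.
RECORDS = [("1950", 0), ("1950", 22), ("1949", 111)]

nested = group_by_year(RECORDS)
for key, bag in nested:
    print(key, bag)
```

Each output element holds a key plus a nested list of tuples, so later steps can still operate on the individual records within a group.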

RDBMSs have several features to support online, low-latency queries, such as transactions
and indexes, that are absent in Pig. Pig does not support random reads or queries on the
order of tens of milliseconds. Nor does it support random writes to update small portions of
data; all writes are bulk streaming writes, just like with MapReduce.

Hive (covered in Chapter 17) sits between Pig and conventional RDBMSs. Like Pig, Hive
is designed to use HDFS for storage, but otherwise there are some significant differences.
Its query language, HiveQL, is based on SQL, and anyone who is familiar with SQL will
have little trouble writing queries in HiveQL. Like RDBMSs, Hive mandates that all data
be stored in tables, with a schema under its management; however, it can associate a
schema with preexisting data in HDFS, so the load step is optional. Pig is able to work with
Hive tables using HCatalog; this is discussed further in "Using Hive tables with HCatalog".
