Comparison with Databases
Having seen Pig in action, it might seem that Pig Latin is similar to SQL. The presence of
such operators as GROUP BY and DESCRIBE reinforces this impression. However, there
are several differences between the two languages, and between Pig and relational database
management systems (RDBMSs) in general.
The most significant difference is that Pig Latin is a data flow programming language,
whereas SQL is a declarative programming language. In other words, a Pig Latin program
is a step-by-step set of operations on an input relation, in which each step is a single
transformation. By contrast, SQL statements are a set of constraints that, taken together, define
the output. In many ways, programming in Pig Latin is like working at the level of an
RDBMS query planner, which figures out how to turn a declarative statement into a system
of steps.
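The contrast can be sketched outside Pig itself. The following Python is an analogy only, not Pig Latin: it computes the same result once with a single declarative SQL statement (using the standard-library sqlite3 module) and once as an explicit sequence of transformations, the way a data flow program would be written. The sample records, the column names, and the use of 9999 as a missing-value marker are assumptions made for illustration.

```python
import sqlite3
from collections import defaultdict

# Hypothetical sample data: (year, temperature, quality) tuples.
records = [
    (1950, 0, 1),
    (1950, 22, 1),
    (1950, -11, 1),
    (1949, 111, 1),
    (1949, 78, 1),
    (1949, 9999, 1),  # 9999 is assumed to mark a missing reading
]


def max_temp_declarative(rows):
    """Declarative style: one SQL statement describes the desired output."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE records (year INTEGER, temperature INTEGER, quality INTEGER)"
    )
    conn.executemany("INSERT INTO records VALUES (?, ?, ?)", rows)
    result = conn.execute(
        "SELECT year, MAX(temperature) FROM records "
        "WHERE temperature != 9999 AND quality = 1 "
        "GROUP BY year ORDER BY year"
    ).fetchall()
    conn.close()
    return result


def max_temp_dataflow(rows):
    """Data flow style: each step is a single transformation of the previous one."""
    # Step 1: filter out missing or low-quality readings.
    filtered = [r for r in rows if r[1] != 9999 and r[2] == 1]
    # Step 2: group the remaining temperatures by year.
    grouped = defaultdict(list)
    for year, temperature, _quality in filtered:
        grouped[year].append(temperature)
    # Step 3: reduce each group to its maximum.
    return sorted((year, max(temps)) for year, temps in grouped.items())


print(max_temp_declarative(records))
print(max_temp_dataflow(records))
```

In the declarative version the database decides how to filter, group, and aggregate; in the data flow version the programmer spells out that order, which is roughly the level at which a query planner works.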
RDBMSs store data in tables, with tightly predefined schemas. Pig is more relaxed about
the data that it processes: you can define a schema at runtime, but it's optional. Essentially,
it will operate on any source of tuples (although the source should support being read in
parallel, by being in multiple files, for example), where a UDF is used to read the tuples
from their raw representation. [97] The most common representation is a text file with
tab-separated fields, and Pig provides a built-in load function for this format. Unlike with a
traditional database, there is no data import process to load the data into the RDBMS. The
data is loaded from the filesystem (usually HDFS) as the first step in the processing.
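As a rough illustration of an optional, runtime-declared schema, the Python sketch below reads tab-separated lines into tuples. It is an analogy rather than Pig's actual load function: the name load_tsv and the (name, type) schema format are hypothetical. When no schema is supplied the fields are left untyped (plain strings here); when one is supplied, each field is cast as it is read.

```python
import io


def load_tsv(source, schema=None):
    """Yield one tuple per tab-separated line.

    schema is an optional list of (field_name, cast) pairs. This format is
    an assumption for illustration, not Pig's schema syntax.
    """
    for line in source:
        fields = line.rstrip("\n").split("\t")
        if schema is None:
            # No schema declared: pass the raw fields through unchanged.
            yield tuple(fields)
        else:
            # Schema declared at read time: cast each field in order.
            # zip stops at the shorter sequence, so extra fields are dropped.
            yield tuple(cast(value) for (_name, cast), value in zip(schema, fields))


raw = "1950\t0\t1\n1949\t111\t1\n"

untyped = list(load_tsv(io.StringIO(raw)))
typed = list(
    load_tsv(
        io.StringIO(raw),
        schema=[("year", int), ("temperature", int), ("quality", int)],
    )
)
print(untyped)
print(typed)
```

The same file serves both calls; nothing is imported into a managed store first, which mirrors the point that the load happens as the first step of processing.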
Pig's support for complex, nested data structures further differentiates it from SQL, which
operates on flatter data structures. Also, UDFs and streaming operators are tightly integrated
with both the language and Pig's nested data structures, which makes Pig Latin more
customizable than most SQL dialects.
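To make the idea of nested data concrete, here is a small Python analogy (not Pig Latin). Grouping yields one tuple per key, and each of those tuples carries a nested collection of the original tuples, much like the bag produced by grouping in Pig. A user-defined function can then operate on that nested collection directly. The function names and sample rows are hypothetical.

```python
from itertools import groupby
from operator import itemgetter


def group_by_field(rows, index):
    """Return (key, bag) pairs, where bag is the list of rows sharing that key."""
    ordered = sorted(rows, key=itemgetter(index))
    return [(key, list(bag)) for key, bag in groupby(ordered, key=itemgetter(index))]


def temperature_range(bag):
    """A stand-in for a UDF: takes a nested bag of (year, temperature) tuples."""
    temps = [temperature for _year, temperature in bag]
    return max(temps) - min(temps)


rows = [(1950, 0), (1949, 111), (1950, 22), (1949, 78)]

grouped = group_by_field(rows, 0)
ranges = [(year, temperature_range(bag)) for year, bag in grouped]
print(grouped)
print(ranges)
```

A flat SQL result set has no direct equivalent of the inner list; the grouped rows would have to be aggregated away or returned as separate rows.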
RDBMSs have several features to support online, low-latency queries, such as transactions
and indexes, that are absent in Pig. Pig does not support random reads or queries on the
order of tens of milliseconds. Nor does it support random writes to update small portions of
data; all writes are bulk streaming writes, just like with MapReduce.
Hive (covered in Chapter 17) sits between Pig and conventional RDBMSs. Like Pig, Hive
is designed to use HDFS for storage, but otherwise there are some significant differences.
Its query language, HiveQL, is based on SQL, and anyone who is familiar with SQL will
have little trouble writing queries in HiveQL. Like RDBMSs, Hive mandates that all data
be stored in tables, with a schema under its management; however, it can associate a
schema with preexisting data in HDFS, so the load step is optional. Pig is able to work with
Hive tables using HCatalog; this is discussed further in "Using Hive tables with HCatalog".