Cloudware Application Development - Guide to Cloud Computing for Business and Technology Managers

Apache Software Foundation. Pig is widely used both inside and outside Yahoo for a wide range of tasks including ad hoc data analytics, ETL tasks, log processing, and training collaborative filtering models for recommendation systems.

The fundamental goals of designing Pig were as follows:

• Programming flexibility: Complex tasks that consist of multiple steps and interrelated data transformations should be encoded as dataflow sequences that are easy to design, develop, and maintain.

• Automatic optimization: Tasks are encoded in a way that lets the system optimize their execution automatically, so the user can focus on semantics rather than efficiency.

• Extensibility: Users can develop user-defined functions (UDFs) for more complex processing requirements.

Pig queries are expressed in a declarative scripting language called Pig Latin, which provides SQL-like functionality tailored to the specific needs of big data. Most notably, from a syntactic point of view, Pig Latin enforces explicit specification of the dataflow as a sequence of expressions chained together through variables. This style of programming differs from SQL, where the order of computation is not reflected at the language level. It is also better suited to the ad hoc nature of Pig, because the more readable code makes queries easier to develop and maintain.
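
The chained, variable-by-variable style described above can be imitated in plain Python. This is only a minimal sketch of the idea, not Pig Latin itself: the input data, the field layout, and the variable names (visits, long_visits, by_url, counts) are assumptions made for illustration. Each named intermediate result feeds the next step, so the order of computation is visible in the code.

```python
from collections import defaultdict

# Hypothetical in-memory input: (user, url, seconds_on_page) tuples.
visits = [
    ("alice", "/home", 12),
    ("bob", "/home", 3),
    ("alice", "/cart", 40),
    ("carol", "/cart", 25),
    ("bob", "/cart", 8),
]

# Step 1: keep only visits of at least 10 seconds (a filter step).
long_visits = [v for v in visits if v[2] >= 10]

# Step 2: group the filtered tuples by URL (a group step).
by_url = defaultdict(list)
for user, url, seconds in long_visits:
    by_url[url].append((user, url, seconds))

# Step 3: emit one (url, count) tuple per group, sorted by URL.
counts = sorted((url, len(group)) for url, group in by_url.items())

print(counts)
```

An equivalent SQL query would state the filter, grouping, and aggregation in a single statement and leave the evaluation order to the engine. Here each step has a name and can be inspected on its own.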

Unlike traditional SQL systems, Pig does not require the data to be stored in a system-specific format before a query can use it. Instead, the input and output formats are specified through storage functions inside the load and store expressions. In addition to ASCII and binary storage, users can implement their own storage functions to add support for other custom formats. Pig uses a dynamic type system to provide native support for nonnormalized data models. In addition to the simple data types used by relational databases, Pig defines three complex types (tuple, bag, and map), which can be nested arbitrarily to reflect the semistructured nature of the processed data. For better support of ad hoc queries, Pig does not maintain a catalog with schema information about the source data. Instead, the input schema is defined at the query level, either explicitly by the user or implicitly through type inference. At the top level, all input sources are treated as bags of tuples; the tuple schema can optionally be supplied as part of the load expression.
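
The load-time behavior described above can be sketched in Python under some loud assumptions: the function names load and tab_storage, the (field name, converter) schema layout, and the sample data are all hypothetical and are not part of Pig. A Python list stands in for a bag, which in Pig is an unordered collection of tuples. The sketch shows a pluggable storage function, an optional schema applied at load time, and a nested value that combines a tuple, a bag, and a map.

```python
from typing import Any, Callable, Iterable, List, Optional, Sequence, Tuple

def tab_storage(line: str) -> Tuple[str, ...]:
    """Hypothetical storage function: parse one tab-separated text line."""
    return tuple(line.rstrip("\n").split("\t"))

def load(
    lines: Iterable[str],
    storage: Callable[[str], Tuple[Any, ...]] = tab_storage,
    schema: Optional[Sequence[Tuple[str, Callable[[Any], Any]]]] = None,
) -> List[Tuple[Any, ...]]:
    """Return a bag (modeled as a list) of tuples; cast fields if a schema is given."""
    bag = []
    for line in lines:
        fields = storage(line)
        if schema is not None:
            # Each schema entry is (field_name, converter), applied by position.
            fields = tuple(conv(raw) for (_, conv), raw in zip(schema, fields))
        bag.append(fields)
    return bag

raw = ["alice\t34\n", "bob\t27\n"]

# Without a schema, every field stays as the raw string the storage function produced.
untyped = load(raw)

# With a schema supplied in the load call, fields are converted as they are read.
typed = load(raw, schema=[("name", str), ("age", int)])

# A nested value: a tuple holding a bag of tuples and a map.
record = ("alice", [("/home", 12), ("/cart", 40)], {"country": "DE", "plan": "free"})

print(untyped)
print(typed)
print(record[1][1], record[2]["plan"])
```

Swapping tab_storage for another parser changes the input format without touching the rest of the query, which is the role the text assigns to storage functions.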
