Information Technology Reference
In-Depth Information
Apache Software Foundation. Pig is widely used both inside and outside
Yahoo for a wide range of tasks including ad hoc data analytics, ETL tasks,
log processing, and training collaborative filtering models for recommenda-
tion systems.
The fundamental goals of designing Pig were as follows:
Programming flexibility : Complex tasks composed of multiple steps and
interrelated data transformations should be expressible as dataflow
sequences that are easy to design, develop, and maintain.
Automatic optimization : Tasks are encoded so that the system can optimize
their execution automatically. This lets the user focus on program
semantics rather than efficiency.
Extensibility : Users can develop user-defined functions (UDFs) for
more complex processing requirements.
Pig queries are expressed in a declarative scripting language called Pig
Latin, which provides SQL-like functionality tailored toward big data's
specific needs. Most notably from a syntactic point of view, Pig Latin
enforces explicit specification of the dataflow as a sequence of expressions
chained together through the use of variables. This style of programming
differs from SQL, where the order of computation is not reflected at the
language level. It is better suited to the ad hoc nature of Pig because the
more readable code makes queries easier to develop and maintain.
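The step-by-step dataflow style described above can be illustrated with a minimal sketch. It is written in Python rather than Pig Latin, and the input records, field layout, and variable names are hypothetical. Each statement binds an intermediate result to a variable that the next statement consumes, loosely mirroring how FILTER, GROUP, and FOREACH ... GENERATE expressions are chained in a Pig Latin script.

```python
from collections import defaultdict

# Hypothetical input relation: (user, url, bytes) records.
logs = [
    ("alice", "/home", 2048),
    ("bob", "/home", 512),
    ("alice", "/search", 4096),
    ("carol", "/about", 1500),
]

# Step 1: keep only large responses (analogous to a FILTER expression).
large = [rec for rec in logs if rec[2] > 1024]

# Step 2: group the surviving records by user (analogous to GROUP).
by_user = defaultdict(list)
for rec in large:
    by_user[rec[0]].append(rec)

# Step 3: aggregate each group (analogous to FOREACH ... GENERATE with SUM).
totals = sorted((user, sum(rec[2] for rec in recs)) for user, recs in by_user.items())

print(totals)
```

The order of computation is visible directly in the order of the statements, which is the property the text contrasts with SQL.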
Unlike traditional SQL systems, the data do not have to be stored in a
system-specific format before they can be used by a query. Instead, the input
and output formats are specified through storage functions inside the load
and store expressions. In addition to ASCII and binary storage, users can
implement their own storage functions to add support for other custom
formats. Pig uses a dynamic type system to provide native support for
nonnormalized data models. In addition to the simple data types used
by relational databases, Pig defines three complex types (tuple, bag, and
map) that can be nested arbitrarily to reflect the semistructured nature
of the processed data. For better support of ad hoc queries, Pig does not
maintain a catalog with schema information about the source data. Instead,
the input schema is defined at the query level, either explicitly by the user or
implicitly through type inference. At the top level, all input sources are
treated as bags of tuples; the tuple schema can be optionally supplied as
part of the load expression.
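As a rough illustration of these ideas, the following Python sketch models a load step that takes a pluggable storage function and an optional schema, and returns a bag of tuples. It is not Pig's actual API: the names `tab_storage` and `load`, the sample data, and the representation of a schema as (name, type) pairs are all assumptions made for this example.

```python
def tab_storage(line):
    """Hypothetical storage function: parse one tab-separated line into a tuple."""
    return tuple(line.rstrip("\n").split("\t"))


def load(lines, storage=tab_storage, schema=None):
    """Return a bag (modeled as a list) of tuples.

    If a schema is given as (name, type) pairs, each field is cast to its type;
    otherwise fields are left as raw strings.
    """
    bag = []
    for line in lines:
        fields = storage(line)
        if schema is not None:
            fields = tuple(cast(value) for (_, cast), value in zip(schema, fields))
        bag.append(fields)
    return bag


raw = ["alice\t3\n", "bob\t5\n"]

# No schema supplied: every field stays an untyped string.
untyped = load(raw)

# Schema supplied as part of the load step: fields are typed.
typed = load(raw, schema=[("user", str), ("visits", int)])

# Complex types can nest: a tuple holding a bag of tuples and a map.
nested = ("alice", [("/home", 2), ("/search", 1)], {"country": "US"})

print(untyped, typed, nested)
```

Supporting another input format in this sketch only requires passing a different `storage` callable, which is the role the text assigns to user-defined storage functions.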