Using Pig for Data Movement
Pig was originally developed for much the same reasons as Hive. Users
needed a way to work with MapReduce without becoming Java developers.
Pig solves that problem by providing a language, Pig Latin, that is easy
to understand and allows the developer to express the intent of the data
transformation, instead of having to code each step explicitly.
Another major benefit of Pig is its ability to scale, so that large data
transformation processes can be run across many nodes. This makes
processing large data sets much more feasible. Because Pig uses MapReduce
under the covers, it benefits from MapReduce's ability to scale across the
nodes in your Hadoop cluster.
Pig does come with some downsides. It cannot natively write to other data
stores, so it is primarily useful for transforming data inside the Hadoop
ecosystem. Also, because there is some overhead in preparing and executing
the MapReduce jobs, it's not an ideal choice for data transformations that
are transactional in nature. Instead, it does best when processing large
amounts of data in batch operations.
Transforming Data with Pig
Pig can be run in batch or interactive mode. To run it in batch mode, save
your Pig commands to a file and pass that file as an argument to the Pig
executable. To run commands interactively, start the Pig executable from the
command prompt without a script argument, which opens Pig's interactive shell (Grunt).
Pig uses a language, Pig Latin, to define the data transformations that will be
done. Pig Latin statements are operators that take a relation and produce
another relation. A relation, in Pig Latin terms, is a collection of tuples,
and a tuple is a collection of fields. One way to envision this is that a
relation is like a table in a database. The table has a collection of rows, which
are analogous to the tuples. The columns in a row are analogous to the
fields. The primary difference between a relation and a database table is that
relations do not require that all the tuples have the same number or type of
fields in them.
An example Pig Latin statement follows. This statement loads information
from Hadoop into a relation. Statements must be terminated with
semicolons, and extra whitespace is ignored:
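As a minimal sketch of what such a load statement can look like, consider the following. The HDFS path, the relation names, and the field names are hypothetical and chosen only for illustration; PigStorage is Pig's built-in delimited-text loader. The second statement is included only to show an operator taking one relation and producing another.

```pig
-- Hypothetical HDFS path and field names, used only for illustration.
-- The statement spans several lines; extra whitespace is ignored and
-- the terminating semicolon marks where the statement ends.
customers = LOAD '/user/hadoop/customers.csv'
            USING PigStorage(',')
            AS (id:int, name:chararray, city:chararray);

-- An operator takes a relation (customers) and produces another
-- relation (seattle_customers).
seattle_customers = FILTER customers BY city == 'Seattle';
```

The AS clause is optional. If it is omitted, Pig loads the fields without names or declared types, and they can be referenced by position ($0, $1, and so on), which is one way tuples of differing shapes end up in the same relation.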