Like Hadoop, Pig originated at Yahoo! in 2006. Pig was transferred to the
Apache Software Foundation in 2007 and had its first release as an Apache Hadoop
subproject in 2008. As Pig has evolved, three main characteristics have
persisted: ease of programming, behind-the-scenes code optimization, and
extensibility of capabilities [24].
With Apache Hadoop and Pig already installed, a Pig session begins by typing
pig at the command prompt to enter the Pig execution environment and then
entering a sequence of Pig Latin instructions at the grunt prompt.
An example of Pig-specific commands is shown here:
$ pig
grunt> records = LOAD '/user/customer.txt' AS
           (cust_id:INT, first_name:CHARARRAY,
            last_name:CHARARRAY,
            email_address:CHARARRAY);
grunt> filtered_records = FILTER records
           BY email_address matches '.*@isp.com';
grunt> STORE filtered_records INTO '/user/isp_customers';
grunt> quit
$
At the first grunt prompt, a text file is designated by the Pig variable
records with four defined fields: cust_id, first_name, last_name, and
email_address. Next, the variable filtered_records is assigned those
records where the email_address ends with @isp.com to extract the customers
whose e-mail address is from a particular Internet service provider (ISP). Using
the STORE command, the filtered records are written to an HDFS folder,
isp_customers. Finally, to exit the interactive Pig environment, execute the
quit command. Alternatively, these individual Pig commands could be written
to a file, filter_script.pig, and submitted at the command prompt as
follows:
$ pig filter_script.pig
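In this case, filter_script.pig would simply contain the statements entered
earlier at the grunt prompt, without the prompts themselves; the quit command
is unnecessary in batch mode:
-- filter_script.pig: extract customers whose e-mail
-- address belongs to the @isp.com domain
records = LOAD '/user/customer.txt' AS
          (cust_id:INT, first_name:CHARARRAY,
           last_name:CHARARRAY,
           email_address:CHARARRAY);
filtered_records = FILTER records
                   BY email_address matches '.*@isp.com';
STORE filtered_records INTO '/user/isp_customers';
Once the script completes, the output can be inspected directly from HDFS,
for example with hdfs dfs -cat /user/isp_customers/part-*.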
Such Pig instructions are translated, behind the scenes, into one or more
MapReduce jobs. Thus, Pig simplifies the coding of a MapReduce job and enables
the user to quickly develop, test, and debug the Pig code. In this particular
example, the MapReduce job would be initiated only after the STORE command is
processed; the earlier LOAD and FILTER statements merely build up the
execution plan.
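Two built-in Pig commands are helpful for observing this behavior during
development: EXPLAIN prints the logical, physical, and MapReduce plans that
Pig generates for a relation, and DUMP forces execution and prints the
resulting records to the console instead of writing them to HDFS:
grunt> EXPLAIN filtered_records;
grunt> DUMP filtered_records;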