(with a fully distributed cluster) is what you use when you want to run Pig on large datasets.
To use MapReduce mode, you first need to check that the version of Pig you downloaded
is compatible with the version of Hadoop you are using. Pig releases will only work
against particular versions of Hadoop; this is documented in the release notes.
Pig honors the HADOOP_HOME environment variable for finding which Hadoop client to run. However, if it is not set, Pig will use a bundled copy of the Hadoop libraries. Note that these may not match the version of Hadoop running on your cluster, so it is best to explicitly set HADOOP_HOME.
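For example, on a Unix-like system you might set the variable in your shell profile. This is just a sketch; the install path below is a placeholder for wherever your Hadoop client actually lives:

```shell
# Hypothetical Hadoop install location -- substitute your own path.
export HADOOP_HOME=/usr/local/hadoop

# Putting the client on the PATH lets Pig (and you) find the hadoop command.
export PATH="$HADOOP_HOME/bin:$PATH"
```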
Next, you need to point Pig at the cluster's namenode and resource manager. If the installation of Hadoop at HADOOP_HOME is already configured for this, then there is nothing more to do. Otherwise, you can set HADOOP_CONF_DIR to a directory containing the Hadoop site file (or files) that define fs.defaultFS, yarn.resourcemanager.address, and mapreduce.framework.name (the latter should be set to yarn).
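As a sketch, pointing Pig at a separate configuration directory might look like the following; the directory path is an assumption, so use wherever your core-site.xml and related site files actually live:

```shell
# Hypothetical location of the Hadoop site files
# (core-site.xml, yarn-site.xml, mapred-site.xml).
export HADOOP_CONF_DIR=/etc/hadoop/conf
```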
Alternatively, you can set these properties in the pig.properties file in Pig's conf directory (or the directory specified by PIG_CONF_DIR). Here's an example for a pseudo-distributed setup:
fs.defaultFS=hdfs://localhost/
mapreduce.framework.name=yarn
yarn.resourcemanager.address=localhost:8032
Once you have configured Pig to connect to a Hadoop cluster, you can launch Pig, setting the -x option to mapreduce or omitting it entirely, as MapReduce mode is the default. We've used the -brief option to stop timestamps from being logged:
% pig -brief
Logging error messages to: /Users/tom/pig_1414246949680.log
Default bootup file /Users/tom/.pigbootup not found
Connecting to hadoop file system at: hdfs://localhost/
grunt>
As you can see from the output, Pig reports the filesystem (but not the YARN resource
manager) that it has connected to.
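Since MapReduce mode is the default, the following two invocations are equivalent (script.pig here is a hypothetical Pig Latin script, not a file from this example):

```shell
% pig -x mapreduce script.pig
% pig script.pig
```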
In MapReduce mode, you can optionally enable auto-local mode (by setting pig.auto.local.enabled to true), which is an optimization that runs small jobs locally if the input is less than 100 MB (set by