MapReduce mode (with a fully distributed cluster) is what you use when you want to run Pig on large datasets.
To use MapReduce mode, you first need to check that the version of Pig you downloaded
is compatible with the version of Hadoop you are using. Pig releases will only work
against particular versions of Hadoop; this is documented in the release notes.
Pig honors the HADOOP_HOME environment variable for finding which Hadoop client to run. However, if it is not set, Pig will use a bundled copy of the Hadoop libraries. Note that these may not match the version of Hadoop running on your cluster, so it is best to explicitly set HADOOP_HOME.
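For example, assuming Hadoop is installed under /usr/local/hadoop (a placeholder path used here for illustration), you would set:
% export HADOOP_HOME=/usr/local/hadoop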
Next, you need to point Pig at the cluster's namenode and resource manager. If the installation of Hadoop at HADOOP_HOME is already configured for this, then there is nothing more to do. Otherwise, you can set HADOOP_CONF_DIR to a directory containing the Hadoop site file (or files) that define fs.defaultFS, yarn.resourcemanager.address, and mapreduce.framework.name (the latter should be set to yarn).
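For example, if the site files live under /etc/hadoop/conf (a hypothetical location used here for illustration):
% export HADOOP_CONF_DIR=/etc/hadoop/conf
In a standard Hadoop installation, fs.defaultFS is defined in core-site.xml, yarn.resourcemanager.address in yarn-site.xml, and mapreduce.framework.name in mapred-site.xml.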
Alternatively, you can set these properties in the pig.properties file in Pig's conf directory (or the directory specified by PIG_CONF_DIR). Here's an example for a pseudo-distributed setup:
fs.defaultFS=hdfs://localhost/
mapreduce.framework.name=yarn
yarn.resourcemanager.address=localhost:8032
Once you have configured Pig to connect to a Hadoop cluster, you can launch Pig, setting
the -x option to mapreduce or omitting it entirely, as MapReduce mode is the default.
We've used the -brief option to stop timestamps from being logged:
% pig -brief
Logging error messages to: /Users/tom/pig_1414246949680.log
Default bootup file /Users/tom/.pigbootup not found
Connecting to hadoop file system at: hdfs://localhost/
grunt>
As you can see from the output, Pig reports the filesystem (but not the YARN resource
manager) that it has connected to.
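The following invocation is equivalent, setting the execution mode explicitly:
% pig -x mapreduce -brief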
In MapReduce mode, you can optionally enable auto-local mode (by setting pig.auto.local.enabled to true), which is an optimization that runs small jobs locally if the input is less than 100 MB (set by pig.auto.local.input.maxbytes) and no more than one reducer is needed.
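For example, to enable the optimization you could add the following lines to pig.properties (the second line simply restates the default 100 MB threshold, expressed in bytes, to show the property in use):
pig.auto.local.enabled=true
pig.auto.local.input.maxbytes=100000000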