(with a fully distributed cluster) is what you use when you want to run Pig on large datasets.
To use MapReduce mode, you first need to check that the version of Pig you downloaded
is compatible with the version of Hadoop you are using. Pig releases will only work
against particular versions of Hadoop; this is documented in the release notes.
Pig honors the HADOOP_HOME environment variable for finding which Hadoop client to run. However, if it is not set, Pig will use a bundled copy of the Hadoop libraries. Note that these may not match the version of Hadoop running on your cluster, so it is best to explicitly set HADOOP_HOME.
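For example, on a Unix-like system you might set the variable in your shell profile. This is just a sketch; the install path below is a placeholder for wherever your Hadoop client actually lives:

```shell
# Hypothetical Hadoop install location -- substitute your own path.
export HADOOP_HOME=/usr/local/hadoop

# Putting the client on the PATH lets Pig (and you) find the hadoop command.
export PATH="$HADOOP_HOME/bin:$PATH"
```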
Next, you need to point Pig at the cluster's namenode and resource manager. If the installation of Hadoop at HADOOP_HOME is already configured for this, then there is nothing more to do. Otherwise, you can set HADOOP_CONF_DIR to a directory containing the Hadoop site file (or files) that define fs.defaultFS, yarn.resourcemanager.address, and mapreduce.framework.name (the latter should be set to yarn).
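As a sketch, pointing Pig at a separate configuration directory might look like the following; the directory path is an assumption, so use wherever your core-site.xml and related site files actually live:

```shell
# Hypothetical location of the Hadoop site files
# (core-site.xml, yarn-site.xml, mapred-site.xml).
export HADOOP_CONF_DIR=/etc/hadoop/conf
```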
Alternatively, you can set these properties in the pig.properties file in Pig's conf directory (or the directory specified by PIG_CONF_DIR). Here's an example for a pseudo-distributed setup:
fs.defaultFS=hdfs://localhost/
mapreduce.framework.name=yarn
yarn.resourcemanager.address=localhost:8032
Once you have configured Pig to connect to a Hadoop cluster, you can launch Pig, setting the -x option to mapreduce or omitting it entirely, as MapReduce mode is the default. We've used the -brief option to stop timestamps from being logged:
% pig -brief
Logging error messages to: /Users/tom/pig_1414246949680.log
Default bootup file /Users/tom/.pigbootup not found
Connecting to hadoop file system at: hdfs://localhost/
grunt>
As you can see from the output, Pig reports the filesystem (but not the YARN resource
manager) that it has connected to.
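Since MapReduce mode is the default, the following two invocations are equivalent (script.pig here is a hypothetical Pig Latin script, not a file from this example):

```shell
% pig -x mapreduce script.pig
% pig script.pig
```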
In MapReduce mode, you can optionally enable auto-local mode (by setting pig.auto.local.enabled to true), which is an optimization that runs small jobs locally if the input is less than 100 MB (set by