can be on the local filesystem, on HDFS, or on another Hadoop-readable filesystem (such as S3). If no scheme is supplied, then the files are assumed to be local. (This is true even when the default filesystem is not the local filesystem.)

You can also copy archive files (JAR files, ZIP files, tar files, and gzipped tar files) to your tasks using the -archives option; these are unarchived on the task node. The -libjars option will add JAR files to the classpath of the mapper and reducer tasks. This is useful if you haven't bundled library JAR files in your job JAR file.
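As an aside, the point of shipping a metadata file with -files is usually to load it into an in-memory lookup table on the task node. A minimal, self-contained sketch of loading a fixed-width station file into a map might look like the following; the class name, the 12-character ID column width, and the sample records are illustrative assumptions, not the actual format of stations-fixed-width.txt:

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.Reader;
    import java.io.StringReader;
    import java.util.HashMap;
    import java.util.Map;

    // Hypothetical helper: builds a station-ID-to-name map from fixed-width
    // records. Column widths here are assumed for illustration only.
    public class StationLookupSketch {
      public static Map<String, String> load(Reader reader) throws IOException {
        Map<String, String> lookup = new HashMap<>();
        BufferedReader in = new BufferedReader(reader);
        String line;
        while ((line = in.readLine()) != null) {
          String id = line.substring(0, 12).trim();   // assumed: first 12 chars are the ID
          String name = line.substring(12).trim();    // assumed: remainder is the name
          lookup.put(id, name);
        }
        return lookup;
      }

      public static void main(String[] args) throws IOException {
        String sample = "011990-99999 SIHCCAJAVRI\n012650-99999 TYNSET-HANSMOEN";
        Map<String, String> lookup = load(new StringReader(sample));
        System.out.println(lookup.get("011990-99999")); // prints SIHCCAJAVRI
      }
    }

A task would typically build such a map once, in its setup method, and then consult it per record.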
Let's see how to use the distributed cache to share a metadata file for station names. The command we will run is:

    % hadoop jar hadoop-examples.jar \
        MaxTemperatureByStationNameUsingDistributedCacheFile \
        -files input/ncdc/metadata/stations-fixed-width.txt input/ncdc/all output
This command will copy the local file stations-fixed-width.txt (no scheme is supplied, so the path is automatically interpreted as a local file) to the task nodes, so we can use it to look up station names. The listing for MaxTemperatureByStationNameUsingDistributedCacheFile appears in Example 9-13.
Example 9-13. Application to find the maximum temperature by station, showing station
names from a lookup table passed as a distributed cache file
    public class MaxTemperatureByStationNameUsingDistributedCacheFile
        extends Configured implements Tool {

      static class StationTemperatureMapper
          extends Mapper<LongWritable, Text, Text, IntWritable> {

        private NcdcRecordParser parser = new NcdcRecordParser();

        @Override
        protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {

          parser.parse(value);
          if (parser.isValidTemperature()) {
            context.write(new Text(parser.getStationId()),
                new IntWritable(parser.getAirTemperature()));
          }
        }
      }

      static class MaxTemperatureReducerWithStationLookup