The addInputPath() and addInputPaths() methods add a path or paths to the
list of inputs. You can call these methods repeatedly to build the list of paths. The
setInputPaths() methods set the entire list of paths in one go (replacing any paths
set on the Job in previous calls).
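The difference between adding and setting can be sketched with a small stand-alone model. InputPathList below is a hypothetical stand-in for the list of paths held by the Job, not a Hadoop class, and the paths are made up for illustration. It assumes that addInputPaths() takes its paths as a single comma-separated string; the real methods operate on a Job and do more work (such as escaping special characters), which this sketch leaves out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Hypothetical stand-in for a job's list of input paths; not a Hadoop class. */
final class InputPathList {
    private final List<String> paths = new ArrayList<>();

    /** Models addInputPath(): appends one path to the existing list. */
    void addInputPath(String path) {
        paths.add(path);
    }

    /** Models addInputPaths(): appends each path from a comma-separated string. */
    void addInputPaths(String commaSeparatedPaths) {
        for (String p : commaSeparatedPaths.split(",")) {
            paths.add(p.trim());
        }
    }

    /** Models setInputPaths(): discards earlier paths and keeps only the new ones. */
    void setInputPaths(String... newPaths) {
        paths.clear();
        paths.addAll(Arrays.asList(newPaths));
    }

    /** Read-only view of the current list. */
    List<String> asList() {
        return Collections.unmodifiableList(paths);
    }

    public static void main(String[] args) {
        InputPathList inputs = new InputPathList();
        inputs.addInputPath("/data/2014/01");                 // illustrative path
        inputs.addInputPaths("/data/2014/02,/data/2014/03");  // illustrative paths
        System.out.println(inputs.asList());                  // three paths accumulated

        inputs.setInputPaths("/data/2015");                   // replaces everything above
        System.out.println(inputs.asList());                  // only /data/2015 remains
    }
}
```

Repeated add calls grow the list, whereas a single set call leaves only the paths passed to it.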
A path may represent a file, a directory, or, by using a glob, a collection of files and
directories. A path representing a directory includes all the files in the directory as input to
the job. See File patterns for more on using globs.
WARNING
The contents of a directory specified as an input path are not processed recursively. In fact, the directory
should only contain files. If the directory contains a subdirectory, it will be interpreted as a file, which
will cause an error. The way to handle this case is to use a file glob or a filter to select only the files in
the directory based on a name pattern. Alternatively, you can set
mapreduce.input.fileinputformat.input.dir.recursive to true to force the
input directory to be read recursively.
The add and set methods allow files to be specified by inclusion only. To exclude certain
files from the input, you can set a filter using the setInputPathFilter() method on
FileInputFormat. Filters are discussed in more detail in PathFilter.
Even if you don't set a filter, FileInputFormat uses a default filter that excludes hidden
files (those whose names begin with a dot or an underscore). If you set a filter by calling
setInputPathFilter(), it acts in addition to the default filter. In other words,
only nonhidden files that are accepted by your filter get through.
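The way the two filters combine can be modeled as a logical AND of two predicates. The sketch below is a stand-alone illustration, not Hadoop code: InputFilterSketch, NOT_HIDDEN, select(), and the file names are all hypothetical, and plain strings stand in for paths.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

/** Hypothetical model of how the default and user filters combine; not Hadoop code. */
final class InputFilterSketch {
    /** Models the default filter: rejects names that begin with '.' or '_'. */
    static final Predicate<String> NOT_HIDDEN =
        name -> !name.startsWith(".") && !name.startsWith("_");

    /** Keeps a name only if both the default filter and the user filter accept it. */
    static List<String> select(List<String> names, Predicate<String> userFilter) {
        return names.stream()
                    .filter(NOT_HIDDEN.and(userFilter))
                    .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList(
            "part-00000", "part-00001.tmp", "_SUCCESS", ".part-00000.crc");

        // Hypothetical user filter that excludes temporary files.
        Predicate<String> noTmp = name -> !name.endsWith(".tmp");

        System.out.println(select(names, noTmp)); // prints [part-00000]
    }
}
```

A user filter can narrow the set of input files further, but it cannot bring back a hidden file that the default filter has already rejected.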
Paths and filters can be set through configuration properties, too (Table 8-4), which can be
handy for Streaming jobs. Setting paths is done with the -input option for the Streaming
interface, so setting paths directly usually is not needed.