File patterns
It is a common requirement to process sets of files in a single operation. For example, a
MapReduce job for log processing might analyze a month's worth of files contained in a
number of directories. Rather than having to enumerate each file and directory to specify
the input, it is convenient to use wildcard characters to match multiple files with a single
expression, an operation that is known as globbing. Hadoop provides two FileSystem
methods for processing globs:
public FileStatus[] globStatus(Path pathPattern) throws IOException
public FileStatus[] globStatus(Path pathPattern, PathFilter filter)
    throws IOException
The globStatus() methods return an array of FileStatus objects whose paths
match the supplied pattern, sorted by path. An optional PathFilter can be specified to
restrict the matches further.
Hadoop supports the same set of glob characters as the Unix bash shell (see Table 3-2).
Table 3-2. Glob characters and their meanings
Glob     Name                      Matches
*        asterisk                  Matches zero or more characters
?        question mark             Matches a single character
[ab]     character class           Matches a single character in the set {a, b}
[^ab]    negated character class   Matches a single character that is not in the set {a, b}
[a-b]    character range           Matches a single character in the (closed) range [a, b],
                                   where a is lexicographically less than or equal to b
[^a-b]   negated character range   Matches a single character that is not in the (closed)
                                   range [a, b], where a is lexicographically less than or
                                   equal to b
{a,b}    alternation               Matches either expression a or b
\c       escaped character         Matches character c when it is a metacharacter
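To make the meanings in Table 3-2 concrete, here is a minimal sketch that translates each glob character into a java.util.regex equivalent and tests it against path strings. It is not Hadoop's implementation and does not call globStatus(). The class GlobSketch, its methods, and every sample path other than /2007/12/31 are hypothetical names chosen for illustration. The sketch assumes that * and ? never match a path separator, and it passes bracket bodies to the regex engine unchanged, so escapes inside brackets are not handled.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of Hadoop.
class GlobSketch {

    // Translate a glob built from the Table 3-2 characters into a regex.
    static Pattern compile(String glob) {
        StringBuilder re = new StringBuilder();
        int braceDepth = 0;
        for (int i = 0; i < glob.length(); i++) {
            char c = glob.charAt(i);
            switch (c) {
                case '*':                       // zero or more characters
                    re.append("[^/]*");         // assumption: never crosses '/'
                    break;
                case '?':                       // exactly one character
                    re.append("[^/]");
                    break;
                case '[': {                     // [ab], [^ab], [a-b], [^a-b]
                    int close = glob.indexOf(']', i + 1);
                    if (close < 0) {
                        throw new IllegalArgumentException("unclosed [ in " + glob);
                    }
                    // Sets, ranges, and ^ negation use the same notation in regex.
                    re.append('[').append(glob, i + 1, close).append(']');
                    i = close;
                    break;
                }
                case '{':                       // start of alternation
                    re.append("(?:");
                    braceDepth++;
                    break;
                case '}':
                    if (braceDepth > 0) {
                        re.append(')');
                        braceDepth--;
                    } else {
                        re.append("\\}");       // stray brace is a literal
                    }
                    break;
                case ',':                       // separator only inside {...}
                    re.append(braceDepth > 0 ? "|" : ",");
                    break;
                case '\\':                      // \c matches c literally
                    if (++i >= glob.length()) {
                        throw new IllegalArgumentException("trailing backslash in " + glob);
                    }
                    re.append(Pattern.quote(String.valueOf(glob.charAt(i))));
                    break;
                default:                        // ordinary character
                    re.append(Pattern.quote(String.valueOf(c)));
            }
        }
        if (braceDepth != 0) {
            throw new IllegalArgumentException("unclosed { in " + glob);
        }
        return Pattern.compile(re.toString());
    }

    static boolean matches(String glob, String path) {
        return compile(glob).matcher(path).matches();
    }

    public static void main(String[] args) {
        // /2007/12/31 comes from the text; the other paths are made up.
        System.out.println(matches("/2007/12/3?", "/2007/12/31"));
        System.out.println(matches("/200[78]/*/*", "/2008/01/01"));
        System.out.println(matches("/200[^7]/12/31", "/2007/12/31"));
        System.out.println(matches("/2007/{12,01}/31", "/2007/12/31"));
    }
}
```

Converting to a regular expression keeps the sketch short, because character sets, ranges, and negation carry over unchanged, leaving only *, ?, alternation, and escaping to translate by hand.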
Imagine that logfiles are stored in a directory structure organized hierarchically by date.
So, logfiles for the last day of 2007 would go in a directory named /2007/12/31, for
example. Suppose that the full file listing is: