File patterns
It is a common requirement to process sets of files in a single operation. For example, a
MapReduce job for log processing might analyze a month's worth of files contained in a
number of directories. Rather than having to enumerate each file and directory to specify
the input, it is convenient to use wildcard characters to match multiple files with a single
expression, an operation that is known as globbing. Hadoop provides two FileSystem methods for processing globs:
    public FileStatus[] globStatus(Path pathPattern) throws IOException
    public FileStatus[] globStatus(Path pathPattern, PathFilter filter) throws IOException
The globStatus() methods return an array of FileStatus objects whose paths match the supplied pattern, sorted by path. An optional PathFilter can be specified to restrict the matches further.
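The contract described above (match a pattern, optionally narrow the result with a filter, return the matches sorted by path) can be sketched without a Hadoop cluster. The following is a minimal illustration only, not Hadoop's implementation: it swaps in the JDK's own java.nio.file glob matcher in place of FileSystem.globStatus(), and the class name GlobStatusSketch, the method globPaths(), and the convention that a null filter means "no filter" are all hypothetical. It assumes a Unix-style default filesystem where "/" is the path separator.

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Hypothetical stand-in for the globStatus() contract, operating on plain strings.
final class GlobStatusSketch {

    // Returns the paths that match the glob pattern and pass the optional filter,
    // sorted by path. A null filter accepts every path (an assumption of this sketch).
    static List<String> globPaths(List<String> paths, String pattern, Predicate<String> filter) {
        // JDK glob matcher, used here instead of Hadoop's own glob handling.
        PathMatcher matcher = FileSystems.getDefault().getPathMatcher("glob:" + pattern);
        return paths.stream()
                .filter(p -> matcher.matches(Paths.get(p)))    // step 1: pattern match
                .filter(p -> filter == null || filter.test(p)) // step 2: optional filter
                .sorted()                                      // step 3: sort by path
                .collect(Collectors.toList());
    }
}
```

Passing a predicate such as p -> !p.endsWith("/31") as the third argument plays the role that a PathFilter plays in the two-argument globStatus() method: it can only remove paths that the pattern has already matched.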
Hadoop supports the same set of glob characters as the Unix bash shell (see Table 3-2).
Table 3-2. Glob characters and their meanings

Glob      Name                      Matches
*         asterisk                  Matches zero or more characters
?         question mark             Matches a single character
[ab]      character class           Matches a single character in the set {a, b}
[^ab]     negated character class   Matches a single character that is not in the set {a, b}
[a-b]     character range           Matches a single character in the (closed) range [a, b], where a is lexicographically less than or equal to b
[^a-b]    negated character range   Matches a single character that is not in the (closed) range [a, b], where a is lexicographically less than or equal to b
{a,b}     alternation               Matches either expression a or b
\c        escaped character         Matches character c when it is a metacharacter
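To make the meanings in Table 3-2 concrete, here is a small sketch that translates each glob character into the equivalent regular-expression construct. It is an illustration under stated assumptions, not Hadoop's glob code: the class GlobSketch and its methods toRegex() and matches() are hypothetical names, the wildcards * and ? are assumed not to match the "/" separator, and unusual characters inside a bracketed class are not specially handled.

```java
import java.util.regex.Pattern;

// Hypothetical helper: translates the glob characters of Table 3-2 into a regex.
final class GlobSketch {

    static String toRegex(String glob) {
        StringBuilder re = new StringBuilder();
        int braceDepth = 0; // nesting level of {a,b} alternations
        for (int i = 0; i < glob.length(); i++) {
            char c = glob.charAt(i);
            switch (c) {
                case '*': re.append("[^/]*"); break; // zero or more characters
                case '?': re.append("[^/]"); break;  // exactly one character
                case '[': {
                    // [ab], [^ab], [a-b], [^a-b] mean the same in a regex,
                    // so the class body is copied through unchanged.
                    int close = glob.indexOf(']', i + 1);
                    if (close < 0) throw new IllegalArgumentException("unclosed [ in " + glob);
                    re.append('[').append(glob, i + 1, close).append(']');
                    i = close;
                    break;
                }
                case '{': re.append("(?:"); braceDepth++; break; // start alternation
                case '}':
                    if (braceDepth > 0) { re.append(')'); braceDepth--; }
                    else re.append("\\}");
                    break;
                case ',': re.append(braceDepth > 0 ? "|" : ","); break;
                case '\\':
                    // \c matches the character c literally
                    if (++i >= glob.length()) throw new IllegalArgumentException("trailing backslash");
                    re.append(Pattern.quote(String.valueOf(glob.charAt(i))));
                    break;
                default:
                    re.append(Pattern.quote(String.valueOf(c)));
            }
        }
        if (braceDepth != 0) throw new IllegalArgumentException("unclosed { in " + glob);
        return re.toString();
    }

    // True if the whole path matches the glob.
    static boolean matches(String glob, String path) {
        return Pattern.matches(toRegex(glob), path);
    }
}
```

For example, under these assumptions GlobSketch.matches("/200[7-8]/1{1,2}/31", "/2007/12/31") is true, while GlobSketch.matches("/*", "/2007/12/31") is false because * stops at the next "/".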
Imagine that logfiles are stored in a directory structure organized hierarchically by date. So, logfiles for the last day of 2007 would go in a directory named /2007/12/31, for example. Suppose that the full file listing is: