The Java Interface
In this section, we dig into the Hadoop FileSystem class: the API for interacting with one of Hadoop's filesystems. Although we concentrate on the HDFS implementation, DistributedFileSystem, in general you should strive to write your code against the FileSystem abstract class, to retain portability across filesystems. This is very useful when testing your program, for example, because you can rapidly run tests using data stored on the local filesystem.
Reading Data from a Hadoop URL
One of the simplest ways to read a file from a Hadoop filesystem is by using a java.net.URL object to open a stream to read the data from. The general idiom is:
InputStream in = null;
try {
    in = new URL("hdfs://host/path").openStream();
    // process in
} finally {
    IOUtils.closeStream(in);
}
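The explicit try/finally with IOUtils.closeStream() predates Java 7's try-with-resources, which closes the stream automatically even if processing throws. Below is a minimal pure-JDK sketch of the same read-then-close idiom; the class name is invented, and a temporary local file stands in for the hdfs:// URL so it runs without a Hadoop cluster:

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;

public class StreamIdiom {
    // Read the full contents of a URL; try-with-resources closes the
    // stream automatically, replacing the explicit finally block.
    static String readAll(URL url) throws IOException {
        try (InputStream in = url.openStream()) {
            return new String(in.readAllBytes());
        }
    }

    public static void main(String[] args) throws IOException {
        // A throwaway local file stands in for hdfs://host/path.
        Path tmp = Files.createTempFile("urlcat-demo", ".txt");
        Files.writeString(tmp, "some data");
        System.out.println(readAll(tmp.toUri().toURL()));
    }
}
```

The structure is identical to the idiom above: open a stream from a URL, process it, and guarantee it is closed on every exit path.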
There's a little bit more work required to make Java recognize Hadoop's hdfs URL scheme. This is achieved by calling the setURLStreamHandlerFactory() method on URL with an instance of FsUrlStreamHandlerFactory. This method can be called only once per JVM, so it is typically executed in a static block. This limitation means that if some other part of your program (perhaps a third-party component outside your control) sets a URLStreamHandlerFactory, you won't be able to use this approach for reading data from Hadoop. The next section discusses an alternative.
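To see why this mechanism is JVM-global, it helps to look at what FsUrlStreamHandlerFactory does in plain JDK terms: URL.setURLStreamHandlerFactory() installs a single process-wide factory that maps scheme names to handlers. The sketch below registers a factory for a made-up demo scheme using only the JDK; the class name, scheme, and returned content are invented for illustration:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;
import java.net.URLStreamHandler;

public class DemoSchemeHandler {
    static {
        // Can be called only once per JVM -- the same constraint that
        // applies to FsUrlStreamHandlerFactory registration.
        URL.setURLStreamHandlerFactory(protocol ->
            "demo".equals(protocol) ? new URLStreamHandler() {
                @Override
                protected URLConnection openConnection(URL u) {
                    return new URLConnection(u) {
                        @Override public void connect() {}
                        @Override public InputStream getInputStream() {
                            // Serve fixed in-memory bytes for any demo:// URL.
                            return new ByteArrayInputStream("hello".getBytes());
                        }
                    };
                }
            } : null); // null defers to the built-in http, file, ... handlers
    }

    static String read(String url) throws IOException {
        try (InputStream in = new URL(url).openStream()) {
            return new String(in.readAllBytes());
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(read("demo://host/path"));
    }
}
```

Returning null from the factory for unknown schemes is what lets the built-in handlers keep working; a factory that claimed every scheme would break ordinary http and file URLs for the whole process.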
Example 3-1 shows a program for displaying files from Hadoop filesystems on standard output, like the Unix cat command.
Example 3-1. Displaying files from a Hadoop filesystem on standard output using a URLStreamHandler

public class URLCat {

    static {
        URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory());
    }

    public static void main(String[] args) throws Exception {
        InputStream in = null;