The Java Interface
In this section, we dig into the Hadoop FileSystem class: the API for interacting with
one of Hadoop's filesystems.[30] Although we focus mainly on the HDFS implementation,
DistributedFileSystem, in general you should strive to write your code against the
FileSystem abstract class, to retain portability across filesystems. This is very useful
when testing your program, for example, because you can rapidly run tests using data
stored on the local filesystem.
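For example, here is a minimal sketch (the host and paths are placeholders) showing that
code written against FileSystem can target the local filesystem or HDFS simply by
changing the URI scheme:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

Configuration conf = new Configuration();
// The URI scheme selects the implementation; the calling code is identical.
FileSystem localFs = FileSystem.get(URI.create("file:///tmp/sample.txt"), conf);
FileSystem hdfs = FileSystem.get(URI.create("hdfs://host/user/sample.txt"), conf);

A test that exercises localFs can later run unchanged against hdfs, which is what makes
developing against the abstract class so convenient.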
Reading Data from a Hadoop URL
One of the simplest ways to read a file from a Hadoop filesystem is by using a
java.net.URL object to open a stream to read the data from. The general idiom is:
InputStream in = null;
try {
  in = new URL("hdfs://host/path").openStream();
  // process in
} finally {
  IOUtils.closeStream(in);
}
There's a little bit more work required to make Java recognize Hadoop's hdfs URL
scheme. This is achieved by calling the setURLStreamHandlerFactory() method
on URL with an instance of FsUrlStreamHandlerFactory. This method can be
called only once per JVM, so it is typically executed in a static block. This limitation
means that if some other part of your program (perhaps a third-party component outside
your control) sets a URLStreamHandlerFactory, you won't be able to use this approach
for reading data from Hadoop. The next section discusses an alternative.
Example 3-1 shows a program for displaying files from Hadoop filesystems on standard
output, like the Unix cat command.
Example 3-1. Displaying files from a Hadoop filesystem on standard output using a
URLStreamHandler
import java.io.InputStream;
import java.net.URL;

import org.apache.hadoop.fs.FsUrlStreamHandlerFactory;
import org.apache.hadoop.io.IOUtils;

public class URLCat {

  static {
    URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory());
  }

  public static void main(String[] args) throws Exception {
    InputStream in = null;
    try {
      in = new URL(args[0]).openStream();
      // Copy the stream to standard output with a 4 KB buffer;
      // the final argument tells copyBytes not to close the streams.
      IOUtils.copyBytes(in, System.out, 4096, false);
    } finally {
      IOUtils.closeStream(in);
    }
  }
}
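With the class compiled and on Hadoop's classpath, a run might look something like the
following (the jar name, host, and file path here are placeholders, not fixed values):

% export HADOOP_CLASSPATH=urlcat.jar
% hadoop URLCat hdfs://localhost/user/tom/sample.txt

The hadoop command takes a class name and runs it with the Hadoop libraries already on
the classpath, so FsUrlStreamHandlerFactory and IOUtils resolve without further setup.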