(so we can transparently read data written in an older format), and interoperable (so we
can read or write persistent data using different languages).
Hadoop uses its own serialization format, Writables, which is certainly compact and fast, but not so easy to extend or use from languages other than Java. Because Writables are central to Hadoop (most MapReduce programs use them for their key and value types), we look at them in some depth in the next three sections, before looking at some of the other serialization frameworks supported in Hadoop. Avro (a serialization system that was designed to overcome some of the limitations of Writables) is covered in Chapter 12.
The Writable Interface
The Writable interface defines two methods: one for writing its state to a DataOutput binary stream, and one for reading its state from a DataInput binary stream:
package org.apache.hadoop.io;

import java.io.DataOutput;
import java.io.DataInput;
import java.io.IOException;

public interface Writable {
  void write(DataOutput out) throws IOException;
  void readFields(DataInput in) throws IOException;
}
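To see the contract in action, here is a minimal, self-contained sketch of a class implementing these two methods and round-tripping its state through a byte array. The class and file names (SimpleIntWritable, WritableSketch) are illustrative, and the interface is redeclared locally so the sketch compiles without Hadoop on the classpath; in a real program you would implement org.apache.hadoop.io.Writable itself.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;

// Local stand-in for org.apache.hadoop.io.Writable, so this sketch
// compiles without Hadoop on the classpath.
interface Writable {
  void write(DataOutput out) throws IOException;
  void readFields(DataInput in) throws IOException;
}

// A toy Writable holding a single int, mirroring what IntWritable does.
class SimpleIntWritable implements Writable {
  private int value;

  public void set(int value) { this.value = value; }
  public int get() { return value; }

  public void write(DataOutput out) throws IOException {
    out.writeInt(value); // serialize state as four big-endian bytes
  }

  public void readFields(DataInput in) throws IOException {
    value = in.readInt(); // restore state from the stream
  }
}

public class WritableSketch {
  public static void main(String[] args) throws IOException {
    SimpleIntWritable original = new SimpleIntWritable();
    original.set(163);

    // Serialize into a byte array via a DataOutputStream.
    ByteArrayOutputStream bytesOut = new ByteArrayOutputStream();
    original.write(new DataOutputStream(bytesOut));
    byte[] bytes = bytesOut.toByteArray();

    // Deserialize into a fresh instance via a DataInputStream.
    SimpleIntWritable copy = new SimpleIntWritable();
    copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes)));

    System.out.println(bytes.length); // 4
    System.out.println(copy.get());   // 163
  }
}
```

Note that readFields() mutates an existing object rather than constructing a new one; this is deliberate, as it lets MapReduce reuse a single instance across many records instead of allocating one per record.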
Let's look at a particular Writable to see what we can do with it. We will use IntWritable, a wrapper for a Java int. We can create one and set its value using the set() method:
IntWritable writable = new IntWritable();
writable.set(163);
Equivalently, we can use the constructor that takes the integer value:
IntWritable writable = new IntWritable(163);
To examine the serialized form of the IntWritable, we write a small helper method that wraps a java.io.ByteArrayOutputStream in a java.io.DataOutputStream (an implementation of java.io.DataOutput) to capture the bytes in the serialized stream:
public static byte[] serialize(Writable writable) throws IOException {