// The complete record class: it implements both Writable (for Hadoop's
// internal serialization) and DBWritable (for JDBC I/O). The class name
// TimestampedRecord is illustrative.
public class TimestampedRecord implements Writable, DBWritable {
    int id;
    long timestamp;
    // Writable: serialize to Hadoop's internal format
    public void write(DataOutput out) throws IOException {
        out.writeInt(id);
        out.writeLong(timestamp);
    }
    // Writable: deserialize from Hadoop's internal format
    public void readFields(DataInput in) throws IOException {
        id = in.readInt();
        timestamp = in.readLong();
    }
    // DBWritable: bind the fields to the columns of an INSERT statement
    public void write(PreparedStatement statement) throws SQLException {
        statement.setInt(1, id);
        statement.setLong(2, timestamp);
    }
    // DBWritable: populate the fields from a row of a query result
    public void readFields(ResultSet resultSet) throws SQLException {
        id = resultSet.getInt(1);
        timestamp = resultSet.getLong(2);
    }
}
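A class like this plugs straight into Hadoop's DBOutputFormat. The following driver is a minimal sketch against the classic org.apache.hadoop.mapred API; the driver and mapper class names, the JDBC connection settings, the "id<TAB>timestamp" input line format, and the events table are all illustrative assumptions:

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.lib.db.DBConfiguration;
import org.apache.hadoop.mapred.lib.db.DBOutputFormat;

public class EventUploadDriver {

    // Hypothetical mapper: parses each "id<TAB>timestamp" input line
    // into a TimestampedRecord to be written to the database
    public static class EventMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, TimestampedRecord, NullWritable> {
        public void map(LongWritable offset, Text line,
                        OutputCollector<TimestampedRecord, NullWritable> output,
                        Reporter reporter) throws IOException {
            String[] fields = line.toString().split("\t");
            TimestampedRecord record = new TimestampedRecord();
            record.id = Integer.parseInt(fields[0]);
            record.timestamp = Long.parseLong(fields[1]);
            output.collect(record, NullWritable.get());
        }
    }

    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(EventUploadDriver.class);
        // JDBC driver class, connection URL, and credentials are placeholders
        DBConfiguration.configureDB(conf, "com.mysql.jdbc.Driver",
                "jdbc:mysql://dbhost/mydb", "user", "password");
        // Route output to the "events" table; the column list matches the
        // setInt/setLong bindings in write(PreparedStatement) above
        DBOutputFormat.setOutput(conf, "events", "id", "timestamp");
        conf.setMapperClass(EventMapper.class);
        conf.setNumReduceTasks(0);  // map tasks write directly to the DB
        conf.setOutputKeyClass(TimestampedRecord.class);
        conf.setOutputValueClass(NullWritable.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        JobClient.runJob(conf);
    }
}

Note that DBOutputFormat commits each record through the PreparedStatement bindings defined in write(PreparedStatement), so every task holds its own connection to the database.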
We want to emphasize again that reading and writing to databases from within Hadoop is only appropriate for data sets that are relatively small by Hadoop standards. Unless your database setup is as parallel as Hadoop (which can be the case if your Hadoop cluster is relatively small while you have many shards in your database system), your DB will be the performance bottleneck, and you may not gain any scalability advantage from your Hadoop cluster. Oftentimes, it's better to bulk load data into a database rather than make direct writes from Hadoop. You'll need custom solutions for extremely large-scale databases.¹
7.5 Keeping all output in sorted order
The MapReduce framework guarantees that the input to each reducer is in sorted order by key. In many cases the reducer performs only a simple computation on the value part of each key/value pair, and the output then stays in sorted order as well. Keep in mind that the MapReduce framework does not guarantee the sorted order of the reducer output; it's a byproduct of the sorted input and the typical kinds of operations reducers perform.
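For instance, a reducer like the following (a sketch against the classic mapred API; the class name is illustrative) only aggregates values. Because keys are handed to it in sorted order and it emits one record per key as it goes, its output ends up sorted by key as a side effect:

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Sums the values for each key. The framework presents keys in sorted
// order, and one output record is emitted per input key, so the output
// file is also sorted by key.
public class SumReducer extends MapReduceBase
        implements Reducer<Text, LongWritable, Text, LongWritable> {
    public void reduce(Text key, Iterator<LongWritable> values,
                       OutputCollector<Text, LongWritable> output,
                       Reporter reporter) throws IOException {
        long sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        output.collect(key, new LongWritable(sum));
    }
}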
For some applications the sorted order is unnecessary, and the question sometimes comes up whether the sorting operation can be turned off to eliminate an unneeded step. The truth is that the sorting operation is not so much about enforcing sorted order on the reducer's input. Rather, sorting is an efficient way to group all records of the same key together. If the grouping function is unnecessary, then we can generate an output record directly from a single input record. In that case, you can improve performance by eliminating the entire reduce phase. You do this by setting the number of reducers to 0, making the application a map-only job, as in the sketch below.
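Here's a minimal map-only driver, again a sketch against the classic mapred API (the class name is illustrative). With zero reducers and the framework's default IdentityMapper, this job simply passes records through to the output, and the shuffle and sort steps are skipped entirely:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class MapOnlyJob {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(MapOnlyJob.class);
        conf.setJobName("map-only example");
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        // Zero reducers: map output is written straight to HDFS,
        // with no partitioning, sorting, or shuffling
        conf.setNumReduceTasks(0);
        JobClient.runJob(conf);
    }
}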
¹ LinkedIn has an interesting blog post on challenges faced in moving massive amounts of data resulting from offline processes (i.e., Hadoop) into live systems: http://project-voldemort.com/blog/2009/06/building-a-1-tb-data-cycle-at-linkedin-with-hadoop-and-project-voldemort/.
 