// The complete record class: it implements both Writable (for Hadoop's
// internal serialization) and DBWritable (for JDBC I/O). The class name
// TimestampedRecord is illustrative.
public class TimestampedRecord implements Writable, DBWritable {
    int id;
    long timestamp;
    // Writable: serialize to Hadoop's internal format
    public void write(DataOutput out) throws IOException {
        out.writeInt(id);
        out.writeLong(timestamp);
    }
    // Writable: deserialize from Hadoop's internal format
    public void readFields(DataInput in) throws IOException {
        id = in.readInt();
        timestamp = in.readLong();
    }
    // DBWritable: bind the fields to the columns of an INSERT statement
    public void write(PreparedStatement statement) throws SQLException {
        statement.setInt(1, id);
        statement.setLong(2, timestamp);
    }
    // DBWritable: populate the fields from a row of a query result
    public void readFields(ResultSet resultSet) throws SQLException {
        id = resultSet.getInt(1);
        timestamp = resultSet.getLong(2);
    }
}
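A class like this plugs straight into Hadoop's DBOutputFormat. The following driver is a minimal sketch against the classic org.apache.hadoop.mapred API; the driver and mapper class names, the JDBC connection settings, the "id<TAB>timestamp" input line format, and the events table are all illustrative assumptions:

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.lib.db.DBConfiguration;
import org.apache.hadoop.mapred.lib.db.DBOutputFormat;

public class EventUploadDriver {

    // Hypothetical mapper: parses each "id<TAB>timestamp" input line
    // into a TimestampedRecord to be written to the database
    public static class EventMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, TimestampedRecord, NullWritable> {
        public void map(LongWritable offset, Text line,
                        OutputCollector<TimestampedRecord, NullWritable> output,
                        Reporter reporter) throws IOException {
            String[] fields = line.toString().split("\t");
            TimestampedRecord record = new TimestampedRecord();
            record.id = Integer.parseInt(fields[0]);
            record.timestamp = Long.parseLong(fields[1]);
            output.collect(record, NullWritable.get());
        }
    }

    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(EventUploadDriver.class);
        // JDBC driver class, connection URL, and credentials are placeholders
        DBConfiguration.configureDB(conf, "com.mysql.jdbc.Driver",
                "jdbc:mysql://dbhost/mydb", "user", "password");
        // Route output to the "events" table; the column list matches the
        // setInt/setLong bindings in write(PreparedStatement) above
        DBOutputFormat.setOutput(conf, "events", "id", "timestamp");
        conf.setMapperClass(EventMapper.class);
        conf.setNumReduceTasks(0);  // map tasks write directly to the DB
        conf.setOutputKeyClass(TimestampedRecord.class);
        conf.setOutputValueClass(NullWritable.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        JobClient.runJob(conf);
    }
}

Note that DBOutputFormat commits each record through the PreparedStatement bindings defined in write(PreparedStatement), so every task holds its own connection to the database.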
We want to emphasize again that reading and writing to databases from within Hadoop is only appropriate for data sets that are relatively small by Hadoop standards. Unless your database setup is as parallel as Hadoop (which can be the case if your Hadoop cluster is relatively small while you have many shards in your database system), your DB will be the performance bottleneck, and you may not gain any scalability advantage from your Hadoop cluster. Oftentimes, it's better to bulk load data into a database rather than make direct writes from Hadoop. You'll need custom solutions for extremely large-scale databases.¹
7.5 Keeping all output in sorted order
The MapReduce framework guarantees that the input to each reducer is in sorted order by key. In many cases the reducer performs only a simple computation on the value part of each key/value pair, and the output then stays in sorted order as well. Keep in mind that the MapReduce framework does not guarantee the sorted order of the reducer output; it's a byproduct of the sorted input and the typical kinds of operations reducers perform.
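For instance, a reducer like the following (a sketch against the classic mapred API; the class name is illustrative) only aggregates values. Because keys are handed to it in sorted order and it emits one record per key as it goes, its output ends up sorted by key as a side effect:

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Sums the values for each key. The framework presents keys in sorted
// order, and one output record is emitted per input key, so the output
// file is also sorted by key.
public class SumReducer extends MapReduceBase
        implements Reducer<Text, LongWritable, Text, LongWritable> {
    public void reduce(Text key, Iterator<LongWritable> values,
                       OutputCollector<Text, LongWritable> output,
                       Reporter reporter) throws IOException {
        long sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        output.collect(key, new LongWritable(sum));
    }
}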
For some applications the sorted order is unnecessary, and the question sometimes comes up whether the sorting operation can be turned off to eliminate an unneeded step. The truth is that the sorting operation is not so much about enforcing sorted order on the reducer's input. Rather, sorting is an efficient way to group all records of the same key together. If the grouping function is unnecessary, then we can generate an output record directly from a single input record. In that case, you can improve performance by eliminating the entire reduce phase. You do this by setting the number of reducers to 0, making the application a map-only job, as in the sketch below.
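Here's a minimal map-only driver, again a sketch against the classic mapred API (the class name is illustrative). With zero reducers and the framework's default IdentityMapper, this job simply passes records through to the output, and the shuffle and sort steps are skipped entirely:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class MapOnlyJob {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(MapOnlyJob.class);
        conf.setJobName("map-only example");
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        // Zero reducers: map output is written straight to HDFS,
        // with no partitioning, sorting, or shuffling
        conf.setNumReduceTasks(0);
        JobClient.runJob(conf);
    }
}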
¹ LinkedIn has an interesting blog post on challenges faced in moving massive amounts of data resulting from offline processes (i.e., Hadoop) into live systems: http://project-voldemort.com/blog/2009/06/building-a-1-tb-data-cycle-at-linkedin-with-hadoop-and-project-voldemort/.
 