Data Integrity
Users of Hadoop rightly expect that no data will be lost or corrupted during storage or processing. However, because every I/O operation on the disk or network carries with it a small chance of introducing errors into the data that it is reading or writing, when the volumes of data flowing through the system are as large as the ones Hadoop is capable of handling, the chance of data corruption occurring is high.
The usual way of detecting corrupted data is by computing a checksum for the data when it first enters the system, and again whenever it is transmitted across a channel that is unreliable and hence capable of corrupting the data. The data is deemed to be corrupt if the newly generated checksum doesn't exactly match the original. This technique doesn't offer any way to fix the data; it is merely error detection. (And this is a reason for not using low-end hardware; in particular, be sure to use ECC memory.) Note that it is possible that it's the checksum that is corrupt, not the data, but this is very unlikely, because the checksum is much smaller than the data.
A commonly used error-detecting code is CRC-32 (32-bit cyclic redundancy check), which computes a 32-bit integer checksum for input of any size. CRC-32 is used for checksumming in Hadoop's ChecksumFileSystem, while HDFS uses a more efficient variant called CRC-32C.
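To make the checksum idea concrete, here is a minimal sketch (not Hadoop's own code) that computes a CRC-32C checksum over a chunk of bytes using the JDK's java.util.zip.CRC32C class, available since Java 9; Hadoop ships its own CRC implementations internally.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32C;

// Sketch: compute a CRC-32C checksum over one chunk of data,
// analogous to what HDFS does for each checksummed chunk.
public class Crc32cExample {
    public static void main(String[] args) {
        byte[] data = "example chunk of data".getBytes(StandardCharsets.UTF_8);

        CRC32C crc = new CRC32C();
        crc.update(data, 0, data.length);

        // The checksum is a 32-bit value, i.e. 4 bytes per chunk.
        System.out.printf("CRC-32C = 0x%08X%n", crc.getValue());
    }
}
```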
Data Integrity in HDFS
HDFS transparently checksums all data written to it and by default verifies checksums when reading data. A separate checksum is created for every dfs.bytes-per-checksum bytes of data. The default is 512 bytes, and because a CRC-32C checksum is 4 bytes long, the storage overhead is less than 1% (4 bytes of checksum per 512 bytes of data, or about 0.8%).
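The following sketch, which assumes the Hadoop client libraries are on the classpath, reads the dfs.bytes-per-checksum setting from the client configuration and works out the resulting storage overhead.

```java
import org.apache.hadoop.conf.Configuration;

// Sketch: read the chunk size that HDFS checksums and compute the
// storage overhead of the per-chunk CRC-32C checksum.
public class ChecksumOverhead {
    public static void main(String[] args) {
        Configuration conf = new Configuration();

        // dfs.bytes-per-checksum defaults to 512 bytes per checksummed chunk.
        int bytesPerChecksum = conf.getInt("dfs.bytes-per-checksum", 512);
        int checksumSize = 4; // a CRC-32C checksum is 4 bytes long

        double overheadPercent = 100.0 * checksumSize / bytesPerChecksum;
        System.out.printf("Checksum overhead: %.2f%%%n", overheadPercent); // ~0.78%
    }
}
```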
Datanodes are responsible for verifying the data they receive before storing the data and its checksum. This applies to data that they receive from clients and from other datanodes during replication. A client writing data sends it to a pipeline of datanodes (as explained in Chapter 3), and the last datanode in the pipeline verifies the checksum. If the datanode detects an error, the client receives a subclass of IOException, which it should handle in an application-specific manner (for example, by retrying the operation).
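As a rough illustration of that handling, the sketch below retries a write a few times when a checksum error is reported. It assumes a reachable HDFS cluster; the path and retry count are hypothetical, and org.apache.hadoop.fs.ChecksumException is used as one example of such an IOException subclass.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.ChecksumException;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: retry a write when the pipeline reports a checksum error.
public class RetryingWrite {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/tmp/example.txt"); // hypothetical path
        byte[] payload = "some data".getBytes(StandardCharsets.UTF_8);

        int attempts = 0;
        while (true) {
            try (FSDataOutputStream out = fs.create(path, true)) {
                out.write(payload);
                break; // write succeeded
            } catch (ChecksumException e) {
                // A checksum mismatch was detected; handle it in an
                // application-specific way -- here, by simply retrying.
                if (++attempts >= 3) {
                    throw e;
                }
            }
        }
    }
}
```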
When clients read data from datanodes, they verify checksums as well, comparing them with the ones stored at the datanodes. Each datanode keeps a persistent log of checksum verifications, so it knows the last time each of its blocks was verified. When a client successfully verifies a block, it tells the datanode, which updates its log. Keeping statistics such as these is valuable in detecting bad disks.
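From the client's point of view, this verification is transparent: an ordinary read is enough to trigger it. The sketch below, again assuming a reachable cluster and a hypothetical path, simply opens and reads a file; the checksum comparison happens under the covers as the data streams in.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: a plain read, during which the client verifies checksums
// against those stored at the datanodes.
public class VerifiedRead {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/tmp/example.txt"); // hypothetical path

        try (FSDataInputStream in = fs.open(path)) {
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) > 0) {
                System.out.write(buf, 0, n);
            }
        }

        // FileSystem also exposes setVerifyChecksum(false) to skip
        // client-side verification for this FileSystem instance.
    }
}
```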