Data Integrity
Users of Hadoop rightly expect that no data will be lost or corrupted during storage or processing. However, because every I/O operation on the disk or network carries with it a small chance of introducing errors into the data that it is reading or writing, when the volumes of data flowing through the system are as large as the ones Hadoop is capable of handling, the chance of data corruption occurring is high.
The usual way of detecting corrupted data is by computing a checksum for the data when it first enters the system, and again whenever it is transmitted across a channel that is unreliable and hence capable of corrupting the data. The data is deemed to be corrupt if the newly generated checksum doesn't exactly match the original. This technique doesn't offer any way to fix the data; it is merely error detection. (And this is a reason for not using low-end hardware; in particular, be sure to use ECC memory.) Note that it is possible that it's the checksum that is corrupt, not the data, but this is very unlikely, because the checksum is much smaller than the data.
A commonly used error-detecting code is CRC-32 (32-bit cyclic redundancy check), which computes a 32-bit integer checksum for input of any size. CRC-32 is used for checksumming in Hadoop's ChecksumFileSystem, while HDFS uses a more efficient variant called CRC-32C.
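To make the checksum idea concrete, here is a minimal sketch (not Hadoop's own code) that computes a CRC-32C checksum over a chunk of bytes using the JDK's java.util.zip.CRC32C class, available since Java 9; Hadoop ships its own CRC implementations internally.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32C;

// Sketch: compute a CRC-32C checksum over one chunk of data,
// analogous to what HDFS does for each checksummed chunk.
public class Crc32cExample {
    public static void main(String[] args) {
        byte[] data = "example chunk of data".getBytes(StandardCharsets.UTF_8);

        CRC32C crc = new CRC32C();
        crc.update(data, 0, data.length);

        // The checksum is a 32-bit value, i.e. 4 bytes per chunk.
        System.out.printf("CRC-32C = 0x%08X%n", crc.getValue());
    }
}
```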
Data Integrity in HDFS
HDFS transparently checksums all data written to it and by default verifies checksums when reading data. A separate checksum is created for every dfs.bytes-per-checksum bytes of data. The default is 512 bytes, and because a CRC-32C checksum is 4 bytes long, the storage overhead is less than 1% (4 bytes of checksum per 512 bytes of data, or about 0.8%).
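The following sketch, which assumes the Hadoop client libraries are on the classpath, reads the dfs.bytes-per-checksum setting from the client configuration and works out the resulting storage overhead.

```java
import org.apache.hadoop.conf.Configuration;

// Sketch: read the chunk size that HDFS checksums and compute the
// storage overhead of the per-chunk CRC-32C checksum.
public class ChecksumOverhead {
    public static void main(String[] args) {
        Configuration conf = new Configuration();

        // dfs.bytes-per-checksum defaults to 512 bytes per checksummed chunk.
        int bytesPerChecksum = conf.getInt("dfs.bytes-per-checksum", 512);
        int checksumSize = 4; // a CRC-32C checksum is 4 bytes long

        double overheadPercent = 100.0 * checksumSize / bytesPerChecksum;
        System.out.printf("Checksum overhead: %.2f%%%n", overheadPercent); // ~0.78%
    }
}
```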
Datanodes are responsible for verifying the data they receive before storing the data and its checksum. This applies to data that they receive from clients and from other datanodes during replication. A client writing data sends it to a pipeline of datanodes (as explained in Chapter 3), and the last datanode in the pipeline verifies the checksum. If the datanode detects an error, the client receives a subclass of IOException, which it should handle in an application-specific manner (for example, by retrying the operation).
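As a rough illustration of that handling, the sketch below retries a write a few times when a checksum error is reported. It assumes a reachable HDFS cluster; the path and retry count are hypothetical, and org.apache.hadoop.fs.ChecksumException is used as one example of such an IOException subclass.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.ChecksumException;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: retry a write when the pipeline reports a checksum error.
public class RetryingWrite {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/tmp/example.txt"); // hypothetical path
        byte[] payload = "some data".getBytes(StandardCharsets.UTF_8);

        int attempts = 0;
        while (true) {
            try (FSDataOutputStream out = fs.create(path, true)) {
                out.write(payload);
                break; // write succeeded
            } catch (ChecksumException e) {
                // A checksum mismatch was detected; handle it in an
                // application-specific way -- here, by simply retrying.
                if (++attempts >= 3) {
                    throw e;
                }
            }
        }
    }
}
```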
When clients read data from datanodes, they verify checksums as well, comparing them with the ones stored at the datanodes. Each datanode keeps a persistent log of checksum verifications, so it knows the last time each of its blocks was verified. When a client successfully verifies a block, it tells the datanode, which updates its log. Keeping statistics such as these is valuable in detecting bad disks.
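From the client's point of view, this verification is transparent: an ordinary read is enough to trigger it. The sketch below, again assuming a reachable cluster and a hypothetical path, simply opens and reads a file; the checksum comparison happens under the covers as the data streams in.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: a plain read, during which the client verifies checksums
// against those stored at the datanodes.
public class VerifiedRead {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/tmp/example.txt"); // hypothetical path

        try (FSDataInputStream in = fs.open(path)) {
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) > 0) {
                System.out.write(buf, 0, n);
            }
        }

        // FileSystem also exposes setVerifyChecksum(false) to skip
        // client-side verification for this FileSystem instance.
    }
}
```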