If you specify the listblocks parameter, http://datanode:50075/blockScannerReport?listblocks, the report is preceded by a list of all the blocks on the datanode along with their latest verification status. Here is a snippet of the block list (lines are split to fit the page):
blk_6035596358209321442 : status : ok type : none scan time :
    0 not yet verified
blk_3065580480714947643 : status : ok type : remote scan time :
    1215755306400 2008-07-11 05:48:26,400
blk_8729669677359108508 : status : ok type : local scan time :
    1215755727345 2008-07-11 05:55:27,345
The first column is the block ID, followed by some key-value pairs. The status can be one of failed or ok, according to whether the last scan of the block detected a checksum error. The type of scan is local if it was performed by the background thread, remote if it was performed by a client or a remote datanode, or none if a scan of this block has yet to be made. The last piece of information is the scan time, which is displayed as the number of milliseconds since midnight on January 1, 1970, and also as a more readable value.
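The fields described above can be pulled out of a block-list entry with a few lines of Python. This is a minimal sketch, not part of Hadoop: the function names parse_block_line and format_scan_time are hypothetical, the pattern is inferred only from the snippet shown here, and each entry is assumed to have been rejoined onto a single line first. The readable timestamps in the snippet happen to match UTC, so the sketch formats in UTC; a datanode configured for another time zone may print a different local time.

```python
import re
from datetime import datetime, timezone

# Pattern inferred from the sample entries above (an assumption, not a
# documented format). Each entry must already be on a single line.
BLOCK_LINE = re.compile(
    r"^(?P<block_id>blk_-?\d+)\s*:\s*status\s*:\s*(?P<status>\w+)\s+"
    r"type\s*:\s*(?P<scan_type>\w+)\s+scan time\s*:\s*(?P<scan_ms>\d+)"
)


def format_scan_time(ms):
    """Render milliseconds since the epoch as 'YYYY-MM-DD HH:MM:SS,mmm' in UTC."""
    seconds, millis = divmod(ms, 1000)
    dt = datetime.fromtimestamp(seconds, tz=timezone.utc)
    return dt.strftime("%Y-%m-%d %H:%M:%S") + ",%03d" % millis


def parse_block_line(line):
    """Return a dict of the fields in one block-list entry, or None if no match."""
    match = BLOCK_LINE.match(line.strip())
    if match is None:
        return None
    scan_ms = int(match.group("scan_ms"))
    return {
        "block_id": match.group("block_id"),
        "status": match.group("status"),        # "ok" or "failed"
        "scan_type": match.group("scan_type"),  # "local", "remote", or "none"
        "scan_ms": scan_ms,
        # A scan time of 0 means the block has not yet been verified.
        "scan_time": format_scan_time(scan_ms) if scan_ms > 0 else None,
    }


entry = parse_block_line(
    "blk_3065580480714947643 : status : ok type : remote scan time : "
    "1215755306400 2008-07-11 05:48:26,400"
)
print(entry["scan_type"], entry["scan_time"])
```

Filtering the parsed entries for a status of failed would give a quick list of blocks whose last scan detected a checksum error.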
Balancer
Over time, the distribution of blocks across datanodes can become unbalanced. An unbalanced cluster can affect locality for MapReduce, and it puts a greater strain on the highly utilized datanodes, so it's best avoided.
The balancer program is a Hadoop daemon that redistributes blocks by moving them from overutilized datanodes to underutilized datanodes, while adhering to the block replica placement policy that makes data loss unlikely by placing block replicas on different racks (see Replica Placement). It moves blocks until the cluster is deemed to be balanced, which means that the utilization of every datanode (ratio of used space on the node to total capacity of the node) differs from the utilization of the cluster (ratio of used space on the cluster to total capacity of the cluster) by no more than a given threshold percentage. You can start the balancer with:
% start-balancer.sh
The -threshold argument specifies the threshold percentage that defines what it means for the cluster to be balanced. The flag is optional; if omitted, the threshold is 10%. At any one time, only one balancer may be running on the cluster.
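The balance criterion above is simple arithmetic, and a short sketch can make it concrete. The following Python is an illustration of the definition only, not the balancer's actual implementation: the function names and the (used, capacity) input format are hypothetical, and the default of 10 mirrors the default threshold mentioned in the text.

```python
def utilization(used, capacity):
    """Utilization as a percentage: used space divided by total capacity."""
    return 100.0 * used / capacity


def unbalanced_nodes(nodes, threshold=10.0):
    """Return {name: utilization} for nodes outside the balance threshold.

    `nodes` maps a datanode name to a (used, capacity) pair in any
    consistent unit. A node counts as unbalanced when its utilization
    differs from the cluster-wide utilization by more than `threshold`
    percentage points.
    """
    total_used = sum(used for used, _ in nodes.values())
    total_capacity = sum(capacity for _, capacity in nodes.values())
    cluster_util = utilization(total_used, total_capacity)

    result = {}
    for name, (used, capacity) in nodes.items():
        node_util = utilization(used, capacity)
        if abs(node_util - cluster_util) > threshold:
            result[name] = node_util
    return result


# Hypothetical three-node cluster; cluster utilization is 180/300 = 60%.
cluster = {"dn1": (90, 100), "dn2": (60, 100), "dn3": (30, 100)}
print(sorted(unbalanced_nodes(cluster)))
```

In this made-up cluster, dn1 and dn3 each sit 30 percentage points away from the cluster utilization, so both fall outside the default threshold, while dn2 matches the cluster average exactly. An empty result corresponds to the state in which the cluster is deemed balanced.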