rebuild the disk that failed. Nevertheless, this is equivalent to the case without failure prediction
and hence will not degrade rebuild performance. For simplicity, we will not make use of this
failure prediction feature in the rest of the chapter.
5.4.1 Sparing Scheme
To support automatic data rebuild, a dedicated spare disk is reserved to store data reconstructed
in the rebuild process. The spare disk is connected to the server at all times but is not used
during normal mode and degraded mode of operation. In this sparing scheme, the recomputed
data will be stored in the spare disk, which will replace the failed disk once the rebuild process
is completed. Note that human intervention is still required to replace the failed disk with
another spare disk to cater for another disk failure but this is less time-critical.
5.4.2 Rebuild Algorithm
The challenge of automatic rebuild is to proceed with the rebuild process without interrupting
user services. Specifically, all retrievals in a disk service round must finish within $T_r$ seconds
and the addition of rebuild requests must not violate this limit. Clearly, we can only utilize
unused disk capacity to serve rebuild requests. Once rebuild blocks from the surviving disks
are retrieved into memory, the server can then perform an erasure-correction computation to
reconstruct the lost media blocks and store them to the spare disk. This process repeats until all
the media blocks lost in the failed disk are reconstructed to the spare disk, which then simply
replaces the failed disk to bring the system back into normal mode of operation. The failed
disk will later be replaced or repaired manually and a new spare disk will be reinserted into
the system to prepare for the next rebuild cycle.
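The reconstruction step described above can be illustrated with a minimal Python sketch. It assumes a single XOR parity block per parity group, which is only one possible erasure-correction code; the text itself does not fix the code used. The function names `xor_blocks` and `rebuild_lost_block` and the sample blocks are hypothetical.

```python
from functools import reduce
from typing import Sequence


def xor_blocks(blocks: Sequence[bytes]) -> bytes:
    """XOR a sequence of equal-length blocks byte by byte."""
    if not blocks:
        raise ValueError("need at least one block")
    size = len(blocks[0])
    if any(len(b) != size for b in blocks):
        raise ValueError("all blocks must have the same length")
    # zip(*blocks) yields one tuple of byte values per byte position.
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


def rebuild_lost_block(surviving: Sequence[bytes]) -> bytes:
    """Reconstruct the one missing block of a parity group.

    Assumption: single XOR parity, so the missing block (data or parity)
    equals the XOR of all surviving blocks in the group.
    """
    return xor_blocks(surviving)


# Hypothetical parity group: three data blocks plus one XOR parity block.
data = [b"\x01\x02", b"\x10\x20", b"\xaa\xbb"]
parity = xor_blocks(data)

# Suppose the disk holding data[1] fails: rebuild its block from the
# remaining data blocks and the parity block.
rebuilt = rebuild_lost_block([data[0], data[2], parity])
assert rebuilt == data[1]
```

In a server, the rebuilt block would then be written to the spare disk, and the loop would repeat for every parity group using only the disk capacity left over in each service round.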
5.4.3 Analysis of Rebuild Time
A key performance metric in evaluating automatic data rebuild algorithms is rebuild time,
defined as the time required to completely rebuild data in the failed disk to the spare disk.
For a server with $N_D$ disks (one of which has failed) and one spare disk, the rebuild process
consists of reading $(N_D - 1)$ blocks for each parity group from the surviving $(N_D - 1)$ disks
and reconstructing the lost media block for storage in the spare disk. Note that this is true even
if the failed disk happens to be the parity disk because all $(N_D - 1)$ data blocks in a parity
group are required to recompute the parity block for storage in the spare disk.
Let $u$, $0 \le u \le K$, be the number of active streams in the server. We define a server utilization
$\rho$, $0 \le \rho \le 1$, as follows:

$$\rho = \frac{u}{K} \tag{5.19}$$
Now the number of rebuild blocks retrieved by a working disk in a service round, denoted
by $n_b$, will be given by

$$n_b = K - u \tag{5.20}$$
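As a quick numerical check of Equations (5.19) and (5.20), the Python sketch below computes the utilization $\rho = u/K$ and the per-round rebuild capacity $n_b = K - u$ for illustrative values. The function names, the sample values, and in particular the rough rebuild-time estimate are assumptions made for illustration and are not taken from the text. The estimate supposes that each lost block needs one block from every surviving disk, so that $n_b$ lost blocks are reconstructed per service round of $T_r$ seconds.

```python
import math


def utilization(u: int, K: int) -> float:
    """Server utilization rho = u / K, as in Equation (5.19)."""
    if K <= 0 or not 0 <= u <= K:
        raise ValueError("require K > 0 and 0 <= u <= K")
    return u / K


def rebuild_blocks_per_round(u: int, K: int) -> int:
    """Rebuild blocks per working disk per round, n_b = K - u (Equation 5.20)."""
    if K <= 0 or not 0 <= u <= K:
        raise ValueError("require K > 0 and 0 <= u <= K")
    return K - u


def estimated_rebuild_time(blocks_per_disk: int, u: int, K: int, T_r: float) -> float:
    """Rough estimate (an assumption, not the text's formula).

    Assumes n_b lost blocks are rebuilt per service round of T_r seconds,
    so the rebuild takes ceil(blocks_per_disk / n_b) rounds.
    """
    n_b = rebuild_blocks_per_round(u, K)
    if n_b == 0:
        return math.inf  # no spare capacity: the rebuild never progresses
    return math.ceil(blocks_per_disk / n_b) * T_r


# Hypothetical values: capacity of 10 streams, 6 of them active.
rho = utilization(6, 10)                               # 0.6
n_b = rebuild_blocks_per_round(6, 10)                  # 4
t_rebuild = estimated_rebuild_time(1000, 6, 10, 2.0)   # 250 rounds of 2 s
```

Under these assumed values a fully loaded server ($u = K$) has $n_b = 0$ and cannot make rebuild progress, which matches the observation that only unused disk capacity can serve rebuild requests.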