Reliable and Fault-Tolerant Storage Systems - Scalable Continuous Media Streaming Systems


rebuild the disk that failed. Nevertheless, this is equivalent to the case without failure prediction and hence will not degrade rebuild performance. For simplicity, we will not make use of this failure prediction feature in the rest of the chapter.

5.4.1 Sparing Scheme

To support automatic data rebuild, a dedicated spare disk is reserved to store the data reconstructed in the rebuild process. The spare disk is connected to the server at all times but is not used during the normal and degraded modes of operation. In this sparing scheme, the recomputed data are stored on the spare disk, which replaces the failed disk once the rebuild process is completed. Note that human intervention is still required to replace the failed disk with another spare disk to cater for a subsequent disk failure, but this is less time-critical.

5.4.2 Rebuild Algorithm

The challenge of automatic rebuild is to proceed with the rebuild process without interrupting user services. Specifically, all retrievals in a disk service round must finish within $T_r$ seconds and the addition of rebuild requests must not violate this limit. Clearly, we can only utilize unused disk capacity to serve rebuild requests. Once rebuild blocks from the surviving disks are retrieved into memory, the server can then perform an erasure-correction computation to reconstruct the lost media blocks and store them to the spare disk. This process repeats until all the media blocks lost in the failed disk are reconstructed to the spare disk, which then simply replaces the failed disk to bring the system back into normal mode of operation. The failed disk will later be replaced or repaired manually and a new spare disk will be reinserted into the system to prepare for the next rebuild cycle.
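
The loop described above can be sketched as follows. This is a minimal illustration, not the chapter's implementation: it assumes single XOR parity as the erasure-correction code, a hypothetical layout in which block i on every disk belongs to parity group i, and a caller-supplied list giving how many rebuild blocks each surviving disk can retrieve with its unused capacity in each service round. The names `xor_blocks` and `rebuild` are hypothetical.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together (assumed single-parity code)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild(surviving_disks, idle_slots_per_round):
    """Rebuild the failed disk's blocks onto a spare, one service round at a time.

    surviving_disks: list of disks, each a list of equal-length byte blocks;
        block i on every disk belongs to parity group i (hypothetical layout).
    idle_slots_per_round: for each service round, the number of rebuild blocks
        a surviving disk can retrieve using capacity not needed by streams.
    Returns (spare, rounds_used); spare is incomplete if the rounds run out.
    """
    total_groups = len(surviving_disks[0])
    spare = []
    rounds_used = 0
    for idle in idle_slots_per_round:
        if len(spare) == total_groups:
            break  # every lost block has been reconstructed
        rounds_used += 1
        # Only unused capacity serves rebuild requests, so stream
        # retrievals in this round are not delayed.
        start = len(spare)
        for group in range(start, min(start + idle, total_groups)):
            survivors = [disk[group] for disk in surviving_disks]
            spare.append(xor_blocks(survivors))
    return spare, rounds_used


# Usage: three data disks plus one parity disk, five parity groups.
data = [[bytes([d * 16 + g, g]) for g in range(5)] for d in range(3)]
parity = [xor_blocks([disk[g] for disk in data]) for g in range(5)]
disks = data + [parity]
failed = disks.pop(1)  # simulate the failure of one data disk
spare, rounds = rebuild(disks, [2, 0, 2, 2, 2])
```

In the usage example the second round has no idle capacity, so no rebuild blocks are retrieved in it and the rebuild simply takes one round longer.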

5.4.3 Analysis of Rebuild Time

A key performance metric in evaluating automatic data rebuild algorithms is rebuild time, defined as the time required to completely rebuild data in the failed disk to the spare disk. For a server with $N_D$ disks (one of which has failed) and one spare disk, the rebuild process consists of reading $(N_D - 1)$ blocks for each parity group from the surviving $(N_D - 1)$ disks and reconstructing the lost media block for storage in the spare disk. Note that this is true even if the failed disk happens to be the parity disk, because all $(N_D - 1)$ data blocks in a parity group are required to recompute the parity block for storage in the spare disk.
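
As a hedged illustration of why both cases read the same number of blocks, assume the erasure-correction code is single XOR parity (the text says only "erasure-correction computation", so XOR is an assumption here). Write $d_1, \dots, d_{N_D-1}$ for the data blocks of one parity group and $p$ for its parity block; these symbols are introduced for illustration only. Recomputing a lost parity block and recovering a lost data block $d_j$ then each combine exactly $(N_D - 1)$ surviving blocks:

$$
p = d_1 \oplus d_2 \oplus \cdots \oplus d_{N_D-1},
\qquad
d_j = p \oplus \bigoplus_{i \ne j} d_i .
$$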

Let $u$, $0 \le u \le K$, be the number of active streams in the server. We define a server utilization $\rho$, $0 \le \rho \le 1$, as follows:

$$
\rho = \frac{u}{K} \qquad (5.19)
$$

Now the number of rebuild blocks retrieved by a working disk in a service round, denoted by $n_b$, will be given by

$$
n_b = K - u \qquad (5.20)
$$
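
Equations (5.19) and (5.20) can be evaluated directly. The short sketch below is illustrative only; the function names and the sample values (K = 40, u = 30) are assumptions chosen for the example, not figures from the chapter.

```python
def utilization(u, k):
    """Server utilization rho = u / K, as in Equation (5.19)."""
    if k <= 0 or not 0 <= u <= k:
        raise ValueError("require K > 0 and 0 <= u <= K")
    return u / k


def rebuild_blocks_per_round(u, k):
    """Rebuild blocks one working disk retrieves per round, n_b = K - u (5.20)."""
    if k <= 0 or not 0 <= u <= k:
        raise ValueError("require K > 0 and 0 <= u <= K")
    return k - u


# Hypothetical operating point: 30 active streams out of K = 40.
rho = utilization(30, 40)
n_b = rebuild_blocks_per_round(30, 40)
```

Substituting $u = \rho K$ from Equation (5.19) into Equation (5.20) gives the equivalent form $n_b = K(1 - \rho)$, so a fully loaded server ($\rho = 1$) has no capacity left for rebuild requests.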
