balancing. However, one downside of this disk-array-based storage is reliability. In particular,
failure of any one of the disks in the array will render the server inoperable due to data loss.
Worse still, reliability decreases further as more and more disks are added to scale up the
system capacity, thereby limiting the system's scalability.
This reliability problem has been investigated by many researchers in the last decade [1-8]
and a number of innovative solutions have been proposed and studied. While the exact methods
vary, the basic principle is the same: redundant data are added to the disks so that data lost
in a failed disk can be reconstructed in real time for delivery to the client.
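Although the cited schemes differ in data layout and redundancy code, the following minimal Python sketch illustrates the underlying parity idea: the block lost on a failed disk is recovered by XORing the surviving blocks of the same stripe. The block size and disk counts below are purely illustrative and are not taken from any of the cited works.

# Minimal sketch of XOR-parity reconstruction (illustrative only; the
# schemes in [1-8] differ in layout and detail). A stripe holds one block
# per disk, and the parity block is the bytewise XOR of the data blocks.

def parity(blocks):
    """Return the bytewise XOR of the given equal-sized blocks."""
    out = bytearray(len(blocks[0]))
    for blk in blocks:
        for i, b in enumerate(blk):
            out[i] ^= b
    return bytes(out)

def reconstruct(surviving_blocks):
    """Rebuild the block lost on the failed disk from the surviving
    data and parity blocks of the same stripe."""
    return parity(surviving_blocks)

# Example: 4 data disks plus 1 parity disk; disk 2 fails.
data = [bytes([d] * 8) for d in (1, 2, 3, 4)]
p = parity(data)
survivors = [data[0], data[1], data[3], p]   # disk 2's block is lost
assert reconstruct(survivors) == data[2]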
A media server operates in normal mode when there is no disk failure, and switches into
degraded mode operation once a disk has failed. While existing solutions (e.g., using RAID)
can sustain disk failure without service interruption, operating the media server under degraded
mode is still a temporary measure because additional disk failures will result in system failure
and permanent data loss. Therefore, the media server needs to initiate a rebuild mode to
reconstruct data lost in the failed disk and store them on a spare disk to bring the server back
to normal mode operation. Once the rebuild process is complete, the media server can sustain
another disk failure without total system failure or permanent data loss. This gives the system
operator more time to repair or replace the failed disk with a new spare disk.
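The mode transitions described above can be summarized by the following schematic Python sketch; the mode and event names are illustrative and are not taken from any particular implementation.

# Schematic sketch of the operating modes and their transitions
# (names are illustrative; a second failure outside normal mode
# results in total system failure and permanent data loss).

from enum import Enum, auto

class Mode(Enum):
    NORMAL = auto()     # no disk failure
    DEGRADED = auto()   # one disk failed; lost data served via reconstruction
    REBUILD = auto()    # lost data being reconstructed onto the spare disk

def next_mode(mode, event):
    transitions = {
        (Mode.NORMAL,   "disk_failure"):     Mode.DEGRADED,
        (Mode.DEGRADED, "rebuild_started"):  Mode.REBUILD,
        (Mode.REBUILD,  "rebuild_complete"): Mode.NORMAL,
    }
    return transitions.get((mode, event), mode)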
It is worth noting that today's hard disks generally have fairly long mean-time-between-
failure (MTBF) ratings, ranging from 300,000 hours to nearly 1,000,000 hours depending on
the disk model. Consider a media server with 16 disks (including one parity disk) plus a spare
disk. The MTBF for the disk array, computed using a formula derived by Chen et al. [9], is over
42,000 years if the rebuild time is one hour, and 4,200 years if the rebuild time is ten hours.
While an MTBF of 4,200 years may appear sufficient, Chen et al. [9] also pointed out that
the computed MTBF should be interpreted conservatively because disk failures in practice are not
necessarily independent, and hence the likelihood of a second disk failure could be much higher
after the first disk failure. As the disk array MTBF is inversely proportional to the rebuild time,
it is important to rebuild the failed disk quickly to prevent total system failure.
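To make the arithmetic concrete, the figures above are consistent with the approximation MTTF_array = MTTF_disk^2 / (N (N - 1) MTTR) for an N-disk array that tolerates a single failure, evaluated with an assumed per-disk MTBF of 300,000 hours and N = 16. The Python sketch below simply reproduces the quoted numbers under these assumptions.

# Sketch of the disk-array MTTF approximation attributed to Chen et al. [9]:
#   MTTF_array = MTTF_disk^2 / (N * (N - 1) * MTTR)
# The per-disk MTBF of 300,000 hours is an assumed value chosen to
# reproduce the figures quoted in the text.

HOURS_PER_YEAR = 8760

def array_mttf_years(mtbf_disk_hours, num_disks, rebuild_hours):
    mttf_hours = mtbf_disk_hours ** 2 / (num_disks * (num_disks - 1) * rebuild_hours)
    return mttf_hours / HOURS_PER_YEAR

print(array_mttf_years(300_000, 16, 1))   # about 42,800 years (1-hour rebuild)
print(array_mttf_years(300_000, 16, 10))  # about 4,280 years (10-hour rebuild)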
This chapter addresses this problem by investigating efficient rebuild algorithms to rebuild
the failed disk automatically and transparently in a media server serving constant-bit-rate
(CBR) media streams. Automatic refers to the fact that the rebuild process does not require hu-
man intervention such as locating and loading a back-up tape to restore data. Transparent refers
to the fact that the rebuild process itself can operate without any adverse effect on existing users.
The rest of this chapter is organized as follows. Section 5.2 reviews previous work;
Section 5.3 presents and formulates the system model studied in this chapter; Section 5.4
presents and analyzes a block-based rebuild algorithm; Section 5.5 presents and analyzes a
track-based rebuild algorithm; Section 5.6 presents a pipelined rebuild algorithm to reduce
buffer requirement in track-based rebuild; Section 5.7 compares the presented algorithms
quantitatively using numerical results; and Section 5.8 summarizes the chapter and discusses
some directions for future work.
5.2 Background
The problem of supporting degraded mode operation in media servers has been investi-
gated by a number of researchers [1-8]. One approach makes use of data replication, such
as mirroring, to sustain disk failure. The idea is to place two or more replicas in different