balancing. However, one downside of this disk-array-based storage is reliability. In particular,
failure of any one of the disks in the array will render the server inoperable due to data loss.
Worse still, reliability decreases further as more and more disks are added to scale up the
system capacity, thereby limiting the system's scalability.
This reliability problem has been investigated by many researchers in the last decade [1-8]
and a number of innovative solutions have been proposed and studied. While the exact methods
vary, the basic principle is the same: redundant data are added to the disks so that data lost
in a failed disk can be reconstructed in real time for delivery to the client.
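Although the cited schemes differ in data layout and redundancy code, the following minimal Python sketch illustrates the underlying parity idea: the block lost on a failed disk is recovered by XORing the surviving blocks of the same stripe. The block size and disk counts below are purely illustrative and are not taken from any of the cited works.

# Minimal sketch of XOR-parity reconstruction (illustrative only; the
# schemes in [1-8] differ in layout and detail). A stripe holds one block
# per disk, and the parity block is the bytewise XOR of the data blocks.

def parity(blocks):
    """Return the bytewise XOR of the given equal-sized blocks."""
    out = bytearray(len(blocks[0]))
    for blk in blocks:
        for i, b in enumerate(blk):
            out[i] ^= b
    return bytes(out)

def reconstruct(surviving_blocks):
    """Rebuild the block lost on the failed disk from the surviving
    data and parity blocks of the same stripe."""
    return parity(surviving_blocks)

# Example: 4 data disks plus 1 parity disk; disk 2 fails.
data = [bytes([d] * 8) for d in (1, 2, 3, 4)]
p = parity(data)
survivors = [data[0], data[1], data[3], p]   # disk 2's block is lost
assert reconstruct(survivors) == data[2]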
A media server operates in normal mode when there is no disk failure, and switches into
degraded mode operation once a disk has failed. While existing solutions (e.g., using RAID)
can sustain disk failure without service interruption, operating the media server under degraded
mode is still a temporary measure because additional disk failures will result in system failure
and permanent data loss. Therefore, the media server needs to initiate a rebuild mode to
reconstruct data lost in the failed disk and store them on a spare disk to bring the server back
to normal mode operation. Once the rebuild process is complete, the media server can sustain
another disk failure without total system failure or permanent data loss. This gives the system
operator more time to repair or replace the failed disk with a new spare disk.
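The mode transitions described above can be summarized by the following schematic Python sketch; the mode and event names are illustrative and are not taken from any particular implementation.

# Schematic sketch of the operating modes and their transitions
# (names are illustrative; a second failure outside normal mode
# results in total system failure and permanent data loss).

from enum import Enum, auto

class Mode(Enum):
    NORMAL = auto()     # no disk failure
    DEGRADED = auto()   # one disk failed; lost data served via reconstruction
    REBUILD = auto()    # lost data being reconstructed onto the spare disk

def next_mode(mode, event):
    transitions = {
        (Mode.NORMAL,   "disk_failure"):     Mode.DEGRADED,
        (Mode.DEGRADED, "rebuild_started"):  Mode.REBUILD,
        (Mode.REBUILD,  "rebuild_complete"): Mode.NORMAL,
    }
    return transitions.get((mode, event), mode)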
It is worth noting that today's hard disks generally have fairly long mean-time-between-
failure (MTBF) ratings, ranging from 300,000 hours to nearly 1,000,000 hours depending on
the disk model. Consider a media server with 16 disks (including one parity disk) plus a spare
disk. The MTBF for the disk array, computed using a formula derived by Chen et al. [9], is over
42,000 years if the rebuild time is one hour, and 4,200 years if the rebuild time is ten hours.
While an MTBF of 4,200 years may appear sufficient, Chen et al. [9] also pointed out that
the computed MTBF should be interpreted conservatively because disk failures in practice are not
necessarily independent, and hence the likelihood of a second disk failure could be much higher
after the first disk failure. As the disk array MTBF is inversely proportional to the rebuild time,
it is important to rebuild the failed disk quickly to prevent total system failure.
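To make the arithmetic concrete, the figures above are consistent with the approximation MTTF_array = MTTF_disk^2 / (N (N - 1) MTTR) for an N-disk array that tolerates a single failure, evaluated with an assumed per-disk MTBF of 300,000 hours and N = 16. The Python sketch below simply reproduces the quoted numbers under these assumptions.

# Sketch of the disk-array MTTF approximation attributed to Chen et al. [9]:
#   MTTF_array = MTTF_disk^2 / (N * (N - 1) * MTTR)
# The per-disk MTBF of 300,000 hours is an assumed value chosen to
# reproduce the figures quoted in the text.

HOURS_PER_YEAR = 8760

def array_mttf_years(mtbf_disk_hours, num_disks, rebuild_hours):
    mttf_hours = mtbf_disk_hours ** 2 / (num_disks * (num_disks - 1) * rebuild_hours)
    return mttf_hours / HOURS_PER_YEAR

print(array_mttf_years(300_000, 16, 1))   # about 42,800 years (1-hour rebuild)
print(array_mttf_years(300_000, 16, 10))  # about 4,280 years (10-hour rebuild)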
This chapter addresses this problem by investigating efficient rebuild algorithms to rebuild
the failed disk automatically and transparently in a media server serving constant-bit-rate
(CBR) media streams. Automatic refers to the fact that the rebuild process does not require hu-
man intervention such as locating and loading a back-up tape to restore data. Transparent refers
to the fact that the rebuild process itself can operate without any adverse effect on existing users.
The rest of this chapter is organized as follows. Section 5.2 reviews previous work;
Section 5.3 presents and formulates the system model studied in this chapter; Section 5.4
presents and analyzes a block-based rebuild algorithm; Section 5.5 presents and analyzes a
track-based rebuild algorithm; Section 5.6 presents a pipelined rebuild algorithm to reduce
buffer requirement in track-based rebuild; Section 5.7 compares the presented algorithms
quantitatively using numerical results; and Section 5.8 summarizes the chapter and discusses
some directions for future work.
5.2 Background
The problem of supporting degraded mode operation in media servers has been investi-
gated by a number of researchers [1-8]. One approach makes use of data replication, such
as mirroring, to sustain disk failure. The idea is to place two or more replicas in different