to avoid false alarms. We present below an admission-scheduler-based (ASB) protocol for
detecting server failures.
In our previous investigations, we found that incoming control requests could be delayed
for a substantial amount of time (e.g., more than one second) due to intense I/O activities at
the servers. Consequently, it would be more difficult to implement server-based fault-detection
protocols that can quickly detect a failure. This motivates us to implement fault-detection
at the admission scheduler rather than at the servers. The admission scheduler was originally
introduced to tackle the uneven group assignment problem arising from server clock jitter (cf.
Section 10.4). For fault detection, we extend the admission scheduler to simulate a video client.
Unlike a real video client, however, the admission scheduler simply discards received video
data after bookkeeping is done, never performs any interactive control, and never terminates
the stream (until system shutdown). At the servers, video data destined
for the admission scheduler are not retrieved from the disks, but rather generated on-the-fly.
Since the generated video data will not be interpreted at the admission scheduler, the server can
avoid disk overhead by sending the same buffer repeatedly after updating header information
such as stream offset or sequence number.
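
The buffer-reuse idea can be sketched as follows. This is only a minimal illustration, not the authors' implementation: the header layout (a 64-bit stream offset followed by a 32-bit sequence number), the payload size, and the name `DummyStreamSource` are all assumptions made for the example.

```python
import struct

# Hypothetical header layout: 64-bit stream offset + 32-bit sequence number,
# network byte order. The real packet format is not given in the text.
HEADER = struct.Struct("!QI")


class DummyStreamSource:
    """Generates packets for the admission scheduler without touching the disk."""

    def __init__(self, payload_size=1024):
        self.payload_size = payload_size
        # One buffer, allocated once; the payload bytes are never rewritten.
        self.buf = bytearray(HEADER.size + payload_size)
        self.offset = 0
        self.seq = 0

    def next_packet(self):
        # Only the header fields are updated in place before each send.
        HEADER.pack_into(self.buf, 0, self.offset, self.seq)
        self.offset += self.payload_size
        self.seq = (self.seq + 1) & 0xFFFFFFFF
        return memoryview(self.buf)
```

Because `next_packet` returns a view of the same buffer every time, a caller that needs to keep a packet must copy it (for example with `bytes(...)`) before requesting the next one.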
When a server fails, it simply stops transmitting data. Hence, a server failure can be inferred
from the lack of video data received at the admission scheduler. We assume that the admission
scheduler is located close to the servers so that worst-case arrival deadlines are known for each
and every video packet. Then the admission scheduler can declare a server to have failed if the
arrival deadline is exceeded by more than a threshold of, say, T_ASB seconds. This threshold is introduced to
reduce the possibility of false alarms caused by unexpected data delivery delays or occasional
packet losses.
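
The detection rule above can be sketched in a few lines. This is a minimal sketch under stated assumptions: the class name `AsbFailureDetector`, its methods, and the idea of tracking one "next expected deadline" per server are hypothetical choices for illustration; the text only specifies that a server is declared failed once its arrival deadline is exceeded by more than the threshold (`t_asb` in the code).

```python
class AsbFailureDetector:
    """Declares a server failed when its packet is overdue by more than t_asb."""

    def __init__(self, t_asb):
        self.t_asb = t_asb      # tolerance threshold, in seconds
        self.deadlines = {}     # server id -> worst-case arrival deadline
                                # of the next expected packet

    def on_packet(self, server_id, next_deadline):
        # A packet arrived: record the deadline of the following packet.
        self.deadlines[server_id] = next_deadline

    def failed_servers(self, now):
        # A server is suspected only after deadline + t_asb has passed,
        # which absorbs occasional delays and packet losses.
        return sorted(
            sid for sid, deadline in self.deadlines.items()
            if now > deadline + self.t_asb
        )


det = AsbFailureDetector(t_asb=0.5)
det.on_packet("s1", next_deadline=10.0)
det.on_packet("s2", next_deadline=10.0)
det.on_packet("s1", next_deadline=11.0)   # s1 keeps sending, s2 goes silent
late = det.failed_servers(now=10.6)       # only s2 is past deadline + t_asb
```

A larger threshold lowers the chance of a false alarm but lengthens the time a real failure goes unnoticed, which is the trade-off the threshold is meant to control.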
Note that the admission scheduler itself could also fail. However, this type of failure will
be less problematic because (a) while new streams cannot be started, the failure will not affect
existing streams; and (b) compared to the video servers, the admission scheduler is much
simpler and hence potentially far more reliable. For example, the admission scheduler can be
diskless so that disk failures are avoided, and ECC memory can be used to protect against
memory faults.
11.3.2 Server Reconfiguration for Block Striping
Upon declaring that a server has failed, the admission scheduler will send messages to
the surviving servers to notify them of the failure. The delay incurred will obviously be
implementation-dependent. For simplicity, we assume that the failure-detection delay is
bounded and the maximum is given by T_D seconds. Upon receiving the failure notification,
the servers will initiate a reconfiguration process to begin transmitting redundant blocks and
to retransmit the necessary redundant blocks.
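
The notification step can be pictured with the toy model below. It is a sketch only: `VideoServer`, `notify_survivors`, and the `transmit_redundant` flag are hypothetical names, message delivery is modeled as a direct method call, and the logic for choosing which redundant blocks to retransmit is deliberately left out because it depends on the striping layout.

```python
class VideoServer:
    """Toy model of a server that reacts to failure notifications."""

    def __init__(self, server_id):
        self.server_id = server_id
        self.failed_peers = set()

    def on_failure_notice(self, failed_id):
        # Entering reconfiguration: remember which peer has failed.
        self.failed_peers.add(failed_id)

    @property
    def transmit_redundant(self):
        # Redundant blocks are sent only once some peer is known to have failed.
        return bool(self.failed_peers)


def notify_survivors(servers, failed_id):
    """Admission-scheduler side: tell every surviving server about the failure."""
    survivors = [s for s in servers if s.server_id != failed_id]
    for s in survivors:
        s.on_failure_notice(failed_id)
    return survivors


servers = [VideoServer(i) for i in range(5)]   # a 5-server system
survivors = notify_survivors(servers, failed_id=2)
```

In a real deployment the notices would travel over the network, so each survivor would switch modes at a slightly different time within the bounded delay assumed in the text.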
Figure 11.4 depicts the scenario for reconfiguring a 5-server system under block striping.
Note that we consider only one video stream for illustration and analysis, while in practice the
same process occurs for all active video streams. All algorithms and procedures still apply, and
no modification is needed for the multi-stream case. Note also that redundant video blocks are
always retrieved; they are simply not transmitted when there is no failure. One might notice that during
normal operation, some disk bandwidth would then be wasted in retrieving redundant blocks
that are not needed. It is conceivable that one can reuse this wasted bandwidth to serve extra
video sessions during normal operation. However, these sessions will have to be disconnected