Diagnosis and Performance Monitoring - Deploying and Managing a Cloud Infrastructure

This is a critical point of failure: if a host bus adapter (HBA) fails during important I/O operations, the data involved could be lost. Unfortunately, there is no reliable way to predict when an HBA will fail, because most electronic devices fail without warning. Some, however, show warning signs first, such as intermittent disconnection from attached storage devices, dropped I/O operations, or I/O operations that take longer than normal.
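One way to watch for the "longer than normal" symptom is to compare recent I/O latencies against a baseline recorded while the adapter was healthy. The following is a minimal sketch, not a production monitor. The function name `slow_io_samples`, the sample values, and the threshold of three standard deviations are all assumptions chosen for illustration.

```python
from statistics import mean, stdev


def slow_io_samples(baseline_ms, recent_ms, k=3.0):
    """Return the recent latencies that exceed baseline mean + k * stdev.

    baseline_ms: latencies (ms) collected while the path was healthy.
    recent_ms:   latencies (ms) from the current monitoring window.
    k:           how many standard deviations count as "abnormal" (assumed).
    """
    if len(baseline_ms) < 2:
        raise ValueError("need at least two baseline samples")
    threshold = mean(baseline_ms) + k * stdev(baseline_ms)
    return [x for x in recent_ms if x > threshold]


# Hypothetical measurements, in milliseconds.
baseline = [4.8, 5.1, 5.0, 4.9, 5.2, 5.0]
recent = [5.1, 4.9, 27.5, 5.0, 31.2]
print(slow_io_samples(baseline, recent))
```

A run of flagged samples does not prove the HBA is failing, since a slow disk or a congested fabric looks the same from the host, but it is a reasonable trigger for a closer look.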

Memory Failure Main memory is one of the core components of a computer system. RAM performance is key to system performance, and memory failure is not an option. Disks are routinely configured for backup and redundancy, whereas comparable protection for memory is limited to features such as error-correcting code (ECC) modules and, on some servers, memory mirroring. A memory failure can crash an entire system because the failed module may hold important data that the system or its components are using. A computer may not even start when a defective memory module is installed. It is therefore imperative to check a system's memory regularly, and signs of failure must be detected ahead of time to prevent costly and untimely downtime.
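On a Linux host with ECC memory, one early sign of a failing module is a rising count of corrected errors. The sketch below reads those counters from the kernel's EDAC sysfs tree. It assumes an EDAC driver is loaded for the platform; the function names and the zero threshold are illustrative, and on a machine without EDAC support the result is simply empty.

```python
from pathlib import Path


def corrected_error_counts(edac_root="/sys/devices/system/edac/mc"):
    """Map each memory controller (mc0, mc1, ...) to its corrected-error count."""
    counts = {}
    for ce_file in sorted(Path(edac_root).glob("mc*/ce_count")):
        try:
            counts[ce_file.parent.name] = int(ce_file.read_text().strip())
        except (OSError, ValueError):
            continue  # skip entries that are unreadable or malformed
    return counts


def controllers_over(counts, threshold=0):
    """Return the controllers whose corrected-error count exceeds threshold."""
    return sorted(name for name, n in counts.items() if n > threshold)


counts = corrected_error_counts()
print("corrected errors per controller:", counts)
print("controllers to investigate:", controllers_over(counts))
```

Corrected errors are, by definition, already repaired by ECC, so a nonzero count is a warning rather than an outage. What matters is the trend: a count that keeps climbing suggests the module should be replaced during planned maintenance.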

NIC Failure The network interface card (NIC) is a computer system's gateway to the network and beyond. It is the main communication interface and is essential for a distributed system that is supposed to be accessible from anywhere in the world. However, a NIC failure is also comparatively easy to tolerate. Losing a NIC means losing connectivity, but it does not cause a system failure. There would be network downtime and the server might not be reachable, but the failure is easily contained and can be prevented through NIC teaming/bonding or link aggregation. It is certainly not as fatal as a memory or CPU failure.
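Teaming only protects a server while every member link is healthy, so a bond that has silently lost one NIC is no longer redundant. The sketch below is one way to spot that on Linux by reading each interface's `operstate` file under `/sys/class/net`. The interface names `eth0` and `eth1` are hypothetical, and the function name is an assumption made for this example.

```python
from pathlib import Path


def down_interfaces(interfaces, sysfs_net="/sys/class/net"):
    """Return the interfaces whose operstate is not 'up'.

    An interface with no readable operstate file is reported as well.
    Note: some virtual interfaces report 'unknown' even when working,
    so this check is meant for physical NICs.
    """
    down = []
    for name in interfaces:
        try:
            state = (Path(sysfs_net) / name / "operstate").read_text().strip()
        except OSError:
            state = "missing"
        if state != "up":
            down.append(name)
    return down


# Hypothetical members of a two-NIC team.
print(down_interfaces(["eth0", "eth1"]))
```

If the returned list is non-empty while the server is still reachable, the team is doing its job, but the failed member should be replaced before the remaining link becomes a single point of failure.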

CPU Failure The central processing unit (CPU) is the brain of the computer, hence the word central. A CPU failure means total failure, and it is one of the worst kinds of failure that can occur in your system in terms of cost and lost productivity. It brings the whole system down, and most operations in progress will be unrecoverable. The CPU is also one of the most expensive parts of the system and one of the hardest to replace in terms of installation. Unlike a NIC, HBA, or disk, which can simply be plugged into the board or its various sockets, a failed CPU must be removed from the board entirely and a new one fitted in its place.

Summary

This chapter is about the performance of the underlying infrastructure rather than the virtualized environment of cloud computing. We focused on how the main hardware components perform and how they fail. The most prominent of these components is the disk drive, which is incidentally also the largest bottleneck in the system. The speed of the disk drive has improved little since the 1990s, while its capacity and affordability have improved by leaps and bounds. It is this relatively weak performance that we examined. A disk's key performance indicators are its access time and its data transfer rate. The access time is the time it takes for the mechanical parts to position the read/write head over the track and sector that contain the requested data. One factor taken into account is the spindle speed, the rate at which the disk rotates, measured in revolutions per minute (RPM).
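To make the link between spindle speed and access time concrete, here is a minimal sketch. It assumes the common approximation that, on average, the platter must turn half a revolution before the target sector arrives under the head, and that access time is the sum of seek time and this rotational latency. The function names and the 9 ms seek time are illustrative assumptions, not figures from this chapter.

```python
def avg_rotational_latency_ms(rpm):
    """Average rotational latency in ms: the time for half a revolution."""
    if rpm <= 0:
        raise ValueError("rpm must be positive")
    ms_per_revolution = 60_000 / rpm  # 60,000 ms in one minute
    return ms_per_revolution / 2


def avg_access_time_ms(seek_ms, rpm):
    """Approximate access time: seek time plus average rotational latency."""
    return seek_ms + avg_rotational_latency_ms(rpm)


# Assumed seek time of 9 ms, used only for illustration.
for rpm in (5400, 7200, 15000):
    latency = avg_rotational_latency_ms(rpm)
    access = avg_access_time_ms(9.0, rpm)
    print(f"{rpm:>6} RPM: latency {latency:.2f} ms, access {access:.2f} ms")
```

At 7,200 RPM one revolution takes about 8.33 ms, so the average rotational latency is about 4.17 ms. At 15,000 RPM it falls to 2 ms, which is why faster spindles shorten access time even when the seek mechanism is unchanged.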
