Parallel Data Storage and Access - Scientific Data Management


The combination of small form-factor, cost-effective SCSI disks for small
computers and a linear block address space led to disk arrays in the early
1990s. A disk array is a set of disks grouped together, usually into a common
physical box called the array, and presenting itself as a much larger
“virtual” disk with a linear block address space that interleaves the virtual
disk blocks across the component physical disks. Arrays promised higher
aggregate bandwidth, more concurrent random accesses per second, and large
storage systems that are efficient in both cost and volume. But with many more
mechanical disk devices in an array, the component failure rate also rises. In
a paper called “A Case for Redundant Arrays of Inexpensive Disks,” Patterson,
Gibson, and Katz described a taxonomy of “RAID” levels showing different ways
disk arrays could embed redundant copies of stored data [4]. With redundant
copies of data, the failure of a disk could be transparently detected,
tolerated, and, with online spare disks, repaired. The leading RAID levels are
level 0, nonredundant; level 1, duplication of each data disk; and level 5,
which stores the parity of the data disks (one disk's worth of capacity, with
parity blocks rotated across the array) so that a known failed disk can be
reconstructed from the XOR of all surviving disks.
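
The XOR reconstruction described above can be illustrated with a minimal
sketch. It assumes each disk is represented by a single equal-length block of
bytes; the helper name xor_blocks and the sample data are hypothetical and
chosen only for illustration.

```python
from functools import reduce


def xor_blocks(blocks):
    """Return the byte-wise XOR of a list of equal-length byte blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three data "disks", one block each (illustrative values only).
data = [b"\x01\x02\x03\x04", b"\x10\x20\x30\x40", b"\xaa\xbb\xcc\xdd"]

# The parity block is the XOR of all data blocks.
parity = xor_blocks(data)

# Suppose disk 1 is known to have failed. XOR-ing the surviving data blocks
# with the parity block yields the missing block.
recovered = xor_blocks([data[0], data[2], parity])
```

The recovery works because XOR is its own inverse: every surviving block
cancels out of the parity, leaving only the contents of the failed disk. Note
that this requires knowing which disk failed, as the text states.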

SCSI is still with us, and its lower-cost competitors, advanced technology
attachment (ATA) and serial ATA (SATA), share the same linear block address
space and embedded independent controller. RAID has been relabeled Redundant
Arrays of Independent Disks because the expensive, large form-factor disks
have been displaced by relatively inexpensive, smaller form-factor disks. RAID
is a core data management tool in all large data systems. And most important,
the linear block address space abstraction is the basic and central storage
virtualization scheme at work today.
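
To make the linear block address abstraction concrete, the sketch below maps a
virtual block number onto a physical disk and a block on that disk. It assumes
a simple round-robin (RAID level 0 style) layout with no parity; the function
name locate_block and the stripe_unit parameter are hypothetical, and real
arrays may use other layouts.

```python
def locate_block(virtual_block, num_disks, stripe_unit=1):
    """Map a virtual block number to (disk index, block number on that disk).

    Assumes round-robin interleaving: stripe_unit consecutive virtual blocks
    go to one disk, then the next stripe_unit blocks go to the next disk.
    """
    if virtual_block < 0 or num_disks < 1 or stripe_unit < 1:
        raise ValueError("arguments must be non-negative / positive integers")
    # Which stripe unit holds this block, and where inside that unit?
    unit_index, offset_in_unit = divmod(virtual_block, stripe_unit)
    # Stripe units are dealt out to the disks in turn, row by row.
    row, disk = divmod(unit_index, num_disks)
    return disk, row * stripe_unit + offset_in_unit


# Layout of the first eight virtual blocks on a four-disk array.
layout = [locate_block(v, num_disks=4) for v in range(8)]
```

Because consecutive virtual blocks land on different disks, a large sequential
transfer can keep all disks busy at once, which is the source of the higher
aggregate bandwidth that arrays promise.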

2.2.1.1 General Parallel File System

IBM's General Parallel File System (GPFS) grew out of the Tiger Shark
multimedia file system, developed in the mid-1990s. Variants of GPFS are
available for both AIX and Linux. GPFS is one of the most widely deployed
parallel file systems today.

GPFS implements a block-based file system, with clients either directly
accessing disk blocks via a storage area network or indirectly accessing disk
blocks through a software layer (called virtual shared disk [VSD] or network
shared disk) that redirects operations over a network to a remote system that
performs the access on the client's behalf. In large deployments, the cost of
connecting all clients to the storage area network is usually prohibitive, so
the software-assisted block access is more often employed.
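
The two access paths described above can be pictured with a small sketch. This
is not GPFS code or its API: the class names InMemoryDisk, BlockServer, and
SharedDiskClient are hypothetical, and an ordinary method call stands in for
the network hop. The sketch only shows that a client can use the same block
interface whether it reaches the disk directly or through a forwarding layer.

```python
class InMemoryDisk:
    """Toy block device: fixed-size blocks held in memory (hypothetical)."""

    def __init__(self, num_blocks, block_size=512):
        self.block_size = block_size
        self.blocks = [bytes(block_size) for _ in range(num_blocks)]

    def read_block(self, block_no):
        return self.blocks[block_no]

    def write_block(self, block_no, data):
        if len(data) != self.block_size:
            raise ValueError("data must be exactly one block long")
        self.blocks[block_no] = bytes(data)


class BlockServer:
    """Stands in for the remote system that is attached to the disk."""

    def __init__(self, disk):
        self.disk = disk

    def handle(self, op, block_no, data=None):
        # Perform the access on the client's behalf.
        if op == "read":
            return self.disk.read_block(block_no)
        if op == "write":
            self.disk.write_block(block_no, data)
            return None
        raise ValueError(f"unknown operation: {op}")


class SharedDiskClient:
    """Offers the disk interface but forwards each operation to a server.

    The call to server.handle() is a placeholder for a network request.
    """

    def __init__(self, server):
        self.server = server
        self.block_size = server.disk.block_size

    def read_block(self, block_no):
        return self.server.handle("read", block_no)

    def write_block(self, block_no, data):
        self.server.handle("write", block_no, data)


disk = InMemoryDisk(num_blocks=8)
direct = disk                                   # direct path: client sees the disk
indirect = SharedDiskClient(BlockServer(disk))  # indirect path: via a server

payload = b"x" * disk.block_size
indirect.write_block(3, payload)
```

Code layered above either object needs no knowledge of which path is in use,
which is why the indirect path can substitute for a direct storage area
network connection when connecting every client would be too costly.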

Since GPFS is a parallel file system, blocks move through multiple paths,
usually multiple servers, and are striped across multiple devices to allow
concurrent access from many clients and high aggregate throughput. To match
the network bandwidths of today's servers, disks used in GPFS deployments are
typically combined into arrays. Files are striped across all RAIDs with
