responsible for mapping the logical file offset for each I/O request to a specific OST
object and its object-local offset. In the RAID 0 pattern, files with multiple
objects are striped across those objects in a round-robin fashion, and the size of each
object is approximately the total file size divided by the number of objects.
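As a rough illustration of this round-robin mapping, the following sketch computes which object, and which object-local offset, a given logical offset falls into. The stripe size, stripe count, and function names are illustrative rather than Lustre's internal interfaces.

#include <stdio.h>
#include <stdint.h>

/* Illustrative RAID 0 (round-robin) mapping from a logical file offset
 * to an OST object index and object-local offset.  The stripe size and
 * stripe count would come from the file's layout. */
static void map_offset(uint64_t offset, uint64_t stripe_size,
                       unsigned stripe_count,
                       unsigned *obj_index, uint64_t *obj_offset)
{
    uint64_t stripe_no = offset / stripe_size;           /* which stripe overall */
    *obj_index  = (unsigned)(stripe_no % stripe_count);  /* round-robin OST object */
    *obj_offset = (stripe_no / stripe_count) * stripe_size
                + (offset % stripe_size);                /* offset within that object */
}

int main(void)
{
    /* Hypothetical layout: 1 MiB stripes over 4 objects. */
    uint64_t stripe_size = 1 << 20;
    unsigned stripe_count = 4;
    uint64_t offsets[] = { 0, 1 << 20, 5 << 20, (5 << 20) + 4096 };

    for (size_t i = 0; i < sizeof(offsets) / sizeof(offsets[0]); i++) {
        unsigned idx;
        uint64_t off;
        map_offset(offsets[i], stripe_size, stripe_count, &idx, &off);
        printf("logical %10llu -> object %u, object offset %llu\n",
               (unsigned long long)offsets[i], idx, (unsigned long long)off);
    }
    return 0;
}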
Typically the file layout does not change during the file's lifetime. If a change
is necessary (e.g., to change the number of stripes or migrate the file to different
OSTs), the MDS can revoke the layout lock, which drops it from the clients'
cache. The client will refetch the new layout from the MDS with a DLM lock
request upon its next access.
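The sketch below illustrates, under simplified assumptions, how a client might treat its cached layout as valid only while the layout lock is held, and refetch it on the next access after a revocation. The structure and function names are hypothetical stand-ins for the real DLM callbacks and RPCs.

#include <stdio.h>
#include <stdbool.h>

/* Illustrative client-side handling of the layout lock: the cached layout
 * is only trusted while the layout DLM lock is valid; after the MDS revokes
 * it (e.g., for a restripe or migration), the next access re-enqueues the
 * lock and refetches the layout. */
struct client_file {
    bool layout_lock_valid;  /* cleared by a lock-revocation callback */
    int  stripe_count;       /* part of the cached layout */
};

/* Called when the MDS revokes the layout lock. */
static void revoke_layout_lock(struct client_file *f)
{
    f->layout_lock_valid = false;
}

/* Stand-in for the DLM enqueue RPC whose reply carries the current layout. */
static void enqueue_layout_lock(struct client_file *f, int layout_from_mds)
{
    f->stripe_count = layout_from_mds;
    f->layout_lock_valid = true;
}

/* Every access first makes sure the cached layout is still covered by a lock. */
static void access_file(struct client_file *f, int layout_from_mds)
{
    if (!f->layout_lock_valid)
        enqueue_layout_lock(f, layout_from_mds);
    printf("I/O mapped over %d stripes\n", f->stripe_count);
}

int main(void)
{
    struct client_file f = { .layout_lock_valid = false, .stripe_count = 0 };
    access_file(&f, 4);       /* first access: fetch layout (4 stripes) */
    revoke_layout_lock(&f);   /* MDS revokes the lock, e.g., after a restripe */
    access_file(&f, 8);       /* next access refetches the new layout */
    return 0;
}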
In order to avoid a single point of contention during writes to a file with
multiple OST objects, the object size and client-generated timestamps are
stored with each write only on the OST object being modified. The aggregate
file size and timestamps are computed from the objects and layout only when
needed, such as for stat or append operations. The MDT inode object stores the
other attributes, such as owner, group, permissions, ACLs, and xattrs. When
combining the OST and MDT object attributes of a file for stat(), the object
with the newest change time provides the access and data modification times.
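The following sketch shows one way the aggregate attributes could be assembled from per-object replies: each object's local size is converted back to the logical end-of-file it implies, the largest value wins, and the timestamps are taken from the object with the newest change time. The layout parameters, structures, and values are illustrative.

#include <stdio.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical per-object attributes as returned by each OST. */
struct ost_attrs {
    uint64_t size;   /* object-local size in bytes */
    time_t   atime;
    time_t   mtime;
    time_t   ctime;
};

/* Logical end-of-file implied by object 'idx' having local size 'size',
 * obtained by inverting the round-robin mapping. */
static uint64_t object_to_logical_size(uint64_t size, unsigned idx,
                                       uint64_t stripe_size,
                                       unsigned stripe_count)
{
    if (size == 0)
        return 0;
    uint64_t last = size - 1;                 /* last object-local byte */
    uint64_t stripe_in_obj = last / stripe_size;
    uint64_t logical_stripe = stripe_in_obj * stripe_count + idx;
    return logical_stripe * stripe_size + (last % stripe_size) + 1;
}

int main(void)
{
    /* Hypothetical layout and per-object replies: 1 MiB stripes, 3 objects. */
    uint64_t stripe_size = 1 << 20;
    unsigned stripe_count = 3;
    struct ost_attrs objs[3] = {
        { 2 << 20, 1000, 1200, 1200 },
        { 1 << 20, 1000, 1100, 1100 },
        { (1 << 20) + 4096, 1000, 1300, 1300 },   /* newest change time */
    };

    uint64_t file_size = 0;
    unsigned newest = 0;
    for (unsigned i = 0; i < stripe_count; i++) {
        uint64_t s = object_to_logical_size(objs[i].size, i,
                                            stripe_size, stripe_count);
        if (s > file_size)
            file_size = s;                     /* largest implied EOF wins */
        if (objs[i].ctime > objs[newest].ctime)
            newest = i;                        /* newest ctime supplies times */
    }
    printf("aggregate size %llu, mtime %ld, atime %ld\n",
           (unsigned long long)file_size,
           (long)objs[newest].mtime, (long)objs[newest].atime);
    return 0;
}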
Clients keep data and metadata DLM locks referenced only for the duration
of a single system call. They cache unreferenced DLM locks in a variable-sized
LRU list per target, which is managed in conjunction with hints from the lock
servers. For as long as a client holds the locks, it can cache data,
attributes, ACLs, and directory contents. In the common use case of a single
client performing uncontended reads or writes of a file, only a single lock RPC
is needed for all data access, since the server will return a full-object lock on
the first enqueue. Similarly, when a client holds a directory-update lock, it can
cache all of the directory entries locally for lookup, as well as cache negative
entries for names that do not exist, until the directory lock is revoked.
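As a rough model of this behavior, the toy cache below parks unreferenced locks in a per-target LRU list, cancels the oldest entry when a (tunable) limit is exceeded, and reuses a cached lock on a match instead of issuing a new enqueue. The real lock manager is considerably more involved; all names here are illustrative.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Toy per-target cache of unreferenced DLM locks, kept in LRU order. */
struct cached_lock {
    char resource[32];         /* e.g. an object or directory identifier */
    struct cached_lock *next;  /* next-older entry in the LRU */
};

struct lock_lru {
    struct cached_lock *head;  /* most recently unreferenced */
    unsigned count, limit;     /* limit may be adjusted from server hints */
};

/* Park a lock in the LRU once its last reference is dropped,
 * cancelling the oldest entry if the list is over its limit. */
static void lru_insert(struct lock_lru *lru, const char *resource)
{
    struct cached_lock *l = calloc(1, sizeof(*l));
    snprintf(l->resource, sizeof(l->resource), "%s", resource);
    l->next = lru->head;
    lru->head = l;
    if (++lru->count > lru->limit) {
        struct cached_lock **p = &lru->head;
        while ((*p)->next)                 /* walk to the oldest entry */
            p = &(*p)->next;
        printf("cancel lock on %s\n", (*p)->resource);
        free(*p);
        *p = NULL;
        lru->count--;
    }
}

/* Reuse a cached lock if present: remove it from the LRU and return 1. */
static int lru_match(struct lock_lru *lru, const char *resource)
{
    for (struct cached_lock **p = &lru->head; *p; p = &(*p)->next) {
        if (strcmp((*p)->resource, resource) == 0) {
            struct cached_lock *l = *p;
            *p = l->next;
            free(l);
            lru->count--;
            return 1;          /* lock re-referenced without a new enqueue */
        }
    }
    return 0;                  /* miss: an enqueue RPC would be needed */
}

int main(void)
{
    struct lock_lru lru = { .limit = 2 };
    lru_insert(&lru, "objA");  /* system calls finished, locks go unreferenced */
    lru_insert(&lru, "dir1");
    lru_insert(&lru, "objB");  /* exceeds limit: oldest (objA) is cancelled */
    printf("objB cached: %d\n", lru_match(&lru, "objB"));
    printf("objA cached: %d\n", lru_match(&lru, "objA"));
    return 0;
}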
Clients aggregate I/O in their local caches to ensure bulk data is streamed
to or from the servers efficiently. On read, the client can detect strided read
patterns and use them to guide readahead. Similarly, on write, dirty pages
are aggregated whenever possible. In both cases, many aggregated bulk data
RPCs may be kept "on the wire" to hide latency and ensure full bandwidth
utilization of the underlying networking and storage hardware, and both the
number and size of outstanding RPCs can be tuned, e.g., to support remote
mounts over wide-area networks. The Lustre client also takes care to align these
aggregated bulk RPCs at regular offsets and sizes, which helps servers maximize
consistency between allocating writes and subsequent reads, reduces disk
seeks, and aligns the disk I/O operations with the underlying RAID
storage chunks.
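A minimal sketch of the alignment idea, assuming a fixed RPC boundary of 4 MiB: a dirty extent is split so that the first RPC ends on a boundary and subsequent RPCs start and end on aligned offsets. The boundary value and function names are illustrative.

#include <stdio.h>
#include <stdint.h>

/* Illustrative splitting of a dirty extent into bulk RPCs aligned to a
 * fixed RPC boundary (4 MiB here), so servers see regular offsets and sizes. */
#define RPC_BYTES (4ULL << 20)

static void issue_bulk_rpcs(uint64_t start, uint64_t len)
{
    uint64_t end = start + len;
    while (start < end) {
        /* End of the RPC-aligned window that contains 'start'. */
        uint64_t window_end = (start / RPC_BYTES + 1) * RPC_BYTES;
        uint64_t chunk_end = window_end < end ? window_end : end;
        printf("bulk write RPC: offset %llu, length %llu\n",
               (unsigned long long)start,
               (unsigned long long)(chunk_end - start));
        start = chunk_end;
    }
}

int main(void)
{
    /* A dirty extent that straddles two 4 MiB windows: the first RPC is
     * shortened so later RPCs fall on aligned boundaries. */
    issue_bulk_rpcs(3 << 20, 9 << 20);
    return 0;
}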
In order to manage unwritten data in the clients' write-back cache, each
client is given a grant of space from each OST. The grant space is consumed
with each write request and refilled with each write reply, subject to the
availability of space on the OST. This ensures that cached client writes cannot
exceed the available space in the file system.
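The toy accounting below captures the idea under simplified assumptions: cached writes consume grant, a write that would exceed the grant cannot be cached, and each write reply both retires dirty bytes and carries a fresh grant figure from the OST. The structure and values are illustrative.

#include <stdio.h>
#include <stdint.h>

/* Toy view of per-OST grant accounting: cached (unwritten) data consumes
 * grant; the reply to each write RPC may refill it, subject to space on
 * the OST. */
struct ost_grant {
    uint64_t grant;      /* bytes the OST has promised to this client */
    uint64_t dirty;      /* bytes currently cached but unwritten */
};

/* Cache a write only if grant covers it; otherwise it would have to be
 * sent synchronously (not modelled here). */
static int cache_write(struct ost_grant *g, uint64_t bytes)
{
    if (g->dirty + bytes > g->grant)
        return 0;                       /* insufficient grant: cannot cache */
    g->dirty += bytes;
    return 1;
}

/* A write RPC completes: the dirty data is on stable storage and the
 * reply carries a new grant figure from the OST. */
static void write_reply(struct ost_grant *g, uint64_t written,
                        uint64_t new_grant)
{
    g->dirty -= written;
    g->grant = new_grant;
}

int main(void)
{
    struct ost_grant g = { .grant = 8 << 20, .dirty = 0 };

    printf("cache 6 MiB: %s\n", cache_write(&g, 6 << 20) ? "ok" : "refused");
    printf("cache 4 MiB: %s\n", cache_write(&g, 4 << 20) ? "ok" : "refused");
    write_reply(&g, 6 << 20, 8 << 20);  /* flush and refill from the reply */
    printf("cache 4 MiB: %s\n", cache_write(&g, 4 << 20) ? "ok" : "refused");
    return 0;
}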
 