responsible for mapping the logical file offset for each I/O request to a specific OST
object and its object-local offset. In the RAID 0 pattern, files with multiple
objects are striped across those objects in a round-robin fashion, and the size of each
object is approximately the total file size divided by the number of objects.
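As a rough illustration of this round-robin mapping, the following sketch computes which object, and which object-local offset, a given logical offset falls into. The stripe size, stripe count, and function names are illustrative rather than Lustre's internal interfaces.

#include <stdio.h>
#include <stdint.h>

/* Illustrative RAID 0 (round-robin) mapping from a logical file offset
 * to an OST object index and object-local offset.  The stripe size and
 * stripe count would come from the file's layout. */
static void map_offset(uint64_t offset, uint64_t stripe_size,
                       unsigned stripe_count,
                       unsigned *obj_index, uint64_t *obj_offset)
{
    uint64_t stripe_no = offset / stripe_size;           /* which stripe overall */
    *obj_index  = (unsigned)(stripe_no % stripe_count);  /* round-robin OST object */
    *obj_offset = (stripe_no / stripe_count) * stripe_size
                + (offset % stripe_size);                /* offset within that object */
}

int main(void)
{
    /* Hypothetical layout: 1 MiB stripes over 4 objects. */
    uint64_t stripe_size = 1 << 20;
    unsigned stripe_count = 4;
    uint64_t offsets[] = { 0, 1 << 20, 5 << 20, (5 << 20) + 4096 };

    for (size_t i = 0; i < sizeof(offsets) / sizeof(offsets[0]); i++) {
        unsigned idx;
        uint64_t off;
        map_offset(offsets[i], stripe_size, stripe_count, &idx, &off);
        printf("logical %10llu -> object %u, object offset %llu\n",
               (unsigned long long)offsets[i], idx, (unsigned long long)off);
    }
    return 0;
}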
Typically the file layout does not change during the file's lifetime. If a change
is necessary (e.g., to change the number of stripes or migrate the file to different
OSTs), the MDS can revoke the layout lock, which drops it from the clients'
cache. The client will refetch the new layout from the MDS with a DLM lock
request upon its next access.
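The sketch below illustrates, under simplified assumptions, how a client might treat its cached layout as valid only while the layout lock is held, and refetch it on the next access after a revocation. The structure and function names are hypothetical stand-ins for the real DLM callbacks and RPCs.

#include <stdio.h>
#include <stdbool.h>

/* Illustrative client-side handling of the layout lock: the cached layout
 * is only trusted while the layout DLM lock is valid; after the MDS revokes
 * it (e.g., for a restripe or migration), the next access re-enqueues the
 * lock and refetches the layout. */
struct client_file {
    bool layout_lock_valid;  /* cleared by a lock-revocation callback */
    int  stripe_count;       /* part of the cached layout */
};

/* Called when the MDS revokes the layout lock. */
static void revoke_layout_lock(struct client_file *f)
{
    f->layout_lock_valid = false;
}

/* Stand-in for the DLM enqueue RPC whose reply carries the current layout. */
static void enqueue_layout_lock(struct client_file *f, int layout_from_mds)
{
    f->stripe_count = layout_from_mds;
    f->layout_lock_valid = true;
}

/* Every access first makes sure the cached layout is still covered by a lock. */
static void access_file(struct client_file *f, int layout_from_mds)
{
    if (!f->layout_lock_valid)
        enqueue_layout_lock(f, layout_from_mds);
    printf("I/O mapped over %d stripes\n", f->stripe_count);
}

int main(void)
{
    struct client_file f = { .layout_lock_valid = false, .stripe_count = 0 };
    access_file(&f, 4);       /* first access: fetch layout (4 stripes) */
    revoke_layout_lock(&f);   /* MDS revokes the lock, e.g., after a restripe */
    access_file(&f, 8);       /* next access refetches the new layout */
    return 0;
}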
In order to avoid a single point of contention during writes to a file with
multiple OST objects, the object size and client-generated timestamps are
stored with each write only on the OST object being modified. The aggregate
file size and timestamps are computed from the objects and layout only when
needed, such as for stat or append operations. The MDT inode object stores the
other attributes, such as owner, group, permissions, ACLs, and xattrs. When
combining the OST and MDT object attributes of a file for stat(), the object
with the newest change time provides the access and data modification times.
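The following sketch shows one way the aggregate attributes could be assembled from per-object replies: each object's local size is converted back to the logical end-of-file it implies, the largest value wins, and the timestamps are taken from the object with the newest change time. The layout parameters, structures, and values are illustrative.

#include <stdio.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical per-object attributes as returned by each OST. */
struct ost_attrs {
    uint64_t size;   /* object-local size in bytes */
    time_t   atime;
    time_t   mtime;
    time_t   ctime;
};

/* Logical end-of-file implied by object 'idx' having local size 'size',
 * obtained by inverting the round-robin mapping. */
static uint64_t object_to_logical_size(uint64_t size, unsigned idx,
                                       uint64_t stripe_size,
                                       unsigned stripe_count)
{
    if (size == 0)
        return 0;
    uint64_t last = size - 1;                 /* last object-local byte */
    uint64_t stripe_in_obj = last / stripe_size;
    uint64_t logical_stripe = stripe_in_obj * stripe_count + idx;
    return logical_stripe * stripe_size + (last % stripe_size) + 1;
}

int main(void)
{
    /* Hypothetical layout and per-object replies: 1 MiB stripes, 3 objects. */
    uint64_t stripe_size = 1 << 20;
    unsigned stripe_count = 3;
    struct ost_attrs objs[3] = {
        { 2 << 20, 1000, 1200, 1200 },
        { 1 << 20, 1000, 1100, 1100 },
        { (1 << 20) + 4096, 1000, 1300, 1300 },   /* newest change time */
    };

    uint64_t file_size = 0;
    unsigned newest = 0;
    for (unsigned i = 0; i < stripe_count; i++) {
        uint64_t s = object_to_logical_size(objs[i].size, i,
                                            stripe_size, stripe_count);
        if (s > file_size)
            file_size = s;                     /* largest implied EOF wins */
        if (objs[i].ctime > objs[newest].ctime)
            newest = i;                        /* newest ctime supplies times */
    }
    printf("aggregate size %llu, mtime %ld, atime %ld\n",
           (unsigned long long)file_size,
           (long)objs[newest].mtime, (long)objs[newest].atime);
    return 0;
}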
Clients keep data and metadata DLM locks referenced only for the duration
of a single system call. They cache unreferenced DLM locks in a variable-sized
LRU list per target, which is managed in conjunction with hints from the lock
servers. For as long as a client holds the locks, it can cache data,
attributes, ACLs, and directory contents. In the common use case of a single
client performing uncontended reads or writes of a file, only a single lock RPC
is needed for all data access, since the server will return a full-object lock on
the first enqueue. Similarly, when a client holds a directory-update lock, it can
cache all of the directory entries locally for lookup, as well as cache negative
entries for names that do not exist, until the directory lock is revoked.
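As a rough model of this behavior, the toy cache below parks unreferenced locks in a per-target LRU list, cancels the oldest entry when a (tunable) limit is exceeded, and reuses a cached lock on a match instead of issuing a new enqueue. The real lock manager is considerably more involved; all names here are illustrative.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Toy per-target cache of unreferenced DLM locks, kept in LRU order. */
struct cached_lock {
    char resource[32];         /* e.g. an object or directory identifier */
    struct cached_lock *next;  /* next-older entry in the LRU */
};

struct lock_lru {
    struct cached_lock *head;  /* most recently unreferenced */
    unsigned count, limit;     /* limit may be adjusted from server hints */
};

/* Park a lock in the LRU once its last reference is dropped,
 * cancelling the oldest entry if the list is over its limit. */
static void lru_insert(struct lock_lru *lru, const char *resource)
{
    struct cached_lock *l = calloc(1, sizeof(*l));
    snprintf(l->resource, sizeof(l->resource), "%s", resource);
    l->next = lru->head;
    lru->head = l;
    if (++lru->count > lru->limit) {
        struct cached_lock **p = &lru->head;
        while ((*p)->next)                 /* walk to the oldest entry */
            p = &(*p)->next;
        printf("cancel lock on %s\n", (*p)->resource);
        free(*p);
        *p = NULL;
        lru->count--;
    }
}

/* Reuse a cached lock if present: remove it from the LRU and return 1. */
static int lru_match(struct lock_lru *lru, const char *resource)
{
    for (struct cached_lock **p = &lru->head; *p; p = &(*p)->next) {
        if (strcmp((*p)->resource, resource) == 0) {
            struct cached_lock *l = *p;
            *p = l->next;
            free(l);
            lru->count--;
            return 1;          /* lock re-referenced without a new enqueue */
        }
    }
    return 0;                  /* miss: an enqueue RPC would be needed */
}

int main(void)
{
    struct lock_lru lru = { .limit = 2 };
    lru_insert(&lru, "objA");  /* system calls finished, locks go unreferenced */
    lru_insert(&lru, "dir1");
    lru_insert(&lru, "objB");  /* exceeds limit: oldest (objA) is cancelled */
    printf("objB cached: %d\n", lru_match(&lru, "objB"));
    printf("objA cached: %d\n", lru_match(&lru, "objA"));
    return 0;
}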
Clients aggregate I/O in their local caches to ensure bulk data is streamed
to or from the servers efficiently. On read, the client can detect strided read
patterns and use them to guide readahead. Similarly, on write, dirty pages
are aggregated whenever possible. In both cases, many aggregated bulk data
RPCs may be kept "on the wire" to hide latency and ensure full bandwidth
utilization of the underlying networking and storage hardware, and both the
number and size of outstanding RPCs can be tuned, e.g., to support remote
mounts over wide-area networks. The Lustre client also takes care to align these
aggregated bulk RPCs at regular offsets and sizes, which helps servers maximize
consistency between allocating writes and subsequent reads, reduces disk
seeks, and aligns the disk I/O operations with the underlying RAID
storage chunks.
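A minimal sketch of the alignment idea, assuming a fixed RPC boundary of 4 MiB: a dirty extent is split so that the first RPC ends on a boundary and subsequent RPCs start and end on aligned offsets. The boundary value and function names are illustrative.

#include <stdio.h>
#include <stdint.h>

/* Illustrative splitting of a dirty extent into bulk RPCs aligned to a
 * fixed RPC boundary (4 MiB here), so servers see regular offsets and sizes. */
#define RPC_BYTES (4ULL << 20)

static void issue_bulk_rpcs(uint64_t start, uint64_t len)
{
    uint64_t end = start + len;
    while (start < end) {
        /* End of the RPC-aligned window that contains 'start'. */
        uint64_t window_end = (start / RPC_BYTES + 1) * RPC_BYTES;
        uint64_t chunk_end = window_end < end ? window_end : end;
        printf("bulk write RPC: offset %llu, length %llu\n",
               (unsigned long long)start,
               (unsigned long long)(chunk_end - start));
        start = chunk_end;
    }
}

int main(void)
{
    /* A dirty extent that straddles two 4 MiB windows: the first RPC is
     * shortened so later RPCs fall on aligned boundaries. */
    issue_bulk_rpcs(3 << 20, 9 << 20);
    return 0;
}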
In order to manage unwritten data in the clients' write-back cache, each
client is given a grant of space from each OST. The grant space is consumed
with each write request and refilled with each write reply, subject to the
availability of space on the OST. This ensures that cached client writes cannot
exceed the available space in the file system.
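The toy accounting below captures the idea under simplified assumptions: cached writes consume grant, a write that would exceed the grant cannot be cached, and each write reply both retires dirty bytes and carries a fresh grant figure from the OST. The structure and values are illustrative.

#include <stdio.h>
#include <stdint.h>

/* Toy view of per-OST grant accounting: cached (unwritten) data consumes
 * grant; the reply to each write RPC may refill it, subject to space on
 * the OST. */
struct ost_grant {
    uint64_t grant;      /* bytes the OST has promised to this client */
    uint64_t dirty;      /* bytes currently cached but unwritten */
};

/* Cache a write only if grant covers it; otherwise it would have to be
 * sent synchronously (not modelled here). */
static int cache_write(struct ost_grant *g, uint64_t bytes)
{
    if (g->dirty + bytes > g->grant)
        return 0;                       /* insufficient grant: cannot cache */
    g->dirty += bytes;
    return 1;
}

/* A write RPC completes: the dirty data is on stable storage and the
 * reply carries a new grant figure from the OST. */
static void write_reply(struct ost_grant *g, uint64_t written,
                        uint64_t new_grant)
{
    g->dirty -= written;
    g->grant = new_grant;
}

int main(void)
{
    struct ost_grant g = { .grant = 8 << 20, .dirty = 0 };

    printf("cache 6 MiB: %s\n", cache_write(&g, 6 << 20) ? "ok" : "refused");
    printf("cache 4 MiB: %s\n", cache_write(&g, 4 << 20) ? "ok" : "refused");
    write_reply(&g, 6 << 20, 8 << 20);  /* flush and refill from the reply */
    printf("cache 4 MiB: %s\n", cache_write(&g, 4 << 20) ? "ok" : "refused");
    return 0;
}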
 