committed asynchronously to disk with minimal dependency on the network
protocol. Servers may therefore avoid all unnecessary synchronous disk oper-
ations since clients retain uncommitted transactions for replay in the event of
server failure. The main requirement is that object transactions are commit-
ted in total order. This enables clients to track committed transactions using
a single last committed transaction counter. The last committed transaction
number is piggy-backed on all RPC replies so that clients can prune their
replay buffers when transactions are eventually committed.
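The pruning rule described above can be sketched in a few lines. This is a minimal illustration, not Lustre's implementation: the `ReplayBuffer` class, its method names, and the transaction numbers are all hypothetical, and the sketch assumes only what the text states, namely that transactions commit in total order and that each reply carries the last committed transaction number.

```python
from collections import OrderedDict


class ReplayBuffer:
    """Client-side buffer of requests the server has not yet committed.

    Hypothetical model: names and structure are illustrative only.
    """

    def __init__(self):
        # transaction number -> request payload, kept in send order
        self._pending = OrderedDict()

    def record(self, transno, request):
        """Remember a request until the server reports it as committed."""
        self._pending[transno] = request

    def on_reply(self, last_committed):
        """Prune every transaction at or below the piggy-backed counter.

        Total commit order is what makes a single counter sufficient.
        """
        for transno in [t for t in self._pending if t <= last_committed]:
            del self._pending[transno]

    def to_replay(self):
        """Requests to resend, in transaction order, after a server failure."""
        return [self._pending[t] for t in sorted(self._pending)]


buf = ReplayBuffer()
for transno in (101, 102, 103):
    buf.record(transno, f"write-{transno}")

# A later RPC reply reports that everything up to 102 is on disk.
buf.on_reply(last_committed=102)
remaining = buf.to_replay()
```

After the reply, only the request for transaction 103 is still held for possible replay.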
8.2.5 Metadata Server
Metadata Server (MDS) nodes export one or more Metadata Targets (MDTs), each
of which is stored on a single underlying OSD. The MDTs are typically stored
on RAID 1+0 storage arrays that provide good random I/O performance, such
as high-RPM disks or solid-state storage. The MDTs contain the
application-visible file system namespace (filenames, directories), as well as
file access (ownership, permissions, ACLs), file layout, and other attributes.
The MDS controls OST object selection and assignment to files for load
balancing, unless specified directly by the client/application. In order to reduce
latency at file open/create time, the MDS pre-allocates objects on each of the
available OSTs and chooses objects from this pool when a new file is first
opened. This also ensures that the file's layout can be created in a single local
atomic transaction, which avoids a complex distributed operation for each file.
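As a rough illustration of why pre-allocation keeps file creation local, the sketch below models per-OST pools of pre-created object IDs. Everything here is an assumption made for the example: the `ObjectPool` class, the batch size, and the round-robin choice of OSTs (a stand-in for whatever load balancing the MDS actually applies).

```python
from collections import deque


class ObjectPool:
    """Per-OST pools of pre-created object IDs held by the MDS (toy model)."""

    def __init__(self, ost_count, batch=4):
        self._next_id = [0] * ost_count
        self._pools = [deque() for _ in range(ost_count)]
        self._batch = batch
        self._cursor = 0  # round-robin start; assumed, not Lustre's policy
        for ost in range(ost_count):
            self._refill(ost)

    def _refill(self, ost):
        # In a real system this step would be a request to the OST,
        # issued ahead of time rather than on the file-create path.
        for _ in range(self._batch):
            self._pools[ost].append(self._next_id[ost])
            self._next_id[ost] += 1

    def allocate_layout(self, stripe_count):
        """Pick one pre-allocated object on each of stripe_count OSTs."""
        n = len(self._pools)
        if stripe_count > n:
            raise ValueError("stripe_count exceeds the number of OSTs")
        layout = []
        for i in range(stripe_count):
            ost = (self._cursor + i) % n
            if not self._pools[ost]:
                self._refill(ost)
            layout.append((ost, self._pools[ost].popleft()))
        self._cursor = (self._cursor + stripe_count) % n
        return layout


pool = ObjectPool(ost_count=4)
first = pool.allocate_layout(stripe_count=2)   # (ost index, object id) pairs
second = pool.allocate_layout(stripe_count=3)
```

Because the objects already exist, `allocate_layout` only touches local state, which is the property that lets the layout be recorded in a single local transaction.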
To allow tuning of a file layout optimally for application I/O patterns, it is
possible to specify a different file layout (number of OST stripes, stripe size,
OST storage pool) independently for each file. For simplicity, a default file
layout can also be set on a parent directory, in which case it is inherited
by new files created therein.
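The precedence between a per-file layout and an inherited directory default can be sketched as follows. The `Layout` type, the `layout_for_new_file` helper, and the fallback values are hypothetical; the sketch also assumes the layout is inherited as a whole rather than field by field, which the text does not specify.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Layout:
    stripe_count: int            # number of OST stripes
    stripe_size: int             # bytes per stripe
    pool: Optional[str] = None   # OST storage pool name, if any


# Assumed fallback used when neither the file nor its parent specifies a layout.
FS_DEFAULT = Layout(stripe_count=1, stripe_size=1 << 20)


def layout_for_new_file(explicit, parent_default, fs_default=FS_DEFAULT):
    """Return the layout a newly created file would receive.

    An explicit per-file layout wins; otherwise the parent directory's
    default is inherited; otherwise the file system default applies.
    """
    if explicit is not None:
        return explicit
    if parent_default is not None:
        return parent_default
    return fs_default


dir_default = Layout(stripe_count=4, stripe_size=4 << 20, pool="flash")
inherited = layout_for_new_file(None, dir_default)
overridden = layout_for_new_file(Layout(8, 1 << 20), dir_default)
fallback = layout_for_new_file(None, None)
```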
After a file is opened, the MDS does not participate in client I/O oper-
ations until the file is closed again. To avoid repeatedly modifying the OST
objects during read operations, a file's access time is cached in memory on the
client inodes and OST objects. It is only written to disk on the MDS inode
at close time.
When a file is unlinked from the MDT namespace, its inode and OST objects
may be destroyed only after the last process holding the file open exits
or closes it. Lustre therefore keeps a persistent reference on the MDT inode
until the last close in case the MDS restarts with open unlinked files. The ob-
ject destroy RPCs are logged in the same transaction that removes the MDT
inode, but only sent when that transaction commits to ensure that a rollback
of the pending MDT unlink does not resurrect references to now-destroyed
OST objects.
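The ordering described above (remove the name immediately, keep the inode until last close, send object destroys only after commit) can be modeled with a small toy. The `Mdt` class and all of its fields are hypothetical, and the model leaves out rollback and recovery; it only shows that no destroy request is issued before the transaction that removed the inode has committed.

```python
class Mdt:
    """Toy MDT: names, inodes, and destroy records awaiting commit."""

    def __init__(self):
        self.names = {}          # filename -> inode number
        self.inodes = {}         # inode number -> state dict
        self.pending = []        # destroy records logged in uncommitted transactions
        self.sent_destroys = []  # destroy requests actually sent to OSTs

    def create(self, name, ino, objects):
        self.names[name] = ino
        self.inodes[ino] = {"objects": list(objects), "opens": 0, "unlinked": False}

    def open(self, name):
        ino = self.names[name]
        self.inodes[ino]["opens"] += 1
        return ino

    def unlink(self, name):
        # The name disappears at once; the inode survives while it is open.
        ino = self.names.pop(name)
        inode = self.inodes[ino]
        inode["unlinked"] = True
        if inode["opens"] == 0:
            self._remove_inode(ino)

    def close(self, ino):
        inode = self.inodes[ino]
        inode["opens"] -= 1
        if inode["unlinked"] and inode["opens"] == 0:
            self._remove_inode(ino)

    def _remove_inode(self, ino):
        # Inode removal and destroy records belong to the same transaction.
        inode = self.inodes.pop(ino)
        self.pending.append(list(inode["objects"]))

    def commit(self):
        # Only now is it safe to send the destroys to the OSTs.
        for records in self.pending:
            self.sent_destroys.extend(records)
        self.pending.clear()


mdt = Mdt()
mdt.create("data.bin", ino=1, objects=["ost0:7", "ost1:7"])
handle = mdt.open("data.bin")
mdt.unlink("data.bin")
still_present = handle in mdt.inodes      # inode kept while the file is open
mdt.close(handle)
before_commit = list(mdt.sent_destroys)   # nothing sent yet
mdt.commit()
after_commit = list(mdt.sent_destroys)
```

If the pending transaction were lost before `commit`, the destroy records would be lost with it, so the OST objects would still exist for the restored inode.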
As with the OSS, the MDS has its own LDLM servers, one for each MDT
that it services. Each inode can have one or more IBITS DLM locks associated
with its resource. The lock bits are used to protect different aspects of the in-
ode, such as its namespace lookup visibility (name, permission, owner, group,
 