system named obdfs. Over the next three years, and with early funding from
the ASCI Path Forward project, obdfs evolved into the first versions of Lus-
tre [4] with a debut at number 5 in the Top500 on the 1000-node MCR cluster
at Lawrence Livermore National Laboratory (LLNL) [12]. Continued develop-
ment over the next ten years saw increasing adoption of Lustre on a wide range
of HPC systems in academia and industry. By 2013, Lustre was deployed on 7
out of the top 10 and around 60% of the top 100 supercomputers in the world
as listed by the Top500. Several of these support tens of thousands of clients,
tens of petabytes of capacity, and I/O performance of over 1 TB/s [16, 6].
8.2 Design and Architecture
8.2.1 Overview
Lustre is a Linux file system implemented entirely in the kernel. Its ar-
chitecture is founded upon distributed object-based storage. This delegates
block storage management to its back-end servers and eliminates significant
scaling and performance issues associated with the consistent management of
distributed block storage metadata.
Lustre objects come in two varieties: data objects, which are simple byte
arrays typically used to store the data of POSIX files, and index objects,
which are key-value stores typically used to implement POSIX directories.
These objects are implemented by the Lustre Object Storage Device (OSD),
an abstraction that enables the use of different back-end file systems, including
ext4 and ZFS. A single OSD instance corresponds to a single back-end storage
volume and is termed a storage target. The storage target depends on the
underlying file system for resilience to storage device failure, but may be
instantiated on any server that can attach to this storage to provide high
availability in the event of server or controller failure.
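To make this distinction concrete, the following Python sketch models the two
object varieties and a storage target that holds them. It is purely
illustrative and does not reflect Lustre's actual OSD interface; all class,
method, and identifier names here are invented for the example.

    class DataObject:
        """A data object: a simple byte array, as used for POSIX file data."""
        def __init__(self):
            self._bytes = bytearray()

        def write(self, offset, buf):
            end = offset + len(buf)
            if end > len(self._bytes):
                self._bytes.extend(b"\0" * (end - len(self._bytes)))
            self._bytes[offset:end] = buf

        def read(self, offset, length):
            return bytes(self._bytes[offset:offset + length])

    class IndexObject:
        """An index object: a key-value store, as used for POSIX directories."""
        def __init__(self):
            self._entries = {}

        def insert(self, name, object_id):
            self._entries[name] = object_id

        def lookup(self, name):
            return self._entries.get(name)

    class StorageTarget:
        """One OSD instance backing one storage volume (object IDs invented)."""
        def __init__(self):
            self._objects = {}
            self._next_id = 0

        def create(self, object_class):
            oid = self._next_id
            self._next_id += 1
            self._objects[oid] = object_class()
            return oid

        def get(self, oid):
            return self._objects[oid]

    # Example: a directory entry mapping a file name to its data object.
    target = StorageTarget()
    dir_oid = target.create(IndexObject)
    file_oid = target.create(DataObject)
    target.get(dir_oid).insert("readme.txt", file_oid)
    target.get(file_oid).write(0, b"hello")

In a real deployment these objects are persisted by the back-end file system
chosen for the target, such as ext4 or ZFS, rather than held in memory.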
Storage targets are exported either as metadata targets (MDTs), used for
file system namespace operations, or object targets (OSTs), used to store file
data. These are usually exported by servers configured specifically for their
respective metadata or data workloads: for example, RAID 10 storage hardware
and high core counts for metadata servers (MDSs), and high-capacity RAID 6
storage hardware and lower core counts for object storage servers (OSSs).
Historically, Lustre clusters have consisted of a pair of MDS nodes configured
for active-passive failover and multiple OSSs configured for active-active
failover.
More recent Lustre releases support multiple MDTs in the same file system, so
multiple MDS nodes configured for active-active failover are expected to
become more common.
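From a client's point of view, the MDTs and OSTs that make up a mounted file
system can be listed with the standard lfs df command. The short Python sketch
below simply drives that command and prints the target lines; the mount point
/mnt/lustre is an assumption for the example.

    import subprocess

    # List the metadata targets (MDTs) and object storage targets (OSTs) of a
    # mounted Lustre file system; /mnt/lustre is an assumed mount point.
    out = subprocess.run(["lfs", "df", "-h", "/mnt/lustre"],
                         capture_output=True, text=True, check=True).stdout
    for line in out.splitlines():
        if "MDT" in line or "OST" in line:
            print(line)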
Lustre clients and servers communicate with each other using a layered
communications stack. The underlying physical and/or logical networks such
 