tion greatly simplifies coding of algorithms designed to keep large deep pipes
full, message passing is ideal for RPC request processing, and RMA not only
enables zero-copy for bulk data movement but also limits congestion by giving
control of this movement to the server.
Lustre has to operate in a heterogeneous networking environment where
communications span different network types. Consider a compute cluster with
its own HPC fabric connected via gateway nodes to a site-wide storage facility.
Most efficient use of available network resources occurs when native protocols
can be used on both the HPC fabric and the site-wide storage network. LNet
diverged from Sandia Portals to accommodate this usage model by dividing
the Portals network ID (NID) into a two-level network address including a
network number and an address within that network. LNet therefore includes
a routing subsystem that enables communications to span multiple networks
connected by multiple routers for resilience and scalable throughput.
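To make the two-level address concrete, the sketch below packs a network identifier and a within-network address into a single 64-bit NID-like value. The helper names and the exact bit layout are illustrative assumptions rather than the actual Lustre definitions.

#include <stdint.h>
#include <stdio.h>

/* Illustrative two-level LNet-style address: the upper 32 bits name the
 * network (fabric type plus instance number), the lower 32 bits name the
 * node within that network.  Hypothetical sketch, not the Lustre headers. */
typedef uint64_t nid_example_t;

static nid_example_t make_nid(uint32_t net, uint32_t addr)
{
        return ((uint64_t)net << 32) | addr;
}

static uint32_t nid_net(nid_example_t nid)  { return (uint32_t)(nid >> 32); }
static uint32_t nid_addr(nid_example_t nid) { return (uint32_t)nid; }

int main(void)
{
        /* e.g. node 42 on instance 3 of some fabric type 5 */
        uint32_t net = (5u << 16) | 3u;
        nid_example_t nid = make_nid(net, 42);

        printf("net=%#x addr=%u\n", (unsigned)nid_net(nid), (unsigned)nid_addr(nid));
        return 0;
}

A router only needs the network half of the destination address to decide whether a message is local or must be forwarded toward another network.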
LNet is implemented in two layers. The upper layer implements all
generic communications while the lower layer abstracts physical networks and
network-specific protocols through the Lustre Network Driver (LND). LNet
therefore supports a wide range of networks, including TCP/IP, all OFED-supported fabrics such as InfiniBand, and HPC fabrics with non-standard APIs such as the Cray SeaStar and Gemini networks.
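Conceptually, each LND registers a small table of network-specific operations that the generic layer calls through. The sketch below uses hypothetical structure and function names to illustrate that split; the real Lustre interface differs in detail.

#include <stdio.h>

/* Hypothetical LND operations table: the generic LNet layer calls through
 * these hooks, and each driver (TCP/IP, InfiniBand, Gemini, ...) supplies
 * its own network-specific implementations. */
struct lnd_msg;                               /* opaque message from LNet */

struct lnd_ops_example {
        int  (*startup)(unsigned int net);
        void (*shutdown)(unsigned int net);
        int  (*send)(unsigned int net, struct lnd_msg *msg);
};

/* Stub "TCP/IP driver"; a real socket LND would open connections here. */
static int  tcp_startup(unsigned int net)  { printf("tcp net %u up\n", net); return 0; }
static void tcp_shutdown(unsigned int net) { printf("tcp net %u down\n", net); }
static int  tcp_send(unsigned int net, struct lnd_msg *msg)
{
        (void)net; (void)msg;                 /* would hand msg to a socket */
        return 0;
}

static const struct lnd_ops_example tcp_lnd = {
        .startup = tcp_startup, .shutdown = tcp_shutdown, .send = tcp_send,
};

int main(void)
{
        tcp_lnd.startup(0);                   /* generic layer driving the LND */
        tcp_lnd.send(0, NULL);
        tcp_lnd.shutdown(0);
        return 0;
}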
When the underlying network supports only two-sided communications,
the LND will typically support zero-copy on sends but is forced to copy from
pre-posted network buffers into MDs posted by the upper levels. However,
when the underlying network supports RDMA, the LND implements small
message queues both for small LNet communications and to negotiate RDMA
for large MDs to support zero-copy for both incoming and outgoing bulk data.
These message queues use a system of credits for peer-to-peer communications
to avoid congestion problems associated with having to handle unsolicited
messages.
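A minimal sketch of such credit accounting follows, assuming one credit per receive buffer pre-posted by the peer; the names are hypothetical, not Lustre's implementation.

#include <stdbool.h>

/* Hypothetical per-peer send credits: the count mirrors the receive
 * buffers the peer has pre-posted for us, so no message is ever sent
 * that the peer has no buffer waiting to receive. */
struct peer_example {
        int credits;                 /* sends we may still post to this peer */
};

/* Take a credit before posting a message; if none remain, the caller
 * queues the message until a credit is returned. */
static bool consume_credit(struct peer_example *p)
{
        if (p->credits > 0) {
                p->credits--;
                return true;
        }
        return false;
}

/* Called when the peer signals it has consumed the message and re-posted
 * the receive buffer, freeing room for another send. */
static void return_credit(struct peer_example *p)
{
        p->credits++;
}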
LNDs may also be tuned to support "long fat pipes" efficiently by increasing message queue depth and RDMA concurrency, and by optionally mapping RDMA buffers on demand to reduce RDMA fragmentation on the wire. These optimizations enable Lustre to operate efficiently over wide-area networks.
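Such tunables are typically surfaced as driver parameters. The fragment below shows how an LND-style kernel module might declare them; the parameter names and defaults here are illustrative assumptions, not the exact options of any particular LND.

#include <linux/module.h>
#include <linux/moduleparam.h>

/* Hypothetical wide-area tunables for an RDMA-capable LND. */
static int peer_credits = 8;            /* per-peer message queue depth */
module_param(peer_credits, int, 0444);
MODULE_PARM_DESC(peer_credits, "message sends in flight per peer");

static int concurrent_sends = 8;        /* RDMA transfers in flight per peer */
module_param(concurrent_sends, int, 0444);
MODULE_PARM_DESC(concurrent_sends, "concurrent RDMA operations per peer");

static int map_on_demand = 0;           /* 0 = premap; >0 = map lazily */
module_param(map_on_demand, int, 0444);
MODULE_PARM_DESC(map_on_demand, "map RDMA buffers on demand to reduce fragmentation");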
8.2.2.2 RPC
Lustre's PtlRPC layer is designed to support efficient communications between clients and servers and, most significantly, to maximize server-side control over network utilization to minimize the problem of congestion. This is achieved by ensuring that the only unsolicited message a client may send is the initial RPC request. All subsequent communication, including bulk data transfer and the final RPC reply, is initiated by the server.
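The ordering this implies on the client side is sketched below; post_buffer, send_request, and wait_for_reply are hypothetical stand-ins for the PtlRPC/LNet primitives. The point is only that the reply and bulk buffers are exposed before the single unsolicited request goes out, after which the server drives everything.

#include <stdio.h>

/* Hypothetical stand-ins for the real PtlRPC/LNet primitives. */
static void post_buffer(const char *name)  { printf("posted %s\n", name); }
static void send_request(const char *name) { printf("sent %s\n", name); }
static int  wait_for_reply(void)           { printf("server drives the rest\n"); return 0; }

/* Client side of a Lustre-style RPC: reply and bulk buffers are exposed
 * first, then the one unsolicited message (the request) is sent; bulk
 * transfer and the final reply are both initiated by the server. */
static int client_rpc_example(void)
{
        post_buffer("reply buffer");    /* only the target server may fill it    */
        post_buffer("bulk buffer");     /* server will move bulk data here later */
        send_request("RPC request");    /* the only unsolicited client message   */
        return wait_for_reply();        /* completion is signalled by the server */
}

int main(void) { return client_rpc_example(); }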
A Lustre RPC progresses in phases. First, the client must create MDs for the request and reply buffers and any bulk data buffers. MEs for the bulk and reply MDs are then attached for access only by the server targeted by
 