tion greatly simplifies coding of algorithms designed to keep large deep pipes
full, message passing is ideal for RPC request processing, and RMA not only
enables zero-copy for bulk data movement but also limits congestion by giving
control of this movement to the server.
Lustre has to operate in a heterogeneous networking environment where
communications span different network types. Consider a compute cluster with
its own HPC fabric connected via gateway nodes to a site-wide storage facility.
Most efficient use of available network resources occurs when native protocols
can be used on both the HPC fabric and the site-wide storage network. LNet
diverged from Sandia Portals to accommodate this usage model by dividing
the Portals network ID (NID) into a two-level network address including a
network number and an address within that network. LNet therefore includes
a routing subsystem that enables communications to span multiple networks
connected by multiple routers for resilience and scalable throughput.
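To make the two-level address concrete, the sketch below packs a network identifier and a within-network address into a single 64-bit NID-like value. The helper names and the exact bit layout are illustrative assumptions rather than the actual Lustre definitions.

#include <stdint.h>
#include <stdio.h>

/* Illustrative two-level LNet-style address: the upper 32 bits name the
 * network (fabric type plus instance number), the lower 32 bits name the
 * node within that network.  Hypothetical sketch, not the Lustre headers. */
typedef uint64_t nid_example_t;

static nid_example_t make_nid(uint32_t net, uint32_t addr)
{
        return ((uint64_t)net << 32) | addr;
}

static uint32_t nid_net(nid_example_t nid)  { return (uint32_t)(nid >> 32); }
static uint32_t nid_addr(nid_example_t nid) { return (uint32_t)nid; }

int main(void)
{
        /* e.g. node 42 on instance 3 of some fabric type 5 */
        uint32_t net = (5u << 16) | 3u;
        nid_example_t nid = make_nid(net, 42);

        printf("net=%#x addr=%u\n", (unsigned)nid_net(nid), (unsigned)nid_addr(nid));
        return 0;
}

A router only needs the network half of the destination address to decide whether a message is local or must be forwarded toward another network.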
LNet is implemented in two layers. The upper layer implements all
generic communications while the lower layer abstracts physical networks and
network-specific protocols through the Lustre Network Driver (LND). LNet
therefore supports a wide range of networks, including TCP/IP, all OFED-supported fabrics such as InfiniBand, and HPC fabrics with non-standard APIs such as the Cray SeaStar and Gemini networks.
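Conceptually, each LND registers a small table of network-specific operations that the generic layer calls through. The sketch below uses hypothetical structure and function names to illustrate that split; the real Lustre interface differs in detail.

#include <stdio.h>

/* Hypothetical LND operations table: the generic LNet layer calls through
 * these hooks, and each driver (TCP/IP, InfiniBand, Gemini, ...) supplies
 * its own network-specific implementations. */
struct lnd_msg;                               /* opaque message from LNet */

struct lnd_ops_example {
        int  (*startup)(unsigned int net);
        void (*shutdown)(unsigned int net);
        int  (*send)(unsigned int net, struct lnd_msg *msg);
};

/* Stub "TCP/IP driver"; a real socket LND would open connections here. */
static int  tcp_startup(unsigned int net)  { printf("tcp net %u up\n", net); return 0; }
static void tcp_shutdown(unsigned int net) { printf("tcp net %u down\n", net); }
static int  tcp_send(unsigned int net, struct lnd_msg *msg)
{
        (void)net; (void)msg;                 /* would hand msg to a socket */
        return 0;
}

static const struct lnd_ops_example tcp_lnd = {
        .startup = tcp_startup, .shutdown = tcp_shutdown, .send = tcp_send,
};

int main(void)
{
        tcp_lnd.startup(0);                   /* generic layer driving the LND */
        tcp_lnd.send(0, NULL);
        tcp_lnd.shutdown(0);
        return 0;
}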
When the underlying network supports only two-sided communications,
the LND will typically support zero-copy on sends but is forced to copy from
pre-posted network buffers into MDs posted by the upper levels. However,
when the underlying network supports RDMA, the LND implements small
message queues both for small LNet communications and to negotiate RDMA
for large MDs to support zero-copy for both incoming and outgoing bulk data.
These message queues use a system of credits for peer-to-peer communications
to avoid congestion problems associated with having to handle unsolicited
messages.
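A minimal sketch of such credit accounting follows, assuming one credit per receive buffer pre-posted by the peer; the names are hypothetical, not Lustre's implementation.

#include <stdbool.h>

/* Hypothetical per-peer send credits: the count mirrors the receive
 * buffers the peer has pre-posted for us, so no message is ever sent
 * that the peer has no buffer waiting to receive. */
struct peer_example {
        int credits;                 /* sends we may still post to this peer */
};

/* Take a credit before posting a message; if none remain, the caller
 * queues the message until a credit is returned. */
static bool consume_credit(struct peer_example *p)
{
        if (p->credits > 0) {
                p->credits--;
                return true;
        }
        return false;
}

/* Called when the peer signals it has consumed the message and re-posted
 * the receive buffer, freeing room for another send. */
static void return_credit(struct peer_example *p)
{
        p->credits++;
}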
LNDs may also be tuned to support "long fat pipes" efficiently by increasing message queue depth and RDMA concurrency, and by optionally mapping RDMA buffers on demand to reduce RDMA fragmentation on the wire. These optimizations enable Lustre to operate efficiently over wide-area networks.
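Such tunables are typically surfaced as driver parameters. The fragment below shows how an LND-style kernel module might declare them; the parameter names and defaults here are illustrative assumptions, not the exact options of any particular LND.

#include <linux/module.h>
#include <linux/moduleparam.h>

/* Hypothetical wide-area tunables for an RDMA-capable LND. */
static int peer_credits = 8;            /* per-peer message queue depth */
module_param(peer_credits, int, 0444);
MODULE_PARM_DESC(peer_credits, "message sends in flight per peer");

static int concurrent_sends = 8;        /* RDMA transfers in flight per peer */
module_param(concurrent_sends, int, 0444);
MODULE_PARM_DESC(concurrent_sends, "concurrent RDMA operations per peer");

static int map_on_demand = 0;           /* 0 = premap; >0 = map lazily */
module_param(map_on_demand, int, 0444);
MODULE_PARM_DESC(map_on_demand, "map RDMA buffers on demand to reduce fragmentation");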
8.2.2.2 RPC
Lustre's PtlRPC layer is designed to support efficient communications between clients and servers and, most significantly, to maximize server-side control over network utilization to minimize the problem of congestion. This is achieved by ensuring that the only unsolicited message a client may send is the initial RPC request. All subsequent communication, including bulk data transfer and the final RPC reply, is initiated by the server.
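The ordering this implies on the client side is sketched below; post_buffer, send_request, and wait_for_reply are hypothetical stand-ins for the PtlRPC/LNet primitives. The point is only that the reply and bulk buffers are exposed before the single unsolicited request goes out, after which the server drives everything.

#include <stdio.h>

/* Hypothetical stand-ins for the real PtlRPC/LNet primitives. */
static void post_buffer(const char *name)  { printf("posted %s\n", name); }
static void send_request(const char *name) { printf("sent %s\n", name); }
static int  wait_for_reply(void)           { printf("server drives the rest\n"); return 0; }

/* Client side of a Lustre-style RPC: reply and bulk buffers are exposed
 * first, then the one unsolicited message (the request) is sent; bulk
 * transfer and the final reply are both initiated by the server. */
static int client_rpc_example(void)
{
        post_buffer("reply buffer");    /* only the target server may fill it    */
        post_buffer("bulk buffer");     /* server will move bulk data here later */
        send_request("RPC request");    /* the only unsolicited client message   */
        return wait_for_reply();        /* completion is signalled by the server */
}

int main(void) { return client_rpc_example(); }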
A Lustre RPC progresses in phases. First, the client must create MDs for the request and reply buffers and any bulk data buffers. MEs for the bulk and reply MDs are then attached for access only by the server targeted by
 