Lustre - High Performance Parallel I/O

8.2.9 Recovery

Lustre exploits caches on both client and server to hide disk and network latency for performance-critical operations. Since failures can prevent these caches from being flushed, Lustre clients and storage targets retain stateful connections to ensure that transactions that are completed but not yet committed at the storage target are replayed transparently in the event of server failure and subsequent restart or failover.

The client-side state contains a list of completed but uncommitted RPCs, ordered by the server's transaction sequence number; a list of incomplete RPCs, ordered by the client execution ID (XID); and the locks held by the client at each storage target. When clients reconnect to the storage target after a service interruption, these RPCs and locks are replayed to complete recovery and allow the storage target to continue service. Storage targets also retain a list of currently connected clients in the last_rcvd file. For a newly connected client, the target makes this record persistent before the client's first RPC that modifies data or metadata completes, so that the target knows which clients may later need to participate in recovery.
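
The client-side bookkeeping described above can be pictured with a small sketch. It is illustrative only: the class and method names (`Rpc`, `ClientTargetState`, `sent`, `replied`, `committed`, `recovery_plan`) are hypothetical and do not correspond to Lustre source code, and the sketch assumes the client learns from the server which transaction number was committed last, so that it can discard RPCs it no longer needs to replay.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Rpc:
    xid: int                       # client-assigned execution ID
    transno: Optional[int] = None  # server transaction number, known after the reply


@dataclass
class ClientTargetState:
    """Hypothetical recovery state a client keeps for one storage target."""
    replay: List[Rpc] = field(default_factory=list)  # completed, not yet committed
    resend: List[Rpc] = field(default_factory=list)  # sent, no reply received
    locks: List[str] = field(default_factory=list)   # locks held at this target

    def sent(self, rpc: Rpc) -> None:
        self.resend.append(rpc)

    def replied(self, xid: int, transno: int) -> None:
        # The server executed the RPC: it is complete but may not be on disk yet.
        rpc = next(r for r in self.resend if r.xid == xid)
        self.resend.remove(rpc)
        rpc.transno = transno
        self.replay.append(rpc)

    def committed(self, last_committed: int) -> None:
        # Assumption: the server reports its last committed transaction number.
        self.replay = [r for r in self.replay if r.transno > last_committed]

    def recovery_plan(self) -> Tuple[List[Rpc], List[str], List[Rpc]]:
        # Completed RPCs ordered by transaction number, held locks,
        # and incomplete RPCs ordered by XID.
        return (sorted(self.replay, key=lambda r: r.transno),
                list(self.locks),
                sorted(self.resend, key=lambda r: r.xid))
```

Keeping the two RPC lists separate reflects the two orderings in the text: a completed RPC has a server-assigned transaction number to sort by, while an incomplete one only has the client's XID.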

When a storage target starts up, it notifies the MGS, which in turn notifies all clients that were connected to the last running instance of the storage target. In the event that the MGS is not present or unresponsive, clients discover the restart for themselves, albeit with substantially increased latency, by attempting to connect to all previously configured target NIDs in turn until successful.
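
The fallback discovery path, in which a client walks its configured NIDs until one answers, might look roughly like the following. This is a sketch under stated assumptions: `find_target`, `try_connect`, and `max_rounds` are hypothetical names, the NID strings are made-up examples, and a real client would wait for a connection timeout on each failed attempt, which is where the extra latency comes from.

```python
from typing import Callable, Optional, Sequence, Tuple


def find_target(nids: Sequence[str],
                try_connect: Callable[[str], bool],
                max_rounds: int = 3) -> Tuple[Optional[str], int]:
    """Try each configured NID in turn; return the first that accepts a
    connection, together with the number of attempts that were needed."""
    attempts = 0
    for _ in range(max_rounds):
        for nid in nids:
            attempts += 1
            if try_connect(nid):
                return nid, attempts
    return None, attempts


# Usage with a stand-in for the network: only the failover node is up.
live = {"10.0.0.2@tcp"}
nid, attempts = find_target(["10.0.0.1@tcp", "10.0.0.2@tcp"],
                            lambda n: n in live)
```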

The storage target then waits for all clients that were connected at the time of failure to reconnect, replay all uncommitted RPCs, and re-acquire in-use locks. Locks that are not currently protecting any cached state are dropped to reduce the time needed for lock recovery. Recovery time is bounded to ensure that unresponsive clients cannot hold up recovery indefinitely. A tighter bound is used when the MGS is able to assist with restart notification.
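
The effect of the bounded recovery window can be illustrated with a toy model. The function name `recovery_window`, its parameters, and the 30-second deadline in the usage lines are assumptions made for illustration; the sketch only models which clients make it into recovery before the deadline and when the wait ends, not how Lustre actually computes its timeouts.

```python
from typing import Dict, Set, Tuple


def recovery_window(expected: Set[str],
                    reconnect_times: Dict[str, float],
                    deadline: float) -> Tuple[Set[str], Set[str], float]:
    """Classify the clients recorded before the failure.

    reconnect_times maps a client to the number of seconds after restart
    at which it reconnected; clients that never reconnect are absent.
    """
    recovered = {c for c in expected
                 if reconnect_times.get(c, float("inf")) <= deadline}
    missing = set(expected) - recovered
    if missing:
        # At least one client never showed up: the wait runs to the deadline.
        finished_at = deadline
    else:
        # Everyone is back: the wait ends when the last client arrives.
        finished_at = max((reconnect_times[c] for c in recovered), default=0.0)
    return recovered, missing, finished_at


# Client "c" never reconnects, so the target waits out the full window.
rec, missing, t = recovery_window({"a", "b", "c"}, {"a": 2.0, "b": 5.0}, 30.0)
```

A tighter deadline shortens the worst case (a missing client) without affecting the common case in which every client returns promptly.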

Replay of executed but uncommitted RPCs is then conducted in strict transaction order to ensure that transaction dependencies within and between clients are respected. Replayed RPCs may also include the versions (previous transaction numbers) of the objects they updated to allow consistent replay of isolated operations even if unrelated operations were lost due to concurrent client failure (causing a gap in the transaction sequence numbers).
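
A simplified model of version-checked replay is sketched below. It is not the Lustre implementation: `Replay`, `replay_all`, and the single-object-per-RPC shape are assumptions chosen to keep the example small. Each replayed RPC carries the version (previous transaction number) of the object it updated, and the server applies it only if the object is still at that version.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class Replay:
    transno: int      # transaction number assigned at original execution
    obj: str          # object the RPC updated (one per RPC in this sketch)
    pre_version: int  # transaction number of the previous update to obj


def replay_all(versions: Dict[str, int],
               per_client: Iterable[List[Replay]]) -> Tuple[List[int], List[int]]:
    """Merge all clients' replays into transaction order and apply those
    whose recorded pre-version matches the object's current version."""
    applied: List[int] = []
    rejected: List[int] = []
    merged = sorted((r for rpcs in per_client for r in rpcs),
                    key=lambda r: r.transno)
    for r in merged:
        if versions.get(r.obj, 0) == r.pre_version:
            versions[r.obj] = r.transno   # object now carries the new version
            applied.append(r.transno)
        else:
            rejected.append(r.transno)    # depends on an update that was lost
    return applied, rejected


# Committed state on disk: f1 at version 5, f2 at version 7.
versions = {"f1": 5, "f2": 7}
client_a = [Replay(10, "f1", 5), Replay(13, "f2", 7)]
# Transaction 11 (an update to f1) belonged to a client that also failed.
client_c = [Replay(12, "f1", 11)]
applied, rejected = replay_all(versions, [client_a, client_c])
```

In this run transaction 12 is rejected because it built on the lost transaction 11, while transaction 13 is applied despite the gap because it touches an unrelated object.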

Next, clients replay their held locks, which also recovers the state of open files on the MDS. Finally, clients resend incomplete RPCs in XID order. Since these RPCs may or may not have been committed by the server before the failure, the server guarantees idempotence by comparing the RPC XID against that stored for the client in the last_rcvd file to determine whether the operation was previously committed. If so, the server does not execute the update again and instead reconstructs the RPC reply message from data saved in the last_rcvd file.
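
The XID comparison for resent RPCs can be sketched as follows. The layout is an assumption for illustration: `LastRcvdSlot` stands in for a per-client record holding the XID and reply data of the last committed operation, and `handle_resend` is a hypothetical helper, not a Lustre function.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass
class LastRcvdSlot:
    """Hypothetical per-client record of the last committed operation."""
    last_xid: int = 0
    last_reply: str = ""


def handle_resend(slots: Dict[str, LastRcvdSlot],
                  client: str,
                  xid: int,
                  execute: Callable[[], str]) -> Tuple[str, bool]:
    """Return (reply, executed). A resent RPC whose XID matches the stored
    one is answered from saved data instead of being executed again."""
    slot = slots.setdefault(client, LastRcvdSlot())
    if xid == slot.last_xid:
        return slot.last_reply, False     # reply reconstructed, no update
    reply = execute()                     # not seen before: run the update
    slot.last_xid, slot.last_reply = xid, reply
    return reply, True


# XID 41 was committed before the failure; XID 42 was not.
slots = {"c1": LastRcvdSlot(last_xid=41, last_reply="ok:41")}
calls = []


def run_update() -> str:
    calls.append(1)
    return "ok:42"


first = handle_resend(slots, "c1", 41, run_update)
second = handle_resend(slots, "c1", 42, run_update)
third = handle_resend(slots, "c1", 42, run_update)
```

The update runs once even though XID 42 arrives twice, which is the idempotence property the text describes.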

Clients that do not participate in recovery in a timely manner are evicted from the server and their uncommitted operations are lost. Dependent
