Models and Software Model Checking of a Distributed File Replication System - Formal Methods and Hybrid Real-Time Systems

Information Technology Reference

In-Depth Information

impression was that we had the luxury of tackling a relatively well defined prob-

lem; to build a replication system specifically handling features of the file system

NTFS, for replicating files between globally dispersed branch oces of corpora-

tions. Later on, it would turn out that DFS-R could be embedded within other

scenarios, such as, in an instant messenger product. However, we consciously

avoided over-loading with features from the onset. It means that DFS-R, for in-

stance does not replicate files synchronously, only asynchronously (as it is meant

for wide area networks); does not replicate general directed acyclic graphs, only

tree-like structure; and does not maintain fine-grained tracking of operations,

only state. While several such problems are interesting in other contexts, they

did not fall into the scope of our original goals.

The organization of this paper follows the top-down design flow of DFS-R.

The DFS-R system was originally conceived as a strictly state-based file repli-

cation protocol. Section 2 elaborates on the differences between state-based and

operations-based replication systems. We developed a high-level state machine

specification of DFS-R by using a transition system presented as a collection of

guarded commands. The guarded commands were subsequently implemented as

an applicative program in OCaml. This paved the way for performing ecient

state space exploration on top of the design. Section 3 elaborates on the proto-

col, and Section 4 summarizes prototyping experiences. As the development took

place, several assumptions made in the abstract design turned out to be unreal-

istic, and we redid the high-level design using the AsmL tools that were built at

Microsoft for software modeling and test case generation. Section 5 elaborates

on the experiences from using AsmL. A number of well-separated distributed

protocol problems emerged during the development. Section 6 describes the dis-

tributed tree reconciliation problem, and how we used a model checker, Zing, to

expose bugs in both protocol proposals and legacy implementations. Section 7

describes the distributed tombstone garbage collection problem and a solution to

it. While one cannot expect to get anywhere without a high-level understanding

of the protocols involved in DFS-R, it is equally unrealistic to expect develop-

ing a production quality system without addressing systems problems. We were

thus faced with a potentially large gap between simplified protocol substrates

and the production code. Encouraged by the ability of the model-based state

space exploration to expose subtle interaction bugs we repeated the state space

exploration experiment on top of the production core. The resulting backtrack-

ing search tool may best be characterized as a hybrid software model checking,

run-time verification tool. It operates directly at the source code level. It uses

techniques, such as partial order reduction to prune search and custom allocation

routines to enable backtracking search. Section 8 describes the infrastructure we

developed and the experiments covering

1

2

trillion scenarios.

2

File Replication

The style of replication systems under which DFS-R falls into is surveyed ex-

tensively in [1]. We here summarize a few of the main concepts relevant for

Formal Methods and Hybrid Real-Time Systems

Search WWH ::

Custom Search

Home