Adapting Grammar Inference Techniques to Mine State Machines - Mining Software Specifications: Methodologies and Applications

Databases Reference

In-Depth Information

it is in practice rarely possible to obtain an accurate result. These problems

are elaborated below.

(1) Traces only show what sequences of actions are possible, and

not what isimpossible. For the inferred state machine to be accurate, it

must not only incorporate every valid sequence of traces, but also omit ev-

ery in-valid sequence. In other words, correctly inferring its prescribed and

proscribed languages is equally important. The conventional k-tails algorithm

assumes that all traces represent valid system executions (belong to the pre-

scribed language of the underlying system). There is no guidance for the infer-

ence algorithm to describe which sequences are invalid. As a result, whenever

the inference algorithm attempts to produce a machine that generalizes from

the supplied set of traces, it runs a big risk of falsely inferring that invalid

sequences are valid, because there is no evidence to the contrary.

(2) The k-Tails merging rule is prone to erroneous merges. The

k-tails technique described above forms the basis for the majority of current

software state machine inference techniques. Its weakness is due to the fact

that just because two states share a tail of length k does not necessarily mean

that they are equivalent. Furthermore, a pair of states that do not share a

k-tail are also not necessarily distinct. This is particularly notable when the

set of traces is underpopulated (as is usually the case in a practical reverse-

engineering scenario). Finally its use depends on the selection of some value k,

which is critical to the outcome, but for which there is no way of determining

which value is correct. Higher values may lead to the avoidance of erroneous

merges but also rule out merges that should happen. Conversely, lower values

may lead to erroneous merges, but will end up incorporating valid merges as

well.

(3) The selection of traces is critical to the accuracy of the in-

ferred machine. If the set of traces is incomplete, it is impossible for an

algorithm to ensure that those states that are merged are in fact equivalent

(and that unmerged states are actually distinct). The problem is that, in the

context of the conventional state machine inference scenario, there is no no-

tion of what constitutes a complete set of traces. In his influential work on

grammar inference, Gold proved [16] that the inference of a grammar from

positive samples alone was undecidable { a principle that also applies to our

state machine inference scenario (his proof is succinctly summarized by Pul-

lum [29]). Thus, a complete set of traces must somehow include traces that are

impossible or invalid, which belong to the proscribed language of the software

system. For certain state-merging algorithms such as the RPNI algorithm [26]

(which accounts for negative samples as well as positive ones), it has been

shown that a set of traces is complete if they are characteristic of the target

system. Informally, this means that the set of traces must contain traces that

cover every transition in the target system, as well as a suciently large set

of positive and negative samples to distinguish between every pair of distinct

states to avoid erroneous merges. A characteristic sample is usually vast and,

even if we do find a way to collect negative traces, identifying and collecting a

Mining Software Specifications: Methodologies and Applications

Search WWH ::

Custom Search

Home