Databases Reference
In-Depth Information
it is in practice rarely possible to obtain an accurate result. These problems
are elaborated below.
(1) Traces only show what sequences of actions are possible, and
not what isimpossible. For the inferred state machine to be accurate, it
must not only incorporate every valid sequence of traces, but also omit ev-
ery in-valid sequence. In other words, correctly inferring its prescribed and
proscribed languages is equally important. The conventional k-tails algorithm
assumes that all traces represent valid system executions (belong to the pre-
scribed language of the underlying system). There is no guidance for the infer-
ence algorithm to describe which sequences are invalid. As a result, whenever
the inference algorithm attempts to produce a machine that generalizes from
the supplied set of traces, it runs a big risk of falsely inferring that invalid
sequences are valid, because there is no evidence to the contrary.
(2) The k-Tails merging rule is prone to erroneous merges. The
k-tails technique described above forms the basis for the majority of current
software state machine inference techniques. Its weakness is due to the fact
that just because two states share a tail of length k does not necessarily mean
that they are equivalent. Furthermore, a pair of states that do not share a
k-tail are also not necessarily distinct. This is particularly notable when the
set of traces is underpopulated (as is usually the case in a practical reverse-
engineering scenario). Finally its use depends on the selection of some value k,
which is critical to the outcome, but for which there is no way of determining
which value is correct. Higher values may lead to the avoidance of erroneous
merges but also rule out merges that should happen. Conversely, lower values
may lead to erroneous merges, but will end up incorporating valid merges as
well.
(3) The selection of traces is critical to the accuracy of the in-
ferred machine. If the set of traces is incomplete, it is impossible for an
algorithm to ensure that those states that are merged are in fact equivalent
(and that unmerged states are actually distinct). The problem is that, in the
context of the conventional state machine inference scenario, there is no no-
tion of what constitutes a complete set of traces. In his influential work on
grammar inference, Gold proved [16] that the inference of a grammar from
positive samples alone was undecidable { a principle that also applies to our
state machine inference scenario (his proof is succinctly summarized by Pul-
lum [29]). Thus, a complete set of traces must somehow include traces that are
impossible or invalid, which belong to the proscribed language of the software
system. For certain state-merging algorithms such as the RPNI algorithm [26]
(which accounts for negative samples as well as positive ones), it has been
shown that a set of traces is complete if they are characteristic of the target
system. Informally, this means that the set of traces must contain traces that
cover every transition in the target system, as well as a suciently large set
of positive and negative samples to distinguish between every pair of distinct
states to avoid erroneous merges. A characteristic sample is usually vast and,
even if we do find a way to collect negative traces, identifying and collecting a
 
Search WWH ::




Custom Search