Databases Reference
In-Depth Information
only maximal subsequences, i.e., it must be the case that every sequence(si) i )
in S is not a subsequence of any other sequence present in S.
For example, if the set of sequences is given by if ( a b c e ),
( a d c e ),( d a c e ),( a c d e f ),
( e f d c a ) g, a sequence miner detects ( a c e ) as a
frequently occurring subsequence. Observe that the same set of transactions
without ordering in frequent itemset mining would generate the set f a,c,
d,e g. For our application, a transaction corresponds to a call-site and the
sequence within a transaction corresponds to the sequence of procedure calls
that occurred before the call-site. Our implementation uses the Apriori-all al-
gorithm by Agrawal and Srikant [2], which is known to scale to over a million
sequences. For the example shown in Figure 9.8, we generate the specification
allocbuf lock readbuf .
9.2.5.2
The Structural Similarity Problem
In Figure 9.8, the names of the variables are the same across multiple
call-sites whereas this does not hold in real programs (as noted earlier in
this section). In other words, as discussed earlier, predicates computed along
different paths may share structural, if not syntactic, similarities. In order
to capture such similarities, a technique to determine the locations, names,
values, etc. that can be abstracted uniformly among different sets is necessary.
Consider every predicate expression as being mapped to a set of locations.
Thus, assume a set of location sets,
fL 1 = f` 11 ;` 12 ;:::` 1m 1 g;L 2 = f` 21 ;` 22 ;:::` 2m 2 g; :::;L k = f` k1 ;` k2 ;:::` 1m k gg
where L i corresponds to locations associated with predicates that reach
call-site i of procedure P. Now, for every element in Li, i , we wish to find a
corresponding element in every other L j such that the cumulative matching
of the attribute sets for such a mapping is maximal. Given three sets A;B;
and C, we say A and B match maximally, if and only if j A\B j is greater
than j A\C j or j B\C j.
Theorem 1. The maximal matching problem as stated above is NP-hard.
Proof By reduction from maximal bipartite (k;)-clique in a bipartite graph
[23, 45].
Fortunately, there are a number of heuristics that can be employed to
map locations based on semantic information available within programs. We
describe below heuristics that match the attribute sets across multiple call-
sites that we have used in our implementation.
Type: Attribute sets can be divided based on the type of variable. e.g.,
two variables, x and y with attributes [ x : f(:=, true )g] and [ y : f(>,
20)g] can never be matched.
 
Search WWH ::




Custom Search