Databases Reference
In-Depth Information
terfaces (as proposed for Sourcerer [ 24 ], for instance), but this is still too limited
for precise signature- and API-based searches. Consider the previous request as
an example. Even if there were additional fields for parameters and return types
in the same style as before, the maximum possible precision of a query of the form -
method:add AND method:sub AND param:int AND return:int
would retrieve all artifacts that have a method add, a method sub and any arbi-
trary method with at least one int parameter and one with int as return type. The
reason for this behavior is - as indicated before - that fields cannot be related to one
another in an FTSF. In other words, because no relational information is present in
the index, the search engine cannot work out whether the specified operation names
and the desired types belong to the same operation. Hence, no higher-level struc-
tural information related to the structure of software artifacts (e.g. class structure,
methods structure, parameter structure etc.) can be utilized for searching. Clearly,
such structural information would be directly available in an RDBMS system (as
e.g. additionally used in Sourcerer for this purpose, cf. [ 35 ] in chapter 8 of this
topic), but for search engines based on an FTSF index alone it is not automatically
present.
Therefore, to support more sophisticated, structure-based queries it is necessary
to employ additional techniques that capture important code-related structure within
the facetted storage approach that underlies FTSF-based indices such as those pre-
sented before. Many superficially appealing techniques do not work in practice on
closer analysis, however. For example, why not directly store complete operation
interfaces (such as e.g. public int doSomethingUseful(int i, String s) ) extracted from
a programming language parser in order to overcome the challenge just described?
Unfortunately, such a relatively simple implementation is not possible since tok-
enization during the indexing would tear the operation signature apart into individ-
ual tokens, which would make it impossible to subsequently distinguish between the
various elements again. The only apparent way to overcome this problem is by forc-
ing the FTSF to deliver only exact matches. However, even if the parameter names
were ignored through the help of wildcards, or not stored at all, the FTSF would
still deliver only those results with identical parameter order so that searches would
still be limited. Another approach that worked in theory was supported by Google
Code Search (which was closed down in early 2012 [ 23 ], however). This not only
supported simple wildcards, but full regular expressions that allowed users to spec-
ify appropriate searches. However, the resulting queries for that purpose were so
complicated that they were barely usable in practice (cf. [ 25 ]).
It is, nevertheless, possible to apply some relatively simple “tricks” to sig-
nificantly enhance the sophistication of the queries that can be supported on a
FTSF-based search engine. In the following we explain how we enhanced the cur-
rent version of the Merobase search engine in this way. The basic idea is to store
operation signatures in a “flattened” form (in database terminology this might be
called a “de-normalized” form) in non-tokenized fields in a way that ignores pa-
rameter orders, i.e. that yields the same entry for a method no matter in what order
the parameters appear (i.e. int, String or String, int ). This is possible by sorting
Search WWH ::




Custom Search