Artifact Representation Techniques for Large-Scale Software Search Engines - Finding Source Code on the Web for Remix and Reuse

Databases Reference

In-Depth Information

terfaces (as proposed for Sourcerer [ 24 ], for instance), but this is still too limited

for precise signature- and API-based searches. Consider the previous request as

an example. Even if there were additional fields for parameters and return types

in the same style as before, the maximum possible precision of a query of the form -

method:add AND method:sub AND param:int AND return:int

would retrieve all artifacts that have a method add, a method sub and any arbi-

trary method with at least one int parameter and one with int as return type. The

reason for this behavior is - as indicated before - that fields cannot be related to one

another in an FTSF. In other words, because no relational information is present in

the index, the search engine cannot work out whether the specified operation names

and the desired types belong to the same operation. Hence, no higher-level struc-

tural information related to the structure of software artifacts (e.g. class structure,

methods structure, parameter structure etc.) can be utilized for searching. Clearly,

such structural information would be directly available in an RDBMS system (as

e.g. additionally used in Sourcerer for this purpose, cf. [ 35 ] in chapter 8 of this

topic), but for search engines based on an FTSF index alone it is not automatically

present.

Therefore, to support more sophisticated, structure-based queries it is necessary

to employ additional techniques that capture important code-related structure within

the facetted storage approach that underlies FTSF-based indices such as those pre-

sented before. Many superficially appealing techniques do not work in practice on

closer analysis, however. For example, why not directly store complete operation

interfaces (such as e.g. public int doSomethingUseful(int i, String s) ) extracted from

a programming language parser in order to overcome the challenge just described?

Unfortunately, such a relatively simple implementation is not possible since tok-

enization during the indexing would tear the operation signature apart into individ-

ual tokens, which would make it impossible to subsequently distinguish between the

various elements again. The only apparent way to overcome this problem is by forc-

ing the FTSF to deliver only exact matches. However, even if the parameter names

were ignored through the help of wildcards, or not stored at all, the FTSF would

still deliver only those results with identical parameter order so that searches would

still be limited. Another approach that worked in theory was supported by Google

Code Search (which was closed down in early 2012 [ 23 ], however). This not only

supported simple wildcards, but full regular expressions that allowed users to spec-

ify appropriate searches. However, the resulting queries for that purpose were so

complicated that they were barely usable in practice (cf. [ 25 ]).

It is, nevertheless, possible to apply some relatively simple “tricks” to sig-

nificantly enhance the sophistication of the queries that can be supported on a

FTSF-based search engine. In the following we explain how we enhanced the cur-

rent version of the Merobase search engine in this way. The basic idea is to store

operation signatures in a “flattened” form (in database terminology this might be

called a “de-normalized” form) in non-tokenized fields in a way that ignores pa-

rameter orders, i.e. that yields the same entry for a method no matter in what order

the parameters appear (i.e. int, String or String, int ). This is possible by sorting

Finding Source Code on the Web for Remix and Reuse

Search WWH ::

Custom Search

Home