Artifact Representation Techniques for Large-Scale Software Search Engines - Finding Source Code on the Web for Remix and Reuse

Information retrieval and descriptive methods basically cover the text-based and

bibliographic retrieval approaches introduced before. While the former focus on

keyword matching “within” the artifacts, the latter usually rely on externally

obtained metadata (such as language, domain, etc.) for retrieval. In general, Mili et al.

characterize descriptive methods as a subset of information retrieval methods, but

since this family of approaches was so widely used at the time of their survey they

decided to list it as a separate category. Denotational semantics methods drive the

retrieval process based on signatures (see e.g. [ 11 ]) or formal specifications [ 13 ]

of the indexed assets. While signature matching is considered useful in practice,

software retrieval based on formal specifications suffers from a variety of disadvantages.

For instance, the specifications are difficult to create and evaluate (e.g. due

to the complexity of the associated decision problems) and cause a significant creation

and maintenance overhead. Operational semantics approaches use exemplary

input values (or so-called “samples” [ 30 ]) to execute syntactically matching artifacts

contained in a collection. Although they are quite expensive to execute, they

have recently received a lot of attention in association with test-driven reuse approaches

(as described in another chapter of this book [ 10 ]). Structural methods do

not deal with the code of the assets directly but rather with internal program patterns

or designs. Since the formulation of queries for approaches of this class is not

yet well understood, it remains an academic research area for the time being. The

common property of the topological methods, the sixth group listed by Mili et al.,

is that they calculate some kind of “distance” between the query and the results.

Hence, today, they would be better classified as approaches supporting the ranking

of search results.
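
The combination of signature matching and sample-based execution described above can be illustrated with a minimal sketch. It is not taken from any of the cited systems: the candidate functions, the helper names (signature_of, match_signature, passes_samples) and the sample values are hypothetical, and a real engine would execute untrusted candidates in a sandbox rather than calling them directly.

```python
import inspect
from typing import get_type_hints

# Hypothetical candidates standing in for artifacts in an indexed collection.
def add(a: int, b: int) -> int:
    return a + b

def multiply(a: int, b: int) -> int:
    return a * b

def negate(a: int) -> int:
    return -a

CANDIDATES = [add, multiply, negate]

def signature_of(func):
    """Return (parameter types, return type) taken from the type annotations."""
    hints = get_type_hints(func)
    params = tuple(hints.get(name) for name in inspect.signature(func).parameters)
    return params, hints.get("return")

def match_signature(candidates, param_types, return_type):
    """Signature matching: keep candidates whose types equal the query."""
    query = (tuple(param_types), return_type)
    return [f for f in candidates if signature_of(f) == query]

def passes_samples(func, samples):
    """Operational check: run the candidate on sample inputs and compare outputs."""
    for args, expected in samples:
        try:
            if func(*args) != expected:
                return False
        except Exception:
            return False
    return True

# Query: (int, int) -> int, plus two input/output samples describing addition.
by_signature = match_signature(CANDIDATES, (int, int), int)
samples = [((2, 3), 5), ((0, 7), 7)]
by_behaviour = [f for f in by_signature if passes_samples(f, samples)]

print([f.__name__ for f in by_signature])   # both add and multiply match the signature
print([f.__name__ for f in by_behaviour])   # only add also satisfies the samples
```

The signature filter alone cannot tell addition from multiplication; only executing the samples separates them, which is also where the additional cost of operational approaches arises.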

5.2.2 Limitations

Although this basic classification provided a good starting point, it soon became

clear that modern software collections containing potentially millions of (open

source) artifacts stretch these traditional methods to their limits. Moreover,

since manual indexing is impossible with collections of this size, the precision

of the above approaches is simply not sufficient. Suppose, for example, that a text-based

software search engine is requested to find a reusable stack data structure.

Simply searching for the string stack within the indexed artifacts typically delivers

thousands of results that merely contain this string somewhere in their source code.

Thus, many of the delivered results will not actually be stacks but may merely use

a stack somewhere in their implementation. The same holds true for pure signature

matching techniques that can also deliver thousands of results for sufficiently

generic signatures (more examples and some preliminary investigation results on

this can be found in a previous publication [ 26 ]).
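
The precision problem in the stack example can be reproduced on a toy scale. The following sketch assumes a hypothetical three-file corpus and two made-up helper functions; it contrasts plain substring search with a search that only accepts artifacts defining a class whose name contains the term. It is an illustration only, not the technique used by any of the engines cited here.

```python
import ast

# Hypothetical miniature corpus: file name -> source text.
CORPUS = {
    "stack.py": """
class Stack:
    def __init__(self):
        self.items = []

    def push(self, x):
        self.items.append(x)

    def pop(self):
        return self.items.pop()
""",
    "parser.py": """
def parse(tokens):
    stack = []  # operand stack used internally
    for t in tokens:
        stack.append(t)
    return stack
""",
    "logger.py": """
def log_error(exc):
    # print the stack trace of the error
    print(exc)
""",
}

def keyword_search(corpus, term):
    """Text-based retrieval: any artifact containing the term anywhere."""
    return sorted(name for name, src in corpus.items() if term.lower() in src.lower())

def class_name_search(corpus, term):
    """Structure-aware retrieval: only artifacts that define a matching class."""
    hits = []
    for name, src in corpus.items():
        tree = ast.parse(src)
        if any(
            isinstance(node, ast.ClassDef) and term.lower() in node.name.lower()
            for node in ast.walk(tree)
        ):
            hits.append(name)
    return sorted(hits)

print(keyword_search(CORPUS, "stack"))     # all three files mention the string
print(class_name_search(CORPUS, "stack"))  # only stack.py defines a Stack class
```

Even in this tiny corpus, two of the three keyword hits merely use or mention a stack, which mirrors the behaviour described above for collections with millions of artifacts.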

In order to support more practical software search use cases (as listed, e.g., by

Sim et al. [ 12 ] or, more recently, by Janjic et al. [ 31 ]), more precise and specialized

query possibilities are urgently required. In particular, more sophisticated types of

queries that allow the form of software artifacts to be taken into account are needed.
