Test-Driven Reuse: Key to Improving Precision of Search Engines for Software Reuse - Finding Source Code on the Web for Remix and Reuse

(i.e. recall and precision) of these approaches on larger collections so that interested readers are referred to their publication for further details. The estimates they published are as follows:

1. Information retrieval methods (Recall: high/Precision: medium)

2. Descriptive methods (Recall: high/Precision: high)

3. Denotational semantics methods (Recall: high/Precision: very high)

4. Operational semantics methods (Recall: high/Precision: very high)

5. Structural methods (Recall: very high/Precision: very high)

6. Topological methods (Recall: unknown¹/Precision: unknown)
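
For reference, recall and precision in these estimates are the usual information retrieval measures. The following is a minimal Python sketch of how they would be computed for a single query; the asset names and the handling of empty sets are assumptions made for illustration and are not taken from Mili et al.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: assets returned by the search engine
    relevant:  assets that actually satisfy the query
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    # Convention assumed here: an empty result set has precision 1.0,
    # and a query without relevant assets has recall 1.0.
    precision = len(hits) / len(retrieved) if retrieved else 1.0
    recall = len(hits) / len(relevant) if relevant else 1.0
    return precision, recall


# Hypothetical asset names, for illustration only.
retrieved = ["StackA", "StackB", "QueueC", "ListD"]
relevant = ["StackA", "StackB", "StackE"]
precision, recall = precision_recall(retrieved, relevant)
```

In this example two of the four retrieved assets are relevant, and two of the three relevant assets were found.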

Software retrieval is a specialization of information retrieval, and hence it makes sense to reuse methods from the latter area to perform a simple, purely text-based retrieval of software assets. Descriptive methods go a small step further and rely on external textual descriptions (i.e. metadata) for an asset. Hence, Mili et al. regard such descriptive methods as a subset of the information retrieval methods, but because this approach is so widely used in practice and in the literature, they created an additional category for it. Denotational semantics methods use signatures (see e.g. [42]) or formal specifications [43] of the indexed assets for retrieval. Signature matching is widely seen as a practical tool in this context, as it uses the parameters and return values exhibited in the interface of an artifact for matching. Software retrieval based upon the matching of formal specifications, in contrast, suffers from a variety of disadvantages (such as difficulties in creating and evaluating them). Operational semantics approaches, which rely on executing the indexed software with sample input values, are certainly expensive to run, but they seem to be easily automatable. Although appealing in theory, this approach also comes with some practical challenges: side effects, non-termination, the structure of the data types used, dependencies, etc. can cause serious problems. Hence, it is no surprise that the most well-known implementation so far, called Behavior Sampling [30], was merely applied to simple mathematical functions of the C standard library. Finally, structural methods do not deal with the code of the assets directly, but rather with internal program patterns or designs. Since it is largely unclear how to formulate queries for such an approach, it is not surprising that it has only rarely been experimented with.
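
To make the difference between signature matching and an operational, sampling-based check more concrete, the following Python sketch first filters a small candidate pool by signature and then executes the survivors on sample inputs. It is only an illustration under stated assumptions: the candidate functions, the helper names (`signature_of`, `behaves_as_expected`, `retrieve`) and the use of Python type annotations as "signatures" are all hypothetical and are not the implementations described in [30] or [42].

```python
import math
from typing import get_type_hints


# Hypothetical candidate pool standing in for an index of reusable assets.
def root(x: float) -> float:
    return math.sqrt(x)

def square(x: float) -> float:
    return x * x

def half(x: float) -> float:
    return x / 2

def shout(s: str) -> str:
    return s.upper()

CANDIDATES = [root, square, half, shout]


def signature_of(func):
    """Return (parameter types, return type) taken from the annotations."""
    hints = get_type_hints(func)
    return_type = hints.pop("return", None)
    return tuple(hints.values()), return_type


def behaves_as_expected(func, samples):
    """Run func on every sample input and compare with the expected output.

    Note: this sketch does not guard against non-termination or side
    effects, which are among the practical challenges named in the text.
    """
    for args, expected in samples:
        try:
            actual = func(*args)
        except Exception:
            return False
        if not math.isclose(actual, expected):
            return False
    return True


def retrieve(candidates, param_types, return_type, samples):
    # Step 1: signature matching on parameter and return types.
    matching = [f for f in candidates
                if signature_of(f) == (param_types, return_type)]
    # Step 2: sampling, i.e. executing the remaining candidates.
    return [f for f in matching if behaves_as_expected(f, samples)]


# Query: a function float -> float that maps 4.0 to 2.0 and 9.0 to 3.0.
SAMPLES = [((4.0,), 2.0), ((9.0,), 3.0)]
results = retrieve(CANDIDATES, (float,), float, SAMPLES)
names = [f.__name__ for f in results]
```

Here `shout` is rejected by its signature alone, while `square` and `half` share the requested signature and are only ruled out by execution. `half` even passes the first sample, which hints at why the choice of sample inputs matters.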

Overlap between the discussed classes can appear at various places, e.g. between (3), (4) and (5), as the “sampling” of components typically needs a specific signature or structure to work with. As visible in the list, Mili et al. still defined topological methods as an independent class of approaches. However, since their common denominator is the distance between the query and the candidates, we would prefer to describe them as an approach for ranking search results that can (exclusively) be used together with at least one concrete instance of the other approaches.
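
The idea of using a distance purely for ranking can be sketched as follows. The sketch assumes that some other retrieval method has already produced the candidate set, and it uses the Jaccard distance over keyword sets as one possible distance measure; both the measure and the asset names are assumptions chosen for illustration, not something prescribed by Mili et al.

```python
def jaccard_distance(a, b):
    """Distance in [0, 1] between two keyword sets (0 means identical)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)


def rank(query_terms, candidates):
    """Order candidate names by increasing distance to the query.

    candidates: mapping from asset name to its keyword set, assumed to be
    the output of another retrieval method (e.g. signature matching).
    """
    return sorted(candidates,
                  key=lambda name: jaccard_distance(query_terms, candidates[name]))


# Hypothetical query and candidates, for illustration only.
query = {"stack", "push", "pop"}
candidates = {
    "ArrayStack": {"stack", "push", "pop", "peek"},
    "Deque": {"push", "pop", "shift", "unshift"},
    "LinkedStack": {"stack", "push", "pop"},
}
ranking = rank(query, candidates)
```

The distance does not decide which assets are retrieved; it only orders the results that another method has delivered.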

¹ For topological methods it is difficult to define or estimate recall and precision. See [26] for more.
