Information retrieval and descriptive methods cover the text-based and
bibliographic retrieval approaches introduced before. While the former focus on
keyword matching “within” the artifacts, the latter usually rely on externally
obtained metadata (such as language, domain, etc.) for retrieval. In general,
Mili et al. characterize descriptive methods as a subset of information retrieval
methods, but since this family of approaches was so widely used at the time of
their survey, they decided to list it as a separate category. Denotational
semantics methods drive the
retrieval process based on signatures (see e.g. [11]) or on formal specifications
[13] of the indexed assets. While signature matching is considered useful in
practice, software retrieval based on formal specifications suffers from a variety
of disadvantages. For instance, specifications are difficult to write and to
evaluate (e.g. due to the complexity of the associated decision problems), and
they cause significant creation and maintenance overhead. Operational semantics
approaches use exemplary
input values (or so-called “samples” [30]) to execute syntactically matching
artifacts contained in a collection. Although they are quite expensive to execute,
they have recently received a lot of attention in association with test-driven
reuse approaches (as described in another chapter of this book [10]). Structural
methods do not deal with the code of the assets directly but rather with internal
program patterns or designs. Since the formulation of queries for approaches of
this class is not yet well understood, it remains an academic research area for
the time being. The
common property of the topological methods, the sixth group listed by Mili et al.,
is that they calculate some kind of “distance” between the query and the results.
Hence, today they would be better classified as approaches that support the
ranking of search results.
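
To make the difference between three of these families more tangible, the following minimal Python sketch filters a tiny, hypothetical collection first by signature, then by executing samples, and finally ranks candidates by a simple distance measure. The collection, the function names (signature_match, sample_match, rank_by_name) and the use of parameter count as the "signature" are assumptions made purely for illustration; they are not taken from Mili et al. or from any of the cited tools.

```python
import inspect
from difflib import SequenceMatcher


# Hypothetical miniature "collection" of candidate artifacts.
def add(a, b):
    return a + b


def concat(a, b):
    return str(a) + str(b)


def negate(a):
    return -a


COLLECTION = [add, concat, negate]


def signature_match(candidates, arity):
    """Denotational-style filter: keep callables with the requested
    number of parameters (a crude stand-in for a typed signature)."""
    return [f for f in candidates
            if len(inspect.signature(f).parameters) == arity]


def sample_match(candidates, samples):
    """Operational-style filter: execute each candidate on
    (arguments, expected result) samples and keep those that agree."""
    matches = []
    for f in candidates:
        try:
            if all(f(*args) == expected for args, expected in samples):
                matches.append(f)
        except Exception:
            # A candidate that fails on a sample is simply discarded.
            continue
    return matches


def rank_by_name(candidates, query):
    """Topological-style ranking: sort by string similarity between
    the query and the candidate's name (higher similarity first)."""
    return sorted(
        candidates,
        key=lambda f: SequenceMatcher(None, query, f.__name__).ratio(),
        reverse=True,
    )


by_signature = signature_match(COLLECTION, 2)           # add, concat
by_samples = sample_match(by_signature, [((1, 2), 3)])  # add only
ranked = rank_by_name(COLLECTION, "add")                # add comes first
```

In this toy setting the signature filter still admits concat, and only the (more expensive) execution of a sample separates it from add, which mirrors the trade-off described above.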
5.2.2 Limitations
Although this basic classification provided a good starting point, it quickly
became clear that modern software collections containing potentially millions of
(open source) artifacts stretch these traditional methods to their limits.
Moreover, since manual indexing is impossible for collections of this size, the
precision of the above approaches is simply not sufficient. Suppose, for example,
that a text-based software search engine is requested to find a reusable stack
data structure.
Simply searching for the string “stack” within the indexed artifacts typically
delivers thousands of results that merely contain this string somewhere in their
source code. Thus, many of the delivered results will not actually be stacks but
may merely use a stack somewhere in their implementation. The same holds true for
pure signature matching techniques, which can also deliver thousands of results
for sufficiently generic signatures (more examples and some preliminary
investigation results on this can be found in a previous publication [26]).
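
The precision problem of plain keyword matching can be reproduced on a very small scale. The sketch below uses a hypothetical four-entry index in which each artifact is labeled by hand as being a stack or not; the index contents and the helper names keyword_search and precision are illustrative assumptions, not data from the cited investigation [26].

```python
# Hypothetical mini-index: artifact name -> (source text, is it really a stack?)
INDEX = {
    "ArrayStack": (
        "class ArrayStack:\n"
        "    def push(self, x): ...\n"
        "    def pop(self): ...",
        True,
    ),
    "Parser": (
        "class Parser:\n"
        "    def parse(self, s):\n"
        "        stack = []  # operand stack",
        False,
    ),
    "Logger": (
        "class Logger:\n"
        "    def log(self, e):\n"
        "        print(e.stack_trace)",
        False,
    ),
    "Queue": (
        "class Queue:\n"
        "    def enqueue(self, x): ...",
        False,
    ),
}


def keyword_search(index, term):
    """Return every artifact whose source contains the term (case-insensitive)."""
    return [name for name, (source, _) in index.items()
            if term.lower() in source.lower()]


def precision(index, hits):
    """Fraction of returned artifacts that are actually relevant."""
    if not hits:
        return 0.0
    relevant = sum(1 for name in hits if index[name][1])
    return relevant / len(hits)


hits = keyword_search(INDEX, "stack")
score = precision(INDEX, hits)
```

Here three artifacts are returned, but only one of them is a stack; the other two merely use a stack or mention the string in an identifier.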
To support more practical software search use cases (as listed, e.g., by
Sim et al. [12] or, more recently, by Janjic et al. [31]), more precise and
specialized query options are urgently required. In particular, more sophisticated
types of queries are needed that allow the form of software artifacts to be taken
into account.
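
One way to picture a query that takes the form of an artifact into account is to ask for classes that offer a given set of operations, rather than for a string. The following sketch does this for Python source using the standard ast module; the function name classes_with_methods and the sample source are hypothetical, and the approach is only one possible reading of "form", not the query mechanism proposed by the authors.

```python
import ast


def classes_with_methods(source, required):
    """Return the names of classes in `source` that define every
    method name contained in the set `required`."""
    found = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.ClassDef):
            methods = {n.name for n in node.body
                       if isinstance(n, ast.FunctionDef)}
            if required <= methods:
                found.append(node.name)
    return found


# Hypothetical source text containing one real stack and one class
# that merely uses a local variable called "stack".
SOURCE = '''
class ArrayStack:
    def __init__(self):
        self.items = []

    def push(self, x):
        self.items.append(x)

    def pop(self):
        return self.items.pop()


class Parser:
    def parse(self, text):
        stack = []
        return stack
'''

result = classes_with_methods(SOURCE, {"push", "pop"})
```

Unlike a keyword search for “stack”, this query ignores Parser, because that class does not expose the requested operations.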