12.2.1 State of the Art
As previously indicated, the premises for software search and retrieval have changed
considerably since the Internet and its users have made millions of open source com-
ponents [ 16 ] available for software developers and researchers alike in recent years.
Around the turn of the millennium, Seacord realized this potential and attempted to
fill a software repository with Java applets collected on the Web by an automated
crawler [ 36 ]. Also at that time, Ye was one of the first researchers to recognize that
software searches are hindered not only by technical weaknesses of search engines
but by usability issues as well. He found that developers are often not even aware
that a reuse candidate might be stored in a repository (which was understandable,
given the relatively small size of repositories at that time) and hence proposed
and implemented a prototypical software search system (called CodeBroker) that
continuously monitors the work of a developer and proactively presents potentially
reusable candidates based on textual information from the comments the developer
has been writing [ 41 ].
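The idea of deriving a query from a developer's comments can be illustrated with a minimal sketch. This is not CodeBroker's actual implementation: the repository contents, the stop-word list, and the names `REPOSITORY`, `keywords` and `suggest` are all hypothetical, and plain keyword overlap stands in for whatever matching the real system uses.

```python
import re

# Hypothetical in-memory repository: component name -> short description.
REPOSITORY = {
    "ArrayStack": "stack of elements backed by an array with push and pop",
    "CsvReader": "read rows from a comma separated file",
    "LinkedQueue": "queue of elements backed by a linked list",
}

# Assumed stop-word list; a real system would use a larger one.
STOP_WORDS = {"a", "an", "the", "of", "from", "with", "and", "to", "by", "that"}


def keywords(text):
    """Lower-case the text and return its set of non-stop-word tokens."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS}


def suggest(comment, repository=REPOSITORY, limit=3):
    """Rank components by how many keywords they share with the comment."""
    query = keywords(comment)
    scored = []
    for name, description in repository.items():
        overlap = len(query & keywords(description))
        if overlap > 0:
            scored.append((overlap, name))
    # Highest overlap first; ties are broken alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored[:limit]]


candidates = suggest("/** A stack that supports push and pop */")
```

A proactive tool would call something like `suggest` each time the developer finishes a comment, rather than waiting for an explicit search request.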
Also around that time, the World Wide Web witnessed the rise of large-scale
search engines helping to make its growing amount of data accessible. Inspired by
the success of Google's PageRank algorithm [ 29 ], the ComponentRank approach
of Inoue et al. [ 22 ] breathed new life into the software retrieval community
with an automated search engine (known as Spars-J). While their basic retrieval
approach was still text-based and hence simple, their collection of about 150,000
open source files was far larger than any earlier collection and, together with
the clever ranking approach, set a new standard. Inoue et al. proposed
to rank those components higher in the result list of a search that are used more
often than others amongst the indexed files. Nevertheless, the overall precision of
the searches remains too low from a specification-based reuse perspective as
long as only keyword matching is applied. Almost simultaneously, Hummel and
Atkinson [ 16 ] demonstrated that general web search engines (such as Google) could
be used for software searches by enriching queries with special keywords (such as:
filetype:java AND “class stack”), an approach that, though not working perfectly,
still delivers relevant source code with a high hit ratio.
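Ranking components by how often they are used can be sketched as a PageRank-style iteration over a "uses" graph. This is a simplified illustration, not the published ComponentRank algorithm: the function name `rank_by_usage`, the damping factor, the iteration count, and the sample graph are assumptions made for the example.

```python
def rank_by_usage(uses, damping=0.85, iterations=50):
    """Order components so that more heavily used ones come first.

    `uses` maps each component to the list of components it uses.
    """
    nodes = set(uses)
    for targets in uses.values():
        nodes.update(targets)
    nodes = sorted(nodes)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}

    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = uses.get(node, [])
            if targets:
                # A component passes its weight on to the components it uses.
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A component that uses nothing spreads its weight evenly.
                share = damping * rank[node] / n
                for other in nodes:
                    new_rank[other] += share
        rank = new_rank

    # Highest rank first; ties are broken alphabetically.
    return sorted(nodes, key=lambda node: (-rank[node], node))


# Hypothetical use relations between four components.
sample = {"App": ["Stack", "Logger"], "Parser": ["Stack"], "Stack": ["Logger"]}
ordering = rank_by_usage(sample)
```

In this sample graph, `Logger` and `Stack` are the components that other components use, so they end up ahead of `App` and `Parser`, which nothing else uses.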
However, although all the seminal search approaches described above were available
at that time, little work is known that tried to integrate them with
the emerging large-scale software search engines described in the next subsection.
Consequently, purely text-based retrieval remained the state of the art at that time.
The only visible progress was the idea of parsing source code in order to extract the
names of objects and their methods to allow more focused searches for them (as
introduced, for example, by Koders.com). Hummel et al. coined the term name-based
retrieval for that technique [ 17 ]. Retrieval approaches such as signature matching [ 42 ]
or interface-based retrieval - the combination of signature and name-based retrieval
(also described in [ 17 ]) - did not find their way into any of this new generation of
software search engines. Many of these have been developed during the last
10 years, and a good number are still available on the World Wide Web. As demonstrated
by the various software search engines that have been launched as well as