Test-Driven Reuse: Key to Improving Precision of Search Engines for Software Reuse - Finding Source Code on the Web for Remix and Reuse


12.2.1 State of the Art

As previously indicated, the premises for software search and retrieval have changed considerably in recent years, since the Internet and its users have made millions of open source components [16] available to software developers and researchers alike. Around the turn of the millennium, Seacord realized this potential and attempted to fill a software repository with Java applets collected on the Web by an automated crawler [36]. Also at that time, Ye was one of the first researchers to recognize that software searches are hindered not only by technical weaknesses of search engines, but by usability issues as well. He found that developers are often not even aware that a reuse candidate might be stored in a repository (which was understandable, given the relatively small size of repositories at that time) and hence proposed and implemented a prototypical software search system (called CodeBroker) that continuously monitors the work of a developer and proactively presents potentially reusable candidates based on textual information from the comments the developer has been writing [41].

Also around that time, the World Wide Web witnessed the rise of large-scale search engines that helped to make its growing amount of data accessible. Inspired by the success of Google's PageRank algorithm [29], it was the ComponentRank approach of Inoue et al. [22] that breathed new life into the software retrieval community with an automated search engine (known as Spars-J). While their basic retrieval approach was still text-based and hence simple, their collection of about 150,000 open source files was far larger than any collection before it and, together with the clever ranking approach, set a new standard. Inoue et al. proposed to rank those components higher in the result list of a search that are used more often than others among the indexed files. Nevertheless, the overall precision of such searches remains too low from a specification-based reuse perspective as long as merely keyword matching is applied. Almost simultaneously, Hummel and Atkinson [16] demonstrated that general web search engines (such as Google) could be used for software searches by enriching queries with special keywords (such as: filetype:java AND “class stack”), an approach that, though not working perfectly, still delivers relevant source code with a high hit ratio.
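To make the use-based ranking idea more concrete, the following is a minimal sketch of a PageRank-style iteration over a "uses" graph, in which a component gains rank when other indexed components use it. It is a simplified illustration and not the published ComponentRank algorithm; the function name `use_based_rank`, the damping factor, the iteration count, and the tiny use graph are all assumptions made for this example.

```python
def use_based_rank(uses, damping=0.85, iterations=50):
    """Rank components by how often (and by whom) they are used.

    `uses` maps a component name to the list of components it uses.
    This is a plain PageRank-style power iteration, used here only as
    a stand-in for a use-based ranking scheme.
    """
    # Collect every component that appears as a user or as a target.
    nodes = set(uses)
    for targets in uses.values():
        nodes.update(targets)
    nodes = sorted(nodes)
    n = len(nodes)

    rank = {c: 1.0 / n for c in nodes}
    for _ in range(iterations):
        new_rank = {c: (1.0 - damping) / n for c in nodes}
        for c in nodes:
            targets = uses.get(c, [])
            if targets:
                # A component passes its rank on to the components it uses.
                share = damping * rank[c] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # A component that uses nothing spreads its rank evenly.
                share = damping * rank[c] / n
                for t in nodes:
                    new_rank[t] += share
        rank = new_rank
    return rank


# Hypothetical use relations among four indexed components.
uses = {
    "App": ["Stack", "Logger"],
    "Parser": ["Stack", "Logger"],
    "Stack": ["Logger"],
    "Logger": [],
}
ranks = use_based_rank(uses)
ordered = sorted(ranks, key=ranks.get, reverse=True)
```

In this toy graph the most widely used component ends up first in `ordered`, while components that nobody uses fall to the bottom. Such a ranking only reorders the hits of a text-based search; it does not by itself check whether a hit fits a given specification.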

However, although all seminal search approaches described before were available at that time, little work is known that tried to integrate them with the upcoming large-scale software search engines described in the next subsection. Consequently, purely text-based retrieval remained the state of the art at that time. The only visible progress was the idea of parsing source code in order to extract the names of objects and their methods to allow more focused searches for them (as introduced, e.g., by Koders.com). Hummel et al. have coined the term name-based retrieval for that technique [17]. Retrieval approaches such as signature matching [42] or interface-based retrieval, the combination of signature and name-based retrieval (also described in [17]), did not find their way into any of this new generation of software search engines. Numerous such engines have been developed during the last 10 years and a good number are still available on the World Wide Web. As demonstrated by the various software search engines that have been launched as well as
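The difference between name-based retrieval, signature matching, and interface-based retrieval can be illustrated with a small sketch. It assumes that the class and method information has already been extracted from the source code and compares it against a query in three ways. The data classes, the matcher functions, the tiny index, and the use of exact type equality are all hypothetical simplifications chosen for this example and are not taken from any of the cited systems.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Method:
    name: str
    params: tuple  # parameter type names, in order
    returns: str


@dataclass(frozen=True)
class Component:
    package: str
    name: str
    methods: tuple


def name_match(query, candidate):
    """Name-based: class and method names agree; types are ignored."""
    wanted = {m.name for m in query.methods}
    offered = {m.name for m in candidate.methods}
    return query.name == candidate.name and wanted <= offered


def signature_match(query, candidate):
    """Signature matching: parameter and return types agree; names are ignored."""
    available = [(m.params, m.returns) for m in candidate.methods]
    for m in query.methods:
        signature = (m.params, m.returns)
        if signature not in available:
            return False
        available.remove(signature)  # each candidate method is used once
    return True


def interface_match(query, candidate):
    """Interface-based: names and signatures must both agree."""
    return query.name == candidate.name and set(query.methods) <= set(candidate.methods)


def search(query, matcher):
    return [f"{c.package}.{c.name}" for c in INDEX if matcher(query, c)]


# Hypothetical index of already-parsed components.
INDEX = [
    Component("util", "Stack", (Method("push", ("Object",), "void"),
                                Method("pop", (), "Object"))),
    Component("legacy", "Stack", (Method("push", ("int",), "void"),
                                  Method("pop", (), "int"))),
    Component("misc", "Bag", (Method("add", ("Object",), "void"),
                              Method("remove", (), "Object"))),
    Component("io", "Logger", (Method("log", ("String",), "void"),)),
]

QUERY = Component("", "Stack", (Method("push", ("Object",), "void"),
                                Method("pop", (), "Object")))

by_name = search(QUERY, name_match)
by_signature = search(QUERY, signature_match)
by_interface = search(QUERY, interface_match)
```

With this index, the name-based search also returns a `Stack` whose parameter types differ from the query, and the signature search also returns an unrelated `Bag` that merely has the same method shapes. Only the combined interface-based search narrows the result down to the single candidate that agrees in both names and types.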
