Test-Driven Reuse: Key to Improving Precision of Search Engines for Software Reuse - Finding Source Code on the Web for Remix and Reuse


Lucene framework [14] and currently contains about ten million files from well-known open source hosting sites and the open Web (roughly 8 %), of which roughly 40 % are binary files (primarily Java archives, but some .NET binaries as well). Special parsers for each supported programming language make it possible to extract syntactical information, store it in the index and search for it later. In addition to class and method names, we store operation signatures (i.e., parameter and return types) and complete operation headers (i.e., operation signatures plus names) as concatenated terms optimized for Lucene in the index. Details on their structure can be found in another chapter of this book [5]. Currently, Merobase is able to work with Java, C++ and C# sources, WSDL files, binary Java classes from Java archives (JARs) and .NET binaries.
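
To make the idea of concatenated signature and header terms more concrete, the following minimal sketch builds one string per operation that could be stored as a single index term. The term layout, the class name OperationTerms and the operation name getCell are assumptions made purely for illustration; the structure Merobase actually uses is the one described in [5].

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: the real Merobase term layout is described in [5], not here.
final class OperationTerms {

    // Signature term: parameter types and return type only, without the operation name.
    static String signatureTerm(List<String> paramTypes, String returnType) {
        return "(" + String.join(",", paramTypes) + "):" + returnType;
    }

    // Header term: the operation name followed by its signature term.
    static String headerTerm(String name, List<String> paramTypes, String returnType) {
        return name + signatureTerm(paramTypes, returnType);
    }

    public static void main(String[] args) {
        List<String> params = Arrays.asList("int", "int");
        System.out.println(signatureTerm(params, "String"));         // (int,int):String
        System.out.println(headerTerm("getCell", params, "String")); // getCell(int,int):String
    }
}
```

Storing each signature or header as one term means that a signature query can be answered by an exact term lookup instead of a combination of several loosely related keywords.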

Whenever a user sends a request to the Merobase server (either through the web interface available at merobase.com or through a client program such as Code Conjurer accessing its web-service-based API), the above parsers and a special JUnit parser (able to extract the interface of the class under test from test cases) are invoked and try to extract as much syntactic information from the query as possible. If none of the parsers recognizes parsable code, however, a simple keyword search is executed. Based on the parsed syntactic information, Merobase supports retrieval by class and operation names, by signature matching and by matching the full interface of classes as described before. Although preliminary results indicate that the latter indeed leads to a higher precision with common “toy examples” [17] collected from the literature, the risk of “over-specifying” desired components is certainly also real, as the previous spreadsheet example has demonstrated: no candidate completely matched the relatively simple interface we had specified. Nevertheless, the retrieved components that eventually worked were found amongst roughly 22,000 results of a “relaxed” query that merely searched for the desired signatures (i.e., ignored class and operation names in the interface). As searches for more complex interfaces often tend to deliver few results (as predicted, e.g., by Crnkovic [6]), we have integrated a number of strategies for relaxing queries into Merobase as well. Further details on the index structure of Merobase, its content, and the applied matching strategies are explained in another chapter of this book [5].
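
The difference between a full interface match and a relaxed, signature-only match can be sketched as follows. This is an illustration under simplifying assumptions rather than Merobase's actual matching code: the class RelaxedSearch, its in-memory index and the signature strings are hypothetical, and the comparison of class names is omitted for brevity.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Illustrative only: not Merobase's actual matching code.
final class RelaxedSearch {

    /** One operation: a name plus a signature string such as "(int,int):String". */
    static final class Operation {
        final String name;
        final String signature;

        Operation(String name, String signature) {
            this.name = name;
            this.signature = signature;
        }
    }

    /** Strict match: every wanted operation exists with the same name and signature. */
    static boolean matchesHeaders(List<Operation> wanted, List<Operation> offered) {
        for (Operation w : wanted) {
            boolean found = false;
            for (Operation o : offered) {
                if (o.name.equals(w.name) && o.signature.equals(w.signature)) {
                    found = true;
                    break;
                }
            }
            if (!found) {
                return false;
            }
        }
        return true;
    }

    /** Relaxed match: names are ignored; each wanted signature needs its own offered operation. */
    static boolean matchesSignatures(List<Operation> wanted, List<Operation> offered) {
        List<String> remaining = new ArrayList<>();
        for (Operation o : offered) {
            remaining.add(o.signature);
        }
        for (Operation w : wanted) {
            if (!remaining.remove(w.signature)) {
                return false;
            }
        }
        return true;
    }

    /** Tries the strict query first and falls back to the relaxed one if nothing matches. */
    static List<String> search(List<Operation> wanted, Map<String, List<Operation>> index) {
        List<String> strict = new ArrayList<>();
        List<String> relaxed = new ArrayList<>();
        for (Map.Entry<String, List<Operation>> e : index.entrySet()) {
            if (matchesHeaders(wanted, e.getValue())) {
                strict.add(e.getKey());
            }
            if (matchesSignatures(wanted, e.getValue())) {
                relaxed.add(e.getKey());
            }
        }
        return strict.isEmpty() ? relaxed : strict;
    }
}
```

In this sketch a candidate whose operations carry different names than those in the query is missed by the strict pass but still returned by the relaxed pass, provided its signatures fit. The price is a larger and less precise result set, which is why a subsequent filtering step such as testing becomes important.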

In the case of a test-driven search, which is triggered when a JUnit test case (such as the one in Listing 12.1) is submitted, Merobase automatically tries to compile, adapt and test the highest-ranked candidates. If a candidate relies on additional classes, the algorithm uses dependency information to locate them as well (as seen in the spreadsheet example). As shown in Fig. 12.4, the actual compilation and testing are not carried out on the search server itself, but on dedicated virtual machines within sandboxes. These ensure that the executed code cannot do anything harmful to the user's system or bring the whole testing environment down; in our publicly available system we have also deactivated network transfer to prevent abuse. Another system continuously monitors the virtual machines (by polling a special monitoring service provided by the sandboxes) and, as soon as it recognizes that one is not working properly, simply replaces it with a new instance; replacing and restarting takes about 30 s.
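
The monitoring behaviour described above amounts to a polling loop over a pool of sandboxes. The sketch below shows a single polling pass; the Sandbox interface, the factory and the class name SandboxMonitor are hypothetical stand-ins, since the source does not describe the monitoring service's actual API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

// Hypothetical sketch of the polling idea; not the actual Merobase monitoring system.
final class SandboxMonitor {

    /** Assumed view of the monitoring service that a sandbox exposes. */
    interface Sandbox {
        boolean isHealthy();
    }

    private final List<Sandbox> pool;
    private final Supplier<Sandbox> factory;

    SandboxMonitor(List<Sandbox> initial, Supplier<Sandbox> factory) {
        this.pool = new ArrayList<>(initial);
        this.factory = factory;
    }

    /** One polling pass: swaps every unhealthy sandbox for a fresh one and returns how many were replaced. */
    int pollOnce() {
        int replaced = 0;
        for (int i = 0; i < pool.size(); i++) {
            if (!pool.get(i).isHealthy()) {
                pool.set(i, factory.get());
                replaced++;
            }
        }
        return replaced;
    }

    /** Number of sandboxes in the pool; replacement keeps it constant. */
    int size() {
        return pool.size();
    }
}
```

A surrounding scheduler would call pollOnce at a fixed interval. Because an unhealthy instance is replaced rather than repaired, the pool size stays constant and a misbehaving candidate cannot permanently reduce testing capacity.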
