Lucene framework [14] and currently contains about ten million files from
well-known open source hosting sites and the open Web (roughly 8 %), of which
roughly 40 % are binary files (primarily Java archives, but also some .NET binaries).
Dedicated parsers for each supported programming language extract syntactic
information, which is stored in the index so that it can be searched later. In addition to
class and method names, we store operation signatures (i.e., parameter and return
types) and complete operation headers (i.e., operation signatures plus names) in the
index as concatenated terms optimized for Lucene. Details on their structure can
be found in another chapter of this book [5]. Currently, Merobase can work
with Java, C++ and C# sources, WSDL files, binary Java classes from Java archives
(JARs) and .NET binaries.
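To make the idea of concatenated terms more concrete, the following minimal sketch shows one way a signature and a full operation header could each be flattened into a single string. The term layout (comma-separated parameter types, `->` before the return type, `:` after the operation name) and the names `SignatureTerms`, `signatureTerm` and `headerTerm` are assumptions made purely for illustration; the actual Merobase term structure is the one described in [5].

```java
import java.util.Arrays;
import java.util.List;

// Illustrative only: a hypothetical term layout, not the actual Merobase index format.
final class SignatureTerms {

    // Signature term: parameter types and return type, without any names.
    static String signatureTerm(List<String> paramTypes, String returnType) {
        return String.join(",", paramTypes) + "->" + returnType;
    }

    // Header term: the operation name followed by its signature term.
    static String headerTerm(String name, List<String> paramTypes, String returnType) {
        return name + ":" + signatureTerm(paramTypes, returnType);
    }

    public static void main(String[] args) {
        List<String> params = Arrays.asList("int", "int", "String");
        System.out.println(signatureTerm(params, "void"));      // int,int,String->void
        System.out.println(headerTerm("put", params, "void"));  // put:int,int,String->void
    }
}
```

Storing both terms would allow a query to match either on the signature alone or on the name together with the signature, which corresponds to the two kinds of information the text says are indexed.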
Whenever a user sends a request to the Merobase server (either through the web
interface available at merobase.com or through a client program such as Code Conjurer
that accesses its web-service-based API), the above parsers and a special JUnit parser
(able to extract the interface of the class under test from test cases) are invoked and
try to extract as much syntactic information from the query as possible. If none of the
parsers recognizes parsable code, a simple keyword search is executed instead. Based on
the parsed syntactic information, Merobase supports retrieval by class and operation
names, by signature matching, and by matching the full interface of classes as described
before. Although preliminary results indicate that the latter indeed leads to higher
precision with common “toy examples” [17] collected from the literature, the risk
of “over-specifying” desired components is also real, as the previous
spreadsheet example has demonstrated: no candidate completely matched the
relatively simple interface we had specified. Nevertheless, the retrieved components
that eventually worked were found among roughly 22,000 results
of a “relaxed” query that merely searched for the desired signatures (i.e., ignored
class and operation names in the interface). As searches for more complex
interfaces often tend to deliver few results (as predicted, e.g., by Crnkovic [6]), we have
integrated a number of strategies for relaxing queries into Merobase as well.
Further details on the index structure of Merobase, its content, and the applied matching
strategies are explained in another chapter of this book [5].
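The difference between full interface matching and the "relaxed" signature-only query can be illustrated with a small in-memory sketch. It assumes that each candidate is simply a list of operations and that relaxation is attempted only when the strict query returns nothing; both the data model and that trigger are assumptions for illustration, as are all names in the code. The sketch also ignores types, ranking and one-to-one assignment of operations, which a real matcher would need to handle.

```java
import java.util.List;
import java.util.stream.Collectors;

// Illustrative only: strict vs. relaxed interface matching over an in-memory list.
final class RelaxedMatching {

    // An operation reduced to its name and a signature string.
    static final class Operation {
        final String name;
        final String signature;

        Operation(String name, String signature) {
            this.name = name;
            this.signature = signature;
        }
    }

    // Strict: every queried operation must occur with the same name and signature.
    static boolean matchesStrict(List<Operation> query, List<Operation> candidate) {
        return query.stream().allMatch(q -> candidate.stream().anyMatch(
                c -> c.name.equals(q.name) && c.signature.equals(q.signature)));
    }

    // Relaxed: only the signatures must occur; operation names are ignored.
    // Simplification: two queried operations may map to the same candidate operation.
    static boolean matchesRelaxed(List<Operation> query, List<Operation> candidate) {
        return query.stream().allMatch(q -> candidate.stream().anyMatch(
                c -> c.signature.equals(q.signature)));
    }

    // Assumed policy: try the strict query first and relax only if it finds nothing.
    static List<List<Operation>> search(List<Operation> query, List<List<Operation>> index) {
        List<List<Operation>> strict = index.stream()
                .filter(c -> matchesStrict(query, c))
                .collect(Collectors.toList());
        if (!strict.isEmpty()) {
            return strict;
        }
        return index.stream()
                .filter(c -> matchesRelaxed(query, c))
                .collect(Collectors.toList());
    }
}
```

With a query for `put` and `get`, a candidate that offers the same signatures under the names `setCell` and `getCell` is rejected by the strict check but accepted by the relaxed one, which mirrors the situation described for the spreadsheet example.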
In the case of a test-driven search, which is triggered when a JUnit test case (such
as the one in Listing 12.1) is submitted, Merobase automatically tries to compile,
adapt and test the highest-ranked candidates. If a candidate relies on additional
classes, the algorithm uses dependency information to locate them as well (as seen in
the spreadsheet example). As shown in Fig. 12.4, the actual compilation and testing
are not carried out on the search server itself, but on dedicated virtual machines
within sandboxes. These ensure that the executed code cannot harm the user's system
or bring down the whole testing environment; in our publicly available system we
have also deactivated network transfer to prevent abuse. Another system continuously
monitors the virtual machines (by polling a special monitoring service provided by
the sandboxes) and, as soon as it recognizes that one is not working properly, replaces
it with a new instance. Replacing and restarting an instance takes about 30 s.
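The polling-and-replacement scheme for the sandboxed virtual machines can be sketched as follows. The `Sandbox` interface, its `isHealthy` method and the factory used to create replacement instances are hypothetical stand-ins for the monitoring service and provisioning mechanism, whose real interfaces are not described here; treating an unreachable monitoring service as a failure is likewise an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

// Illustrative only: one polling pass over a pool of sandboxed virtual machines.
final class SandboxMonitor {

    // Hypothetical view of the monitoring service exposed by a sandbox.
    interface Sandbox {
        boolean isHealthy();
    }

    private final List<Sandbox> pool;
    private final Supplier<Sandbox> factory; // creates a fresh replacement instance

    SandboxMonitor(List<Sandbox> initial, Supplier<Sandbox> factory) {
        this.pool = new ArrayList<>(initial);
        this.factory = factory;
    }

    // Polls every sandbox once and replaces those that are not working properly.
    // Returns the number of replaced instances.
    int pollOnce() {
        int replaced = 0;
        for (int i = 0; i < pool.size(); i++) {
            boolean healthy;
            try {
                healthy = pool.get(i).isHealthy();
            } catch (RuntimeException e) {
                healthy = false; // assumption: a failing health check counts as broken
            }
            if (!healthy) {
                pool.set(i, factory.get());
                replaced++;
            }
        }
        return replaced;
    }

    int size() {
        return pool.size();
    }
}
```

In a running system, `pollOnce` would be invoked periodically, for example from a `java.util.concurrent.ScheduledExecutorService`, so that a broken instance is noticed within one polling interval and the pool size stays constant.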