Artifact Representation Techniques for Large-Scale Software Search Engines - Finding Source Code on the Web for Remix and Reuse

5.6.2 Ongoing Work

The most obvious approach for creating such a platform is to integrate a relational
database and a full-text search index into a single logical product. Practically all large
database vendors now offer full-text search capabilities as part of their database
products, such as Oracle Text or MySQL Fulltext. However, performance still seems
to be an issue in this context: in a recent comparison of simple keyword searches in
Lucene and Oracle Text, Linping and Lidong [27] found that Lucene is about an order
of magnitude faster as long as the result set does not become too large (under roughly
1,500 results). Since a combination with the various join operations required for
API-based retrieval is likely to slow down RDBMS-based retrieval even further, pure
FTSF platforms using the techniques outlined above should remain superior for some
time. An interesting practical alternative is to run a database alongside an FTSF index
on the same corpus of artifacts (as is already done, for example, in the backend of
Sourcerer [35]), with software that automatically keeps the latter up to date with the
former. Hibernate Search [22] is a recently developed open source framework that
achieves this by making text search available on a domain model stored in a database
by the object-relational mapper Hibernate. This offers the user the best of both worlds
in terms of querying options, but at the expense of maintaining two stores of the search
base, one in the RDBMS and the other in the FTSF (i.e. Lucene).
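
The idea of running two stores side by side and keeping the index up to date with the
database can be illustrated with a minimal sketch. It is not Hibernate Search's API:
Python's built-in sqlite3 module stands in for the RDBMS, a plain dictionary stands in
for the Lucene index, and the class HybridStore, its methods and the artifact table are
hypothetical names chosen for illustration only.

```python
import re
import sqlite3
from collections import defaultdict


class HybridStore:
    """Toy stand-in: sqlite3 plays the RDBMS, a dict plays the full-text index."""

    def __init__(self):
        self.db = sqlite3.connect(":memory:")
        self.db.execute(
            "CREATE TABLE artifact (id INTEGER PRIMARY KEY, name TEXT, body TEXT)"
        )
        self.index = defaultdict(set)  # term -> ids of artifacts containing it

    @staticmethod
    def _terms(text):
        # Lower-case alphanumeric tokens; a real analyzer does far more.
        return set(re.findall(r"[a-z0-9]+", text.lower()))

    def _unindex(self, artifact_id):
        # Drop every posting for this artifact so stale terms cannot match.
        for ids in self.index.values():
            ids.discard(artifact_id)

    def save(self, artifact_id, name, body):
        """Insert or update the row first, then re-index it so both stores agree."""
        with self.db:  # commits on success, rolls back on error
            self.db.execute(
                "INSERT OR REPLACE INTO artifact (id, name, body) VALUES (?, ?, ?)",
                (artifact_id, name, body),
            )
        self._unindex(artifact_id)
        for term in self._terms(name + " " + body):
            self.index[term].add(artifact_id)

    def delete(self, artifact_id):
        with self.db:
            self.db.execute("DELETE FROM artifact WHERE id = ?", (artifact_id,))
        self._unindex(artifact_id)

    def search(self, query):
        """Return the ids of artifacts that contain every term of the query."""
        terms = self._terms(query)
        if not terms:
            return set()
        return set.intersection(*(self.index.get(t, set()) for t in terms))


store = HybridStore()
store.save(1, "Stack", "push pop peek")
store.save(2, "Queue", "enqueue dequeue peek")
both = store.search("peek")       # matches artifacts 1 and 2
store.save(2, "Queue", "enqueue dequeue")
only_stack = store.search("peek") # artifact 2 no longer matches after the update
```

Every write goes through save or delete, so the index cannot drift from the table as
long as callers do not bypass these methods. The cost named in the text is visible here
as well: each artifact is held twice, once as a row and once as index postings.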

The database part of such a hybrid system can be searched using standard SQL
queries, allowing arbitrary structured searches to take advantage of its relational
structure. For example, the new schema for the upcoming version of Merobase, built
using Hibernate Search, is shown in Fig. 5.3 as a UML class diagram. A similar
scheme for the Sourcerer search engine is documented in another chapter of this
book [35]. The main advantage of Hibernate Search is that it automatically creates,
and synchronizes, a Lucene full-text search index from the content in the RDBMS.
The overall memory required is obviously greater, but the Lucene index itself is
much leaner than the original. The new version of Merobase therefore supports
all the FTSF capabilities of the original while also allowing SQL queries to be applied
to the same search base. The transaction-safe updating capabilities of the RDBMS also
allow the content of the search base to be updated much more dynamically by multiple
concurrent crawlers and data mining engines.
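
A structured search of the kind that SQL makes possible might look like the following
sketch. The three tables are a hypothetical, heavily simplified schema invented for
illustration; they are not the Merobase schema of Fig. 5.3 or the Sourcerer schema, and
the component names and versions are made-up sample data. The query finds all methods
that take a parameter whose type is a given component in a given version, something a
keyword index alone cannot express.

```python
import sqlite3

# Hypothetical schema: components own methods, and each parameter's type
# is itself a (versioned) component.
SCHEMA = """
CREATE TABLE component (id INTEGER PRIMARY KEY, name TEXT, version TEXT);
CREATE TABLE method (
    id INTEGER PRIMARY KEY,
    owner_id INTEGER REFERENCES component(id),
    name TEXT
);
CREATE TABLE parameter (
    method_id INTEGER REFERENCES method(id),
    type_id INTEGER REFERENCES component(id)
);
"""

# The component table is joined twice: once as the owner of the method,
# once as the type of the parameter.
QUERY = """
SELECT owner.name, method.name
FROM parameter
JOIN method ON method.id = parameter.method_id
JOIN component AS owner ON owner.id = method.owner_id
JOIN component AS ptype ON ptype.id = parameter.type_id
WHERE ptype.name = ? AND ptype.version = ?
ORDER BY owner.name, method.name
"""


def methods_taking(db, type_name, type_version):
    """Return (owner, method) pairs whose parameter type matches name and version."""
    return db.execute(QUERY, (type_name, type_version)).fetchall()


def demo_db():
    """Build a small in-memory database with made-up sample rows."""
    db = sqlite3.connect(":memory:")
    db.executescript(SCHEMA)
    db.executemany(
        "INSERT INTO component VALUES (?, ?, ?)",
        [(1, "Session", "3.6"), (2, "Session", "4.0"),
         (3, "OrderDao", "1.0"), (4, "LegacyDao", "0.9")],
    )
    db.executemany(
        "INSERT INTO method VALUES (?, ?, ?)",
        [(10, 3, "save"), (11, 4, "store")],
    )
    db.executemany("INSERT INTO parameter VALUES (?, ?)", [(10, 2), (11, 1)])
    db.commit()
    return db


db = demo_db()
current = methods_taking(db, "Session", "4.0")   # only OrderDao.save matches
outdated = methods_taking(db, "Session", "3.6")  # only LegacyDao.store matches
```

Because the version constraint is applied to the parameter's type rather than to the
result itself, uses of an outdated version can be excluded directly in the query.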

Using a relational database, it is also possible to support precise searches for
components with particular combinations of properties on the fly, for example
searching for a specific version of a component that is used as a parameter in a
method. Consider the case of a developer who is faced with the task of learning to
use a particular framework. With an FTSF-based system it is only possible to filter
for the version of the actual search result, i.e. the file that illustrates how to use
a framework. However, searches often deliver many unsuitable results, such as
outdated tutorials or documents that describe old versions of the framework and
contain deprecated methods or old orchestrations of components that are no
longer applicable (e.g. initialization). A software developer trying to discover how
to use the latest version of a framework would find it extremely helpful to be able
