Infrastructure for Building Code Search Applications for Developers - Finding Source Code on the Web for Remix and Reuse


Sourcerer's index model incorporates these code-specific heuristics by leveraging Lucene's semi-structured document model. For each heuristic, the index model introduces a field that stores the terms extracted by that heuristic. Each field is assigned a boost value, so that some heuristics can be given higher priority than others, depending on the code search application. With such an index model, a retrieval scheme for a code search application simply specifies which fields are matched against the user query, and different strategies for retrieving code entities can be implemented by varying these schemes. For example, the top right corner of Fig. 8.4 shows the code snippet for the method entity createResource (previously shown in Fig. 8.3). The bottom part of Fig. 8.4 shows an index document with five fields, each capturing one of five heuristics. The top left part of Fig. 8.4 shows, in tabular form, how two schemes match the same query, create icon, to the index document (and thus to the method entity) differently.

Scheme 1 uses only three heuristics, whereas Scheme 2 uses all five. Scheme 1 looks over a limited set of terms associated with the method entity createResource. This set includes only one of the query's terms, create. Scheme 2 adds two more fields, which lets it look over a richer set of terms that includes both terms in the query. Assuming that all terms in the query must be matched for a document to be retrieved, Scheme 2 outperforms Scheme 1: only Scheme 2 retrieves the method, because its additional heuristics harvest more meaningful words describing code entities.
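The difference between the two schemes can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Sourcerer's implementation and not Lucene's API: the field names, the terms stored in each field, and the boost values are all hypothetical (the actual fields are those shown in Fig. 8.4), and the score function is a simple sum of boosts rather than Lucene's scoring formula.

```python
# Hypothetical index document for the method entity createResource.
# Field names and terms are illustrative assumptions, not the contents of Fig. 8.4.
INDEX_DOC = {
    "name_terms": ["create", "resource"],
    "fqn_terms": ["plugin", "resources", "create", "resource"],
    "return_type_terms": ["resource"],
    "used_entity_terms": ["image", "icon", "load"],
    "comment_terms": ["icon", "file"],
}

# Hypothetical per-field boosts: a larger value gives that heuristic more weight.
BOOSTS = {
    "name_terms": 4.0,
    "fqn_terms": 2.0,
    "return_type_terms": 1.0,
    "used_entity_terms": 1.5,
    "comment_terms": 1.0,
}

# A retrieval scheme is just the set of fields matched against the query.
SCHEME_1 = ("name_terms", "fqn_terms", "return_type_terms")
SCHEME_2 = SCHEME_1 + ("used_entity_terms", "comment_terms")


def visible_terms(doc, scheme):
    """Collect every term stored in the fields that the scheme searches."""
    return {term for field in scheme for term in doc.get(field, [])}


def matched_terms(query, doc, scheme):
    """Return the query terms found in at least one field of the scheme."""
    visible = visible_terms(doc, scheme)
    return [term for term in query.lower().split() if term in visible]


def retrieves(query, doc, scheme):
    """True if every query term is matched (the all-terms assumption)."""
    return len(matched_terms(query, doc, scheme)) == len(query.split())


def score(query, doc, scheme, boosts):
    """Sum the boost of every (term, field) match; a stand-in for real scoring."""
    total = 0.0
    for term in query.lower().split():
        for field in scheme:
            if term in doc.get(field, []):
                total += boosts.get(field, 1.0)
    return total


if __name__ == "__main__":
    for name, scheme in (("Scheme 1", SCHEME_1), ("Scheme 2", SCHEME_2)):
        print(name,
              matched_terms("create icon", INDEX_DOC, scheme),
              retrieves("create icon", INDEX_DOC, scheme),
              score("create icon", INDEX_DOC, scheme, BOOSTS))
```

With these assumed terms, Scheme 1 matches only create and so does not retrieve the document, while Scheme 2 also matches icon through its two extra fields. Keeping a scheme as a plain tuple of field names reflects the point made above: changing the retrieval strategy means changing which fields are searched, not rebuilding the index.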

Fig. 8.4 Incorporating heuristics in index model
