Artifact Representation Techniques for Large-Scale Software Search Engines - Finding Source Code on the Web for Remix and Reuse


(Open source) license                 | 0...1 | Controlled | license
Classification according to copyleft  | 1     | Enumerated | lictype
Author(s) of the artifact             | 0...* | Free text  | author

Since an artifact will usually require fields with a multiplicity larger than 1 (for instance, for storing the operation names of an artifact), it is useful that Lucene allows several entries per document to be stored under the same field name. During the indexing process, the content of each of the above fields is tokenized. This means that before it is actually indexed, it is fed to Lucene's standard analyzer, which separates it into a stream of searchable tokens (i.e. typically words). Since software developers often concatenate words to create more expressive operation or variable names using approaches such as “camel case”, enhancing the Lucene analyzer to reverse this process may improve search quality (see Bajracharya et al. [24], for example).
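
The idea of reversing camel case before indexing can be illustrated with a minimal sketch. It is written in plain Python rather than as a Lucene analyzer, and the function name `split_identifier` and the regular expression are assumptions made for illustration; they are not taken from Lucene or from Bajracharya et al.

```python
import re

# Hypothetical pattern: an acronym followed by a capitalized word, a
# (possibly capitalized) lowercase word, a trailing acronym, or digits.
_WORD = re.compile(r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|[0-9]+")


def split_identifier(identifier):
    """Split a camel-case or underscore-separated name into lowercase tokens."""
    return [word.lower() for word in _WORD.findall(identifier)]


print(split_identifier("getHTTPResponse"))  # → ['get', 'http', 'response']
print(split_identifier("add_all"))          # → ['add', 'all']
```

In a real index it would probably be sensible to keep the original identifier as an additional token, so that exact-name queries still match.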

In order to avoid maintaining an extra database with additional information about the indexed artifacts that can be used for result presentation (as well as for more advanced searches, as implemented, e.g., in Sourcerer [20]), it makes sense to store a number of additional fields that are not directly used in the search process. Examples include the date when a document was added, various source code metrics and a unique hash value that allows simple duplicate recognition. Furthermore, since Lucene used to “destroy” all formatting information (i.e. upper and lower casing), free text fields that are of interest to the user had to be stored a second time in a non-tokenized and non-searchable way for optimized result presentation. Although this practice increases the index size considerably, the impact on search speed is negligible. Fortunately, more recent versions of Lucene are able to handle this internally, so that copying the fields is no longer necessary. With this relatively simple index structure, the full power of Lucene's query engine (as described, e.g., in [16]), including wildcards, range queries, etc., is available on the tokenized fields and can be used to support keyword-based retrieval of software artifacts (see [36] for the full Merobase Lucene index built on these principles).
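
The index structure described above can be sketched as a toy in-memory index. This is not Lucene and does not use its API: the class `ToyArtifactIndex`, the field name `operation`, the sample author and the choice of SHA-1 over the raw source text are all assumptions made for illustration. The sketch shows multi-valued tokenized fields, a stored-only copy that preserves the original casing, and hash-based duplicate recognition.

```python
import hashlib
from collections import defaultdict


class ToyArtifactIndex:
    """Tiny in-memory stand-in for a fielded full-text index (not Lucene)."""

    def __init__(self):
        self._postings = defaultdict(set)  # (field, token) -> document ids
        self._stored = {}                  # document id -> stored-only data
        self._by_hash = {}                 # content hash -> first document id

    def add(self, doc_id, source, fields):
        """Index an artifact; return the id of an earlier duplicate, else None."""
        digest = hashlib.sha1(source.encode("utf-8")).hexdigest()
        if digest in self._by_hash:
            return self._by_hash[digest]
        self._by_hash[digest] = doc_id
        for name, values in fields.items():
            for value in values:           # one field name, several entries
                for token in value.lower().split():
                    self._postings[(name, token)].add(doc_id)
        # Stored but not searchable: the hash and the original field values.
        self._stored[doc_id] = {"hash": digest, "display": fields}
        return None

    def search(self, field, term):
        """Return the ids of documents whose field contains the term."""
        return sorted(self._postings.get((field, term.lower()), set()))

    def stored(self, doc_id):
        """Return the stored-only data used for result presentation."""
        return self._stored[doc_id]


index = ToyArtifactIndex()
fields = {"operation": ["add", "sub"], "author": ["Jane Doe"]}
index.add("a1", "class Calc { /* hypothetical source */ }", fields)
print(index.search("author", "doe"))              # → ['a1']
print(index.stored("a1")["display"]["author"])    # → ['Jane Doe']
```

Searching works on the lowercased tokens, while the presentation layer reads the untouched values from the stored copy, which mirrors the duplication of free text fields described above.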

5.5 Advanced Representation Techniques

As mentioned above, Lucene allows values to be stored in different fields and thus supports faceted retrieval approaches (as discussed in Sect. 5.4) out of the box. This makes it possible to index the source code of software artifacts and to enhance records with metadata such as the artifact's language, project environment, documentation, etc. Search results are also ranked, so that the best-matching result is always delivered first. The drawback, however, is that Lucene's fields cannot be relationally connected as in a database, which makes it difficult to search for operation signatures, for example. Queries such as “give me all artifacts containing two methods add and sub receiving two int parameters and returning an int” are thus not directly feasible. Using Boolean operators makes it possible to concatenate fields from simple searches for individual operation in-
