(Open source) license                | 0...1 | Controlled | license
Classification according to copyleft | 1     | Enumerated | lictype
Author(s) of the artifact            | 0...* | Free text  | author
Since an artifact will usually require fields with a multiplicity larger than 1
(for instance, for storing the operation names of an artifact), it is useful that Lucene
allows several entries per document to be stored under the same field name. During
the indexing process the content of each of the above fields is tokenized: before
it is actually indexed, it is fed to Lucene's standard analyzer, which separates
it into a stream of searchable tokens (typically words). Since software developers
often concatenate words to create more expressive operation or variable names using
conventions such as “camel case”, enhancing the Lucene analyzer to reverse this
process may improve search quality (see Bajracharya et al. [24], for example).
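
The tokenization step and the reversal of camel case can be illustrated with a small sketch. It is written in plain Python and does not use Lucene's API; the function names and the `operation` field name are hypothetical, and the regular expression is only one possible splitting heuristic, not the approach taken in [24].

```python
import re
from collections import defaultdict

# Heuristic pattern: an acronym followed by a capitalized word, an ordinary
# word, a trailing acronym, or a run of digits.
_CAMEL_PARTS = re.compile(r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|\d+")


def split_camel_case(identifier):
    """Split an identifier such as 'getHTTPResponseCode' into lower-cased words."""
    return [part.lower() for part in _CAMEL_PARTS.findall(identifier)]


def analyze(text):
    """Tokenize text on non-alphanumeric characters, then split camel case."""
    tokens = []
    for raw in re.split(r"[^A-Za-z0-9]+", text):
        if raw:
            tokens.extend(split_camel_case(raw))
    return tokens


def index_document(fields):
    """Build {field name: [tokens]} from (name, value) pairs.

    A field name may occur several times, mirroring fields with a
    multiplicity larger than 1.
    """
    indexed = defaultdict(list)
    for name, value in fields:
        indexed[name].extend(analyze(value))
    return dict(indexed)


# Hypothetical artifact with one author and two operation names.
doc = index_document([
    ("author", "Jane Doe"),
    ("operation", "addTwoInts"),
    ("operation", "parseXMLFile"),
])
```

With this sketch, a keyword query for `xml` or `ints` would match the `operation` field even though neither word appears on its own in the source code.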
In order to avoid maintaining an extra database with additional information about
the indexed artifacts that can be used for result presentation (as well as for more
advanced searches, as implemented in Sourcerer [20], for example), it makes sense to
store a number of additional fields that are not directly used in the search process.
Examples include the date when a document was added, various source code metrics and
a unique hash value that allows simple duplicate recognition. Furthermore, since
Lucene used to “destroy” all formatting information (i.e. upper and lower casing),
free text fields that are of interest to the user need to be stored a second time in a
non-tokenized and non-searchable way for optimized result presentation. Although
this practice increases the index size considerably, the impact on search speed is
negligible. Fortunately, more recent versions of Lucene are even able to handle this
internally so that copying the fields is no longer necessary. With this relatively
simple index structure, the full power of Lucene's query engine (as described in [16],
for example), including wildcards, range queries, etc., is available on the tokenized
fields and can be used to support keyword-based retrieval of software artifacts (see [36]
for the full Merobase Lucene index built on these principles).
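
The split between searchable fields and fields that are only stored for presentation, together with hash-based duplicate recognition, can be sketched as follows. This is a toy in-memory structure in plain Python, not Lucene's API and not the Merobase index; the class and method names, the use of SHA-1, and the whitespace normalization before hashing are all assumptions made for illustration.

```python
import hashlib


def content_hash(source_code):
    """Hash whitespace-normalized source so that reformatted copies collide."""
    normalized = " ".join(source_code.split())
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


class ToyIndex:
    """In-memory stand-in for an index with searchable and stored-only fields."""

    def __init__(self):
        self.postings = {}  # (field, token) -> set of document ids (searchable)
        self.stored = []    # document id -> original values (presentation only)
        self.by_hash = {}   # content hash -> document id (duplicate recognition)

    def add(self, source_code, author, added_on):
        """Add an artifact; return the existing id if it is a duplicate."""
        digest = content_hash(source_code)
        if digest in self.by_hash:
            return self.by_hash[digest]
        doc_id = len(self.stored)
        self.by_hash[digest] = doc_id
        # Searchable copy: tokenized and lower-cased, so casing is lost here.
        for token in author.lower().split():
            self.postings.setdefault(("author", token), set()).add(doc_id)
        # Stored copy: untouched, used only when presenting results.
        self.stored.append({"author": author, "added": added_on, "hash": digest})
        return doc_id

    def search(self, field, token):
        """Return the stored records of all documents matching one token."""
        ids = sorted(self.postings.get((field, token.lower()), ()))
        return [self.stored[i] for i in ids]


index = ToyIndex()
first = index.add("int add(int a, int b) { return a + b; }", "Jane Doe", "2010-05-01")
again = index.add("int add(int a,  int b)\n{ return a + b; }", "Jane Doe", "2010-06-01")
hits = index.search("author", "jane")
```

Here the second `add` call is recognized as a duplicate of the first, and the search result carries the author name in its original casing because it is read from the stored copy rather than from the tokenized one.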
5.5 Advanced Representation Techniques
As mentioned above, Lucene allows values to be stored in different fields and
thus supports faceted retrieval approaches (as discussed in Sect. 5.4) out of the
box. This makes it possible to index the source code of software artifacts and to
enhance records with metadata such as the artifact's language, project environment,
documentation, etc. Search results are also ranked, so that
the best-matching result is always delivered first. The drawback, however, is that
Lucene's fields cannot be relationally connected as in a database, which makes
it difficult to search for operation signatures, for example. Queries such as “give
me all artifacts containing two methods add and sub receiving two int parameters
and returning an int” are thus not directly feasible. Using Boolean operators makes
it possible to concatenate fields from simple searches for individual operation in-