Infrastructure for Building Code Search Applications for Developers - Finding Source Code on the Web for Remix and Reuse


Figure 8.2 shows Sourcerer's relational model as an ER diagram. It shows the five elements of the model and a set of attributes for each of them. Table 8.3 gives details on all the attributes of the model elements. Together, Figure 8.2 and Table 8.3 show how the model elements are linked with each other, and how attributes in the relational model link its elements with the storage model. For example, the Project element's 'path' attribute links it to the physical location defined by the storage model.

Various tools in Sourcerer use this information to connect the relational information with the textual contents stored in the physical files.

Entities and Relations are the key elements of Sourcerer's relational model that enable code-specific search capabilities. Capturing fully qualified names (FQNs) and associating them with code entities allows code entities to be referenced and looked up across projects using the FQNs as keys. FQNs for entities therefore enable analysis of relations across projects. This led to innovative uses of structural information in code search applications, such as: (i) computing CodeRank (an adaptation of Google's PageRank algorithm to the code graph) and using it as a ranking heuristic in SCSE, and (ii) using feature vectors made up of the FQNs of used entities as a basis for computing usage similarity for entities in SSI.
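The two uses of FQNs named above can be illustrated with a small sketch. It is not Sourcerer's implementation: the text does not give the CodeRank parameters or the similarity measure, so the damping factor, the iteration count, the choice of Jaccard similarity, and all FQNs in the sample data are assumptions made for illustration only.

```python
def code_rank(uses, damping=0.85, iterations=50):
    """PageRank-style scores over a code graph keyed by FQN.

    `uses` maps an entity's FQN to the set of FQNs it uses.
    The damping factor and iteration count are assumed values.
    """
    nodes = set(uses)
    for targets in uses.values():
        nodes.update(targets)
    n = len(nodes)
    rank = {fqn: 1.0 / n for fqn in nodes}
    for _ in range(iterations):
        # Rank held by entities that use nothing is spread evenly.
        dangling = sum(rank[f] for f in nodes if not uses.get(f))
        new_rank = {f: (1.0 - damping) / n + damping * dangling / n
                    for f in nodes}
        for source, targets in uses.items():
            for target in targets:
                new_rank[target] += damping * rank[source] / len(targets)
        rank = new_rank
    return rank


def usage_similarity(used_a, used_b):
    """Jaccard similarity of two sets of used-entity FQNs (assumed measure)."""
    if not used_a and not used_b:
        return 0.0
    return len(used_a & used_b) / len(used_a | used_b)


# Hypothetical "uses" relations spanning three projects.
uses = {
    "app.Main.run": {"java.util.List.add", "lib.Parser.parse"},
    "app.Tool.exec": {"lib.Parser.parse"},
    "lib.Parser.parse": {"java.util.List.add"},
}
ranks = code_rank(uses)
similarity = usage_similarity(uses["app.Main.run"], uses["app.Tool.exec"])
```

Because every entity is identified by its FQN, a relation recorded in one project can point at an entity defined in another, which is what lets both the ranking and the similarity computation cross project boundaries.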

8.4.3 Index Model

The Index Model complements Sourcerer's relational model by facilitating the application of information retrieval techniques to code entities. The index model specifies a Document representation for each code Entity in the relational model. A document in the index model is made up of a collection of Fields. Each field has a name and different types of values associated with it, the most fundamental being a collection of Terms. A term is the basic unit of search and retrieval. Terms are extracted from various parts of an entity and stored in the corresponding field of the document that represents that entity.
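The Document, Field, and Term structure described above can be sketched in a few lines. This is a plain-Python illustration of the data model only, not Lucene's API or Sourcerer's code; the class layout, the field names `name` and `comment`, the lowercased terms, and the sample entities are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Field:
    """A named field holding a collection of terms."""
    name: str
    terms: list = field(default_factory=list)


@dataclass
class Document:
    """Index-side representation of one code entity, keyed by its FQN."""
    entity_fqn: str
    fields: dict = field(default_factory=dict)

    def add_terms(self, field_name, terms):
        # Create the field on first use, then append the extracted terms.
        target = self.fields.setdefault(field_name, Field(field_name))
        target.terms.extend(terms)


def search(documents, field_name, term):
    """Return FQNs of documents whose given field contains the term."""
    return [d.entity_fqn for d in documents
            if field_name in d.fields and term in d.fields[field_name].terms]


# Hypothetical entities with terms taken from their names and comments.
parse_doc = Document("lib.Parser.parse")
parse_doc.add_terms("name", ["parser", "parse"])
parse_doc.add_terms("comment", ["parse", "input", "text"])

add_doc = Document("java.util.List.add")
add_doc.add_terms("name", ["list", "add"])

docs = [parse_doc, add_doc]
hits = search(docs, "name", "parse")
```

Keeping terms grouped by field is what allows a query to target one part of an entity, for example matching a term in the name but not in the comment.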

Sourcerer's information retrieval component is based on the popular Lucene [41] information retrieval engine. Its index model therefore conforms to how Lucene models its contents. More details on Lucene's content model are available in [25].

Fields in Sourcerer's index model can be categorized into five types:

1. Fields for basic retrieval, which store terms coming from various parts of a code entity.

2. Fields for retrieval with signatures, which store terms coming from method signatures and also terms that indicate the number of arguments a method has.

3. Fields storing metadata, for example the type of the entity, so that a search can be limited to one or more types of entities.
