the edges in the graph. After constructing this graph, the code ranker applies Google's
PageRank [15] algorithm to the graph to compute a PageRank score (called
CodeRank) for each entity, which serves as a measure of the popularity of a code
entity in the code graph. SCSE used the value of CodeRank as one of the heuristics
to rank retrieved results.
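As a minimal sketch of the idea, the listing below runs the standard damping-factor formulation of PageRank by power iteration over an adjacency-list code graph. The entity names, graph representation, and parameters are illustrative assumptions, not Sourcerer's actual implementation.

import java.util.*;

// Minimal power-iteration PageRank over a directed code graph.
// Nodes are code entities; an edge u -> v means u references v.
public class CodeRankSketch {
    public static Map<String, Double> codeRank(
            Map<String, List<String>> outEdges, double damping, int iterations) {
        Set<String> nodes = new HashSet<>(outEdges.keySet());
        outEdges.values().forEach(nodes::addAll);
        int n = nodes.size();
        Map<String, Double> rank = new HashMap<>();
        for (String node : nodes) rank.put(node, 1.0 / n);

        for (int it = 0; it < iterations; it++) {
            Map<String, Double> next = new HashMap<>();
            for (String node : nodes) next.put(node, (1 - damping) / n);
            for (String u : nodes) {
                List<String> targets = outEdges.getOrDefault(u, List.of());
                if (targets.isEmpty()) {
                    // Dangling node: spread its rank uniformly over all nodes.
                    for (String v : nodes)
                        next.merge(v, damping * rank.get(u) / n, Double::sum);
                } else {
                    double share = damping * rank.get(u) / targets.size();
                    for (String v : targets) next.merge(v, share, Double::sum);
                }
            }
            rank = next;
        }
        return rank;
    }

    public static void main(String[] args) {
        // Hypothetical code graph: entities referencing other entities.
        Map<String, List<String>> edges = Map.of(
            "app.Main", List.of("util.Strings", "util.IO"),
            "util.IO", List.of("util.Strings"),
            "util.Strings", List.of());
        codeRank(edges, 0.85, 50).forEach((e, r) ->
            System.out.printf("%s -> %.4f%n", e, r));
    }
}

In this toy graph, util.Strings receives the highest score because both other entities reference it, which matches the intuition that heavily used entities are popular.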
8.8 Summary
The combination of models, services, and tools makes Sourcerer a unique infrastructure
supporting three different code search applications. Going back to the
requirements that were listed (in Sect. 8.2) for the three code search applications, we
can summarize how Sourcerer meets these requirements.
SCSE: The storage model, stored contents, and the crawler in Sourcerer allowed
collecting source code from a large number of open source repositories and storing
it locally, making it available for further processing. The relational model
and the code parser tool allowed fine-grained parsing and storing of the parsed
information in a readily available form. Being able to parse source code made it
possible to store and retrieve source code at the level of fine-grained entities such as
classes and methods. Using fully qualified names as keys for entities, and following
the relations in SourcererDB, SCSE provided the structure-based CodeRank measure
to rank code entities. As discussed in the index model, several code-specific
heuristics were supported to build retrieval schemes specific to source code.
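To illustrate the idea of keying entities by fully qualified names and following typed relations between them, the sketch below models a tiny in-memory store. The record fields and relation kinds are assumptions for this sketch, not SourcererDB's actual schema.

import java.util.*;

// Illustrative entity store keyed by fully qualified names (FQNs).
// Entity kinds and relation kinds here are hypothetical examples.
public class EntityStoreSketch {
    enum EntityKind { CLASS, METHOD, FIELD }
    enum RelationKind { CALLS, EXTENDS, USES }

    record Entity(String fqn, EntityKind kind) {}
    record Relation(String sourceFqn, RelationKind kind, String targetFqn) {}

    private final Map<String, Entity> entities = new HashMap<>();
    private final List<Relation> relations = new ArrayList<>();

    void addEntity(Entity e) { entities.put(e.fqn(), e); }
    void addRelation(Relation r) { relations.add(r); }

    // Follow outgoing relations of a given kind from an FQN,
    // e.g. everything a method uses or calls.
    List<Entity> outgoing(String fqn, RelationKind kind) {
        return relations.stream()
            .filter(r -> r.sourceFqn().equals(fqn) && r.kind() == kind)
            .map(r -> entities.get(r.targetFqn()))
            .filter(Objects::nonNull)
            .toList();
    }

    public static void main(String[] args) {
        EntityStoreSketch db = new EntityStoreSketch();
        db.addEntity(new Entity("util.Strings", EntityKind.CLASS));
        db.addEntity(new Entity("app.Main.run()", EntityKind.METHOD));
        db.addRelation(new Relation("app.Main.run()", RelationKind.USES, "util.Strings"));
        System.out.println(db.outgoing("app.Main.run()", RelationKind.USES));
    }
}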
CodeGenie: The semi-structured index model, with fields that supported retrieval
using signatures, provided the basic retrieval for CodeGenie. Information about code
entities and the relations between them allowed the implementation of dependency
slicing, a novel technique to extract and synthesize declaratively complete code
snippet collections for CodeGenie.
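Dependency slicing can be pictured as a transitive closure over the stored dependency relations: starting from a seed entity, collect everything it needs to compile. The sketch below shows that traversal; the graph representation and names are hypothetical, and CodeGenie's actual slicer handles many more relation kinds and synthesizes compilable code from the result.

import java.util.*;

// Sketch of dependency slicing: from a seed entity, follow dependency
// relations transitively to gather every entity the seed requires.
public class DependencySliceSketch {
    public static Set<String> slice(
            String seedFqn, Map<String, Set<String>> dependsOn) {
        Set<String> visited = new LinkedHashSet<>();
        Deque<String> work = new ArrayDeque<>();
        work.push(seedFqn);
        while (!work.isEmpty()) {
            String fqn = work.pop();
            if (visited.add(fqn)) {
                for (String dep : dependsOn.getOrDefault(fqn, Set.of()))
                    work.push(dep);
            }
        }
        return visited; // seed plus everything it transitively depends on
    }

    public static void main(String[] args) {
        // Hypothetical dependency relations among code entities.
        Map<String, Set<String>> deps = Map.of(
            "zip.Zipper.compress", Set.of("zip.Zipper", "io.Buffer"),
            "zip.Zipper", Set.of("io.Buffer"),
            "io.Buffer", Set.of());
        System.out.println(slice("zip.Zipper.compress", deps));
    }
}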
SAS: Information on entities and their usage (relations such as method calls and class
extensions) allowed building API usage profiles for each code entity in the form
of feature vectors. These profiles served as the basis for computing usage similarity
among code entities, making it possible to devise a novel indexing technique, SSI,
based on the usage similarity heuristic. Furthermore, the full relational information
among code entities allowed computing API usage statistics that helped implement
a useful snippet extraction technique.
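One common way to realize such a similarity computation is cosine similarity over the usage feature vectors. The sketch below assumes simple API-call counts as features; this is an illustration of the general approach rather than SAS's actual profile definition.

import java.util.*;

// Sketch of usage similarity: each entity's profile is a feature vector
// over the APIs it uses, and similarity is the cosine of the angle
// between two such vectors.
public class UsageSimilaritySketch {
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0;
        for (Map.Entry<String, Integer> e : a.entrySet())
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0);
        double normA = Math.sqrt(a.values().stream().mapToDouble(v -> v * v).sum());
        double normB = Math.sqrt(b.values().stream().mapToDouble(v -> v * v).sum());
        return (normA == 0 || normB == 0) ? 0 : dot / (normA * normB);
    }

    public static void main(String[] args) {
        // Hypothetical API-call counts for two code entities.
        Map<String, Integer> sorterA = Map.of("List.get", 4, "List.set", 2);
        Map<String, Integer> sorterB = Map.of("List.get", 3, "List.set", 1, "Math.min", 1);
        System.out.printf("usage similarity = %.3f%n", cosine(sorterA, sorterB));
    }
}

Entities with overlapping API usage score close to 1, which is the property an index like SSI can exploit to group similar code together.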
The three code search applications were built one after another, and Sourcerer
evolved to support their requirements. These requirements
can be seen as major challenges that code search infrastructure builders need
to address. A major lesson learned from implementing the three code search applications
was that structural information provides valuable ways to build effective
code search applications, and that the challenges inherent in building such applications
can be overcome by harnessing the large collections of source code and libraries available.