the edges in the graph. After constructing this graph, the code ranker applies Google's
PageRank [15] algorithm to the graph to compute a PageRank score (called
CodeRank) for each entity, which serves as a measure of the popularity of a code
entity in the code graph. SCSE used the value of CodeRank as one of the heuristics
to rank retrieved results.
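As a minimal sketch of the idea, the listing below runs the standard damping-factor formulation of PageRank by power iteration over an adjacency-list code graph. The entity names, graph representation, and parameters are illustrative assumptions, not Sourcerer's actual implementation.

import java.util.*;

// Minimal power-iteration PageRank over a directed code graph.
// Nodes are code entities; an edge u -> v means u references v.
public class CodeRankSketch {
    public static Map<String, Double> codeRank(
            Map<String, List<String>> outEdges, double damping, int iterations) {
        Set<String> nodes = new HashSet<>(outEdges.keySet());
        outEdges.values().forEach(nodes::addAll);
        int n = nodes.size();
        Map<String, Double> rank = new HashMap<>();
        for (String node : nodes) rank.put(node, 1.0 / n);

        for (int it = 0; it < iterations; it++) {
            Map<String, Double> next = new HashMap<>();
            for (String node : nodes) next.put(node, (1 - damping) / n);
            for (String u : nodes) {
                List<String> targets = outEdges.getOrDefault(u, List.of());
                if (targets.isEmpty()) {
                    // Dangling node: spread its rank uniformly over all nodes.
                    for (String v : nodes)
                        next.merge(v, damping * rank.get(u) / n, Double::sum);
                } else {
                    double share = damping * rank.get(u) / targets.size();
                    for (String v : targets) next.merge(v, share, Double::sum);
                }
            }
            rank = next;
        }
        return rank;
    }

    public static void main(String[] args) {
        // Hypothetical code graph: entities referencing other entities.
        Map<String, List<String>> edges = Map.of(
            "app.Main", List.of("util.Strings", "util.IO"),
            "util.IO", List.of("util.Strings"),
            "util.Strings", List.of());
        codeRank(edges, 0.85, 50).forEach((e, r) ->
            System.out.printf("%s -> %.4f%n", e, r));
    }
}

In this toy graph, util.Strings receives the highest score because both other entities reference it, which matches the intuition that heavily used entities are popular.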
8.8 Summary
The combination of models, services, and tools makes Sourcerer a unique infrastructure
supporting three different code search applications. Going back to the
requirements that were listed (in Sect. 8.2) for the three code search applications, we
can summarize how Sourcerer meets these requirements.
SCSE: The storage model, stored contents, and the crawler in Sourcerer allowed
collecting source code from a large number of open source repositories and storing
it locally, making it available for further processing. The relational model
and the code parser tool allowed fine-grained parsing and storing of the parsed
information in a readily available form. Being able to parse source code made it
possible to store and retrieve source code at the level of fine-grained entities such as
classes and methods. Using fully qualified names as keys for entities, and following
the relations in SourcererDB, SCSE provided the structure-based CodeRank measure
to rank code entities. As discussed in the index model, several code-specific
heuristics were supported to build retrieval schemes specific to source code.
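To illustrate the idea of keying entities by fully qualified names and following typed relations between them, the sketch below models a tiny in-memory store. The record fields and relation kinds are assumptions for this sketch, not SourcererDB's actual schema.

import java.util.*;

// Illustrative entity store keyed by fully qualified names (FQNs).
// Entity kinds and relation kinds here are hypothetical examples.
public class EntityStoreSketch {
    enum EntityKind { CLASS, METHOD, FIELD }
    enum RelationKind { CALLS, EXTENDS, USES }

    record Entity(String fqn, EntityKind kind) {}
    record Relation(String sourceFqn, RelationKind kind, String targetFqn) {}

    private final Map<String, Entity> entities = new HashMap<>();
    private final List<Relation> relations = new ArrayList<>();

    void addEntity(Entity e) { entities.put(e.fqn(), e); }
    void addRelation(Relation r) { relations.add(r); }

    // Follow outgoing relations of a given kind from an FQN,
    // e.g. everything a method uses or calls.
    List<Entity> outgoing(String fqn, RelationKind kind) {
        return relations.stream()
            .filter(r -> r.sourceFqn().equals(fqn) && r.kind() == kind)
            .map(r -> entities.get(r.targetFqn()))
            .filter(Objects::nonNull)
            .toList();
    }

    public static void main(String[] args) {
        EntityStoreSketch db = new EntityStoreSketch();
        db.addEntity(new Entity("util.Strings", EntityKind.CLASS));
        db.addEntity(new Entity("app.Main.run()", EntityKind.METHOD));
        db.addRelation(new Relation("app.Main.run()", RelationKind.USES, "util.Strings"));
        System.out.println(db.outgoing("app.Main.run()", RelationKind.USES));
    }
}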
CodeGenie: The semi-structured index model, with fields that supported retrieval
using signatures, provided the basic retrieval for CodeGenie. Information about code
entities and the relations between them allowed the implementation of dependency
slicing, a novel technique to extract and synthesize declaratively complete code
snippet collections for CodeGenie.
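Dependency slicing can be pictured as a transitive closure over the stored dependency relations: starting from a seed entity, collect everything it needs to compile. The sketch below shows that traversal; the graph representation and names are hypothetical, and CodeGenie's actual slicer handles many more relation kinds and synthesizes compilable code from the result.

import java.util.*;

// Sketch of dependency slicing: from a seed entity, follow dependency
// relations transitively to gather every entity the seed requires.
public class DependencySliceSketch {
    public static Set<String> slice(
            String seedFqn, Map<String, Set<String>> dependsOn) {
        Set<String> visited = new LinkedHashSet<>();
        Deque<String> work = new ArrayDeque<>();
        work.push(seedFqn);
        while (!work.isEmpty()) {
            String fqn = work.pop();
            if (visited.add(fqn)) {
                for (String dep : dependsOn.getOrDefault(fqn, Set.of()))
                    work.push(dep);
            }
        }
        return visited; // seed plus everything it transitively depends on
    }

    public static void main(String[] args) {
        // Hypothetical dependency relations among code entities.
        Map<String, Set<String>> deps = Map.of(
            "zip.Zipper.compress", Set.of("zip.Zipper", "io.Buffer"),
            "zip.Zipper", Set.of("io.Buffer"),
            "io.Buffer", Set.of());
        System.out.println(slice("zip.Zipper.compress", deps));
    }
}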
SAS: Information on entities and their usage (relations such as method calls and class
extensions) allowed building API usage profiles for each code entity in the form
of feature vectors. These profiles served as the basis for computing usage similarity
among code entities, making it possible to devise a novel indexing technique, SSI,
based on the usage similarity heuristic. Furthermore, the full relational information
among code entities allowed computing API usage statistics that helped implement
a useful snippet extraction technique.
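One common way to realize such a similarity computation is cosine similarity over the usage feature vectors. The sketch below assumes simple API-call counts as features; this is an illustration of the general approach rather than SAS's actual profile definition.

import java.util.*;

// Sketch of usage similarity: each entity's profile is a feature vector
// over the APIs it uses, and similarity is the cosine of the angle
// between two such vectors.
public class UsageSimilaritySketch {
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0;
        for (Map.Entry<String, Integer> e : a.entrySet())
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0);
        double normA = Math.sqrt(a.values().stream().mapToDouble(v -> v * v).sum());
        double normB = Math.sqrt(b.values().stream().mapToDouble(v -> v * v).sum());
        return (normA == 0 || normB == 0) ? 0 : dot / (normA * normB);
    }

    public static void main(String[] args) {
        // Hypothetical API-call counts for two code entities.
        Map<String, Integer> sorterA = Map.of("List.get", 4, "List.set", 2);
        Map<String, Integer> sorterB = Map.of("List.get", 3, "List.set", 1, "Math.min", 1);
        System.out.printf("usage similarity = %.3f%n", cosine(sorterA, sorterB));
    }
}

Entities with overlapping API usage score close to 1, which is the property an index like SSI can exploit to group similar code together.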
The three code search applications were built one after another, and Sourcerer
evolved to support their requirements. These requirements
can be seen as major challenges that code search infrastructure builders need
to address. A major lesson learned from implementing the three code search applications
was that structural information provides valuable ways to build effective
code search applications, and that the challenges inherent in building such applications
can be overcome by harnessing the large collections of source code and libraries available.