Applying Program Analysis to Code Retrieval - Finding Source Code on the Web for Remix and Reuse

One term-based approach is TF-IDF, which stands for term frequency-inverse document frequency. In TF-IDF, each document is broken into a collection of terms, and each term is associated with the number of times it occurs in that document. Terms are then weighted according to how common they are across the corpus, the intuition being that rare terms are more central to the meaning of a document than terms that occur regularly. To search the corpus, the user provides a list of terms, which are matched against each document's collection of terms. Documents are ranked according to how many of the searched terms they contain and how rare those terms are across the corpus. While TF-IDF is not the only statistical method for ranking documents, it sees widespread use due to its perceived quality. Apache Lucene, an open-source text indexing platform, uses TF-IDF as one of its primary ranking methods.
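
The weighting and ranking just described can be sketched in a few lines of Python. This is a minimal illustration, not how Lucene or any particular system implements it: the whitespace tokenizer, the logarithmic form of the inverse document frequency, the function names, and the three sample documents are all assumptions made for the example.

```python
import math
from collections import Counter

def tokenize(text):
    """Lowercase and split on whitespace; real systems do far more."""
    return text.lower().split()

def build_tfidf(docs):
    """Return one {term: tf * idf} dictionary per document."""
    term_counts = [Counter(tokenize(d)) for d in docs]
    n_docs = len(docs)
    # Document frequency: the number of documents each term appears in.
    df = Counter()
    for counts in term_counts:
        df.update(counts.keys())
    # Rare terms get a larger idf; a term found in every document gets 0.
    idf = {t: math.log(n_docs / df[t]) for t in df}
    return [{t: c * idf[t] for t, c in counts.items()} for counts in term_counts]

def search(query, docs):
    """Rank document indices by the summed weight of the query terms they contain."""
    weights = build_tfidf(docs)
    terms = tokenize(query)
    scores = [(sum(w.get(t, 0.0) for t in terms), i) for i, w in enumerate(weights)]
    ranked = sorted((s for s in scores if s[0] > 0), key=lambda s: -s[0])
    return [i for _, i in ranked]

# Hypothetical three-document corpus.
docs = [
    "read the file and print the result",
    "convert the stream to a byte array",
    "print the byte array to the console",
]
print(search("stream byte", docs))
```

In this sketch the second document ranks first because it contains both query terms, one of which (stream) occurs nowhere else in the corpus. A term such as "the", which appears in every document, receives a weight of zero and contributes nothing to the ranking.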

Statistical approaches to topic modeling can also be used to improve term-based searching. Topic modeling groups terms according to identified topics, which allows the terms within a topic to be used somewhat interchangeably. So if a developer searches for print, the search system can also return results relating to output. Latent semantic indexing (LSI) and latent Dirichlet allocation (LDA) are two approaches to topic modeling; we direct interested readers to Berry and Kogan's Text Mining: Applications and Theory [4].
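
To make the effect on a query concrete, the following sketch expands a search term with the other terms in its topic. It does not perform topic modeling: the topic groups are written by hand purely for illustration, whereas a technique such as LSI or LDA would derive them statistically from the corpus. The names TOPICS and expand_query are hypothetical.

```python
# Hypothetical topic groups, listed by hand for illustration only.
# A real system would learn such groupings from the corpus.
TOPICS = [
    {"print", "output", "write", "display"},
    {"read", "input", "scan"},
]

def expand_query(terms):
    """Add every term that shares a topic with one of the query terms."""
    expanded = set(terms)
    for term in terms:
        for topic in TOPICS:
            if term in topic:
                expanded |= topic
    return expanded

print(sorted(expand_query(["print"])))
```

A search for print would then also match documents that mention output, write, or display, while a term that belongs to no topic is passed through unchanged.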

Term-based search methods have a number of advantages over full text search. They can provide results ordered by relevance, which dramatically increases the usability of a search system, especially when a search could return thousands of results. Through topic modeling, term-based searches can also handle synonyms, which can otherwise cause significant problems if the vocabulary for a given search isn't entirely clear.

These advantages come at a cost. Term-based methods generally require an index to be created in advance of any searching, which can be time consuming, especially for large inputs. The index does, however, make individual searches faster than their full text equivalents. Another issue is that if the ranking is poor, ranked results become significantly less useful than unranked results. If users mistakenly trust a poor relevance ordering, they will fail to notice meaningful results.
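
The trade-off between indexing cost and search speed can be seen in a simple inverted index, sketched below. This is one common way to build such an index, assumed here for illustration; the function names and sample corpus are hypothetical, and the sketch omits ranking entirely.

```python
from collections import defaultdict

def build_index(docs):
    """One pass over the whole corpus: map each term to the ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def lookup(index, terms):
    """Return ids of documents containing every term, without rescanning any text."""
    postings = [index.get(t.lower(), set()) for t in terms]
    return set.intersection(*postings) if postings else set()

# Hypothetical corpus.
docs = [
    "read the file and print the result",
    "convert the stream to a byte array",
    "print the byte array to the console",
]
index = build_index(docs)                 # cost paid once, before any search
print(lookup(index, ["byte", "array"]))   # each search only intersects small sets
```

Building the index touches every term of every document, which is where the up-front cost comes from. Each later search only looks up and intersects the precomputed sets, whereas a full text search would have to scan the documents again for every query.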

Returning to our running example, let's look again at the developer searching for instances where the toByteArray method is called for ByteArrayOutputStream. Term-based searching simplifies the query dramatically: the developer can simply enter those two terms, and the system will return results with the examples containing both terms ranked at the top. Because this approach is purely text-based, the risk remains that an unrelated toByteArray method is referenced, and not the one associated with ByteArrayOutputStream.
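
The following sketch illustrates that risk. The three Java fragments are hypothetical and held as plain strings; the regular-expression tokenizer and the rule that a snippet matches when it contains both query terms are simplifying assumptions, with ranking left out.

```python
import re

def code_terms(source):
    """Split source text into identifier-like terms."""
    return set(re.findall(r"[A-Za-z_][A-Za-z0-9_]*", source))

# Hypothetical Java fragments, treated purely as text.
snippets = {
    "relevant": "ByteArrayOutputStream out = new ByteArrayOutputStream(); "
                "byte[] b = out.toByteArray();",
    "unrelated": "ByteArrayOutputStream log = new ByteArrayOutputStream(); "
                 "byte[] b = bigInteger.toByteArray();",
    "partial": "byte[] b = bigInteger.toByteArray();",
}

query = {"toByteArray", "ByteArrayOutputStream"}
# A snippet matches when every query term occurs somewhere in its text.
matches = [name for name, src in snippets.items() if query <= code_terms(src)]
print(matches)
```

Both the relevant and the unrelated snippet match, because each contains the two terms somewhere in its text. Nothing in a term-based match records which object toByteArray was actually called on.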

11.4 Structured Text Search

Not every term in a document is equally central to that document's meaning. This is the central insight behind TF-IDF, which uses different measures of frequency to determine a term's importance. Yet frequency is not the only method for determining
