One term-based approach is TF-IDF, which stands for term frequency-inverse
document frequency. In TF-IDF, each document is broken into a collection of terms,
and each of the terms is associated with the number of times it occurs in that
document. Terms are then weighted according to how common they are across the
corpus, the intuition being that rare terms are more central to the meaning of a
document than terms that occur regularly. To search the corpus, the user provides a list
of terms, which are matched against each document's collection of terms. Documents are ranked
according to how many of the searched terms they contain, and how common those
terms are. While TF-IDF is not the only statistical method for ranking documents,
it sees widespread use due to its perceived quality. Apache Lucene, an open-source
text indexing platform, uses TF-IDF as one of its primary ranking methods.
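
The weighting described above can be sketched in a few lines of Python. This is a minimal illustration, assuming whitespace tokenization, raw term counts, and the textbook weight log(N / document frequency); it is not the formula used by Lucene or any other specific system. The function names and the three sample documents are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace splitting stands in for a real tokenizer.
    return text.lower().split()


def build_index(docs):
    """Build per-document term counts and corpus-wide IDF weights.

    docs maps a document name to its text.
    """
    term_counts = {name: Counter(tokenize(text)) for name, text in docs.items()}
    doc_freq = Counter()
    for counts in term_counts.values():
        doc_freq.update(counts.keys())  # each document counts once per term
    n_docs = len(docs)
    # Rare terms get a high weight; a term found in every document gets 0.
    idf = {term: math.log(n_docs / df) for term, df in doc_freq.items()}
    return term_counts, idf


def search(query, term_counts, idf):
    """Return document names ranked by summed TF-IDF over the query terms."""
    terms = tokenize(query)
    scores = {}
    for name, counts in term_counts.items():
        score = sum(counts[t] * idf.get(t, 0.0) for t in terms)
        if score > 0:
            scores[name] = score
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical three-document corpus.
docs = {
    "a": "the stream writes the bytes to the buffer",
    "b": "the parser reads the file",
    "c": "the stream the stream the buffer",
}
term_counts, idf = build_index(docs)
ranked = search("stream buffer", term_counts, idf)
```

Here `ranked` places document "c" ahead of "a" because "c" mentions stream twice, and omits "b", which contains neither query term. A query for "the" alone returns nothing, since a term present in every document carries zero weight.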
Statistical approaches for topic modeling can also be used to improve term-based
searching. Topic modeling groups together terms according to identified topics,
which allows the terms to be used somewhat interchangeably. So if a developer
searches for print, the search system can also return results relating to output.
Latent semantic indexing (LSI) and latent Dirichlet allocation (LDA) are two
approaches for topic modeling, and we direct interested readers to Berry and Kogan's
Text Mining: Applications and Theory [4].
Term-based search methods have a number of advantages over full text search.
They can provide results ordered by relevance. This dramatically increases the
usability of the search systems, especially if searches could potentially return
thousands of results. Through topic modeling, term-based searches can also handle the
use of synonyms, which can cause significant problems if the vocabulary for a given
search isn't entirely clear.
These advantages come at a cost. Term-based methods generally require an index
to be created in advance of any searching, which can be time consuming, especially
for large input. The index does, however, make individual searches faster than their
full text equivalents. Another issue is that if the ranking is poor, ranked results become
significantly less useful than unranked results. If users mistakenly trust a poor
relevance ordering, they can fail to notice meaningful results.
Returning to our running example, let's look again at the developer searching
for instances where the toByteArray method is called for ByteArrayOutputStream.
Term-based searching simplifies the query dramatically, as now the developer
can simply enter those two terms and the system will return examples where
both terms are present ranked at the top. Because this approach is purely
text-based, the risk remains that an unrelated toByteArray method might be
referenced, and not the one associated with ByteArrayOutputStream.
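
The risk in the running example can be made concrete with a small sketch. It assumes a deliberately simple ranking (the number of distinct query terms found in each snippet) rather than any particular search engine's scoring. The snippet names, the Java fragments, and the `message` variable are hypothetical.

```python
import re


def terms_of(source):
    # Extract identifier-like terms; no parsing or type resolution is done.
    return set(re.findall(r"[A-Za-z_]\w*", source))


def rank_by_matches(query_terms, snippets):
    """Rank snippets by how many distinct query terms they contain."""
    scored = []
    for name, source in snippets.items():
        matches = len(set(query_terms) & terms_of(source))
        if matches:
            scored.append((matches, name))
    # Most matches first; ties broken by name so the order is deterministic.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return scored


# Hypothetical Java fragments. In "unrelated", toByteArray is called on a
# different object (message), not on the ByteArrayOutputStream.
snippets = {
    "intended": "ByteArrayOutputStream out = new ByteArrayOutputStream(); "
                "byte[] b = out.toByteArray();",
    "unrelated": "ByteArrayOutputStream out = new ByteArrayOutputStream(); "
                 "byte[] b = message.toByteArray();",
    "partial": "byte[] b = message.toByteArray();",
}
results = rank_by_matches(["toByteArray", "ByteArrayOutputStream"], snippets)
```

Both "intended" and "unrelated" score 2 and rank above "partial", which scores 1. Text alone cannot tell which object receives the toByteArray call, so the two top results are indistinguishable to the search system.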
11.4 Structured Text Search
Not every term in a document is equally central to that document's meaning. This is
the central insight behind TF-IDF, which uses different measures of frequency to
determine a term's importance. Yet frequency is not the only method for determining