has a term independence assumption, which states that the occurrences of one term
can be regarded as independent of the occurrences of any other term. However,
this may not be the case in practice.
When dealing with text documents, a commonly encountered problem is known
as the vocabulary mismatch problem. In essence, people may choose different
vocabulary to describe the same thing.
There are two aspects to the problem. First, there is a tremendous diversity in the
words people use to describe the same object or concept; this is called synonymy.
Users in different contexts, or with different needs, knowledge or linguistic habits
will describe the same information using different terms. For example, it has been
demonstrated that any two people choose the same main keyword for a single, well-
known object less than 20 % of the time on average. Indeed, this variability is much
greater than commonly believed and this places strict, low limits on the expected
performance of word-matching systems.
The second aspect relates to polysemy, a word having more than one distinct
meaning. In different contexts or when used by different people the same word
takes on varying referential significance (e.g., “bank” in river bank versus “bank” in
a savings bank). Thus the use of a term in a search query does not necessarily mean
that a text object containing or labeled by the same term is of interest. Because
human word use is characterized by extensive synonymy and polysemy, straightforward
term-matching schemes have serious shortcomings: relevant materials
will be missed because different people describe the same topic using different
words, and, because the same word can have different meanings, irrelevant material
will be retrieved. The basic problem is that people want to access information
based on meaning, but the words they select do not adequately express intended
meaning. Previous attempts to improve standard word searching and overcome the
diversity in human word usage have involved: restricting the allowable vocabulary
and training intermediaries to generate indexing and search keys; hand-crafting
thesauri to provide synonyms; or constructing explicit models of the relevant domain
knowledge. Not only are these methods labor-intensive for experts, but they are
also often not very successful.
Latent Semantic Indexing (LSI) is designed to overcome the vocabulary mismatch
problem faced by information retrieval systems (Deerwester et al. 1990;
Dumais 1995). Online services based on LSI are available, for example,
http://lsa.colorado. The idea is to index documents by
the conceptual topic or meaning of a document rather than by its literal terms.
LSI assumes the existence of some underlying semantic structure in the data
that is partially obscured by the randomness of word choice in a retrieval
process, and that this latent semantic structure can be estimated more accurately
with statistical techniques.
In LSI, a semantic space is constructed from a large matrix of observed
term-document associations. LSI uses a mathematical technique called Singular Value
Decomposition (SVD): the original, usually very large, term-by-document matrix
is approximated by a truncated SVD. A proper truncation can remove
noise from the original data and improve the recall and precision of
information retrieval.
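The truncation step above can be sketched in a few lines. The following is a minimal illustration, assuming a tiny hypothetical corpus (the document texts, the rank k = 2, and the helper names are all assumptions, not from the original): it builds a term-by-document count matrix, computes its SVD, keeps only the k largest singular values, and compares documents in the resulting k-dimensional latent space.

```python
import numpy as np

# Hypothetical toy corpus (not from the original text).
docs = [
    "human interface computer",
    "user interface system",
    "river bank water",
    "savings bank money",
]
terms = sorted({t for d in docs for t in d.split()})

# Term-by-document count matrix A: one row per term, one column per document.
A = np.array([[d.split().count(t) for d in docs] for t in terms], dtype=float)

# Singular Value Decomposition: A = U * diag(s) * Vt.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Truncate to the k largest singular values; A_k is the best rank-k
# approximation of A, which smooths away some of the word-choice "noise".
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Each document becomes a k-dimensional vector in the latent semantic space.
doc_vecs = (np.diag(s[:k]) @ Vt[:k, :]).T

def cosine(a, b):
    """Cosine similarity between two latent-space vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Documents 0 and 1 share vocabulary ("interface"), so they should sit
# closer together in the latent space than unrelated documents.
sim_01 = cosine(doc_vecs[0], doc_vecs[1])
```

In a full system one would weight the counts (e.g., with tf-idf) and fold queries into the same latent space before ranking, but the rank-k truncation shown here is the core of the technique.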