has a term independence assumption, which states that the occurrences of one term
can be regarded as independent of the occurrences of any other term. However,
this may not be the case in practice.
When dealing with text documents, a commonly encountered problem is known
as the vocabulary mismatch problem. In essence, people may choose different
vocabulary to describe the same thing.
There are two aspects to the problem. First, there is a tremendous diversity in the
words people use to describe the same object or concept; this is called synonymy.
Users in different contexts, or with different needs, knowledge or linguistic habits
will describe the same information using different terms. For example, it has been
demonstrated that any two people choose the same main keyword for a single, well-
known object less than 20 % of the time on average. Indeed, this variability is much
greater than commonly believed and this places strict, low limits on the expected
performance of word-matching systems.
The second aspect relates to polysemy, a word having more than one distinct
meaning. In different contexts or when used by different people the same word
takes on varying referential significance (e.g., “bank” in river bank versus “bank” in
a savings bank). Thus the use of a term in a search query does not necessarily mean
that a text object containing or labeled by the same term is of interest. Because
human word use is characterized by extensive synonymy and polysemy, straightforward
term-matching schemes have serious shortcomings: relevant materials
will be missed because different people describe the same topic using different
words, and, because the same word can have different meanings, irrelevant material
will be retrieved. The basic problem is that people want to access information
based on meaning, but the words they select do not adequately express intended
meaning. Previous attempts to improve standard word searching and overcome the
diversity in human word usage have involved: restricting the allowable vocabulary
and training intermediaries to generate indexing and search keys; hand-crafting
thesauri to provide synonyms; or constructing explicit models of the relevant domain
knowledge. Not only are these methods labor-intensive for experts, but they are
also often not very successful.
Latent Semantic Indexing (LSI) is designed to overcome the vocabulary mismatch
problem faced by information retrieval systems (Deerwester et al. 1990;
Dumais 1995). Online services based on LSI are available, for example,
http://lsa.colorado. The idea is to index documents by
the conceptual topic or meaning of a document rather than by its literal terms.
LSI assumes the existence of some underlying semantic structure in the data
that is partially obscured by the randomness of word choice in a retrieval
process, and that this latent semantic structure can be estimated more accurately
with statistical techniques.
In LSI, a semantic space is constructed from a large matrix of observed
term-document associations. LSI uses a mathematical technique called Singular Value
Decomposition (SVD): the original, usually very large, term-by-document matrix
is approximated by a truncated SVD. A proper truncation can remove
noise from the original data and improve the recall and precision of
information retrieval.
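The truncation step above can be sketched in a few lines. The following is a minimal illustration, assuming a tiny hypothetical corpus (the document texts, the rank k = 2, and the helper names are all assumptions, not from the original): it builds a term-by-document count matrix, computes its SVD, keeps only the k largest singular values, and compares documents in the resulting k-dimensional latent space.

```python
import numpy as np

# Hypothetical toy corpus (not from the original text).
docs = [
    "human interface computer",
    "user interface system",
    "river bank water",
    "savings bank money",
]
terms = sorted({t for d in docs for t in d.split()})

# Term-by-document count matrix A: one row per term, one column per document.
A = np.array([[d.split().count(t) for d in docs] for t in terms], dtype=float)

# Singular Value Decomposition: A = U * diag(s) * Vt.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Truncate to the k largest singular values; A_k is the best rank-k
# approximation of A, which smooths away some of the word-choice "noise".
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Each document becomes a k-dimensional vector in the latent semantic space.
doc_vecs = (np.diag(s[:k]) @ Vt[:k, :]).T

def cosine(a, b):
    """Cosine similarity between two latent-space vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Documents 0 and 1 share vocabulary ("interface"), so they should sit
# closer together in the latent space than unrelated documents.
sim_01 = cosine(doc_vecs[0], doc_vecs[1])
```

In a full system one would weight the counts (e.g., with tf-idf) and fold queries into the same latent space before ranking, but the rank-k truncation shown here is the core of the technique.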