associations. Documents are amorphous. An isomorphism essentially implies
identity. So finding associations in a collection of textual documents is an
important and challenging problem.
Traditional text mining generally analyzes a text document by extracting key
words, phrases, concepts, etc., representing them in an intermediate form
refined from the original text, and then subjecting that form to further
analysis with data mining techniques (e.g., to determine associations of
concepts, key phrases, names, addresses, product names, etc.). Feldman and his
colleagues [6, 7, 9] proposed the KDT and FACT systems to discover association
rules based on the keywords labeling the documents and on background knowledge
about those keywords and the relationships between them. This approach is not
very effective, because a substantially large amount of background knowledge is
required. An automated alternative was therefore adopted, in which documents
are labeled by rules learned from already-labeled documents [13]. However, some
of the resulting association rules merely reconstruct compound words (such as
“Wall” and “Street,” which often co-occur) [19]. Feldman et al. [6, 8] further
proposed term extraction modules that generate association rules from selected
key words. Nevertheless, a system without the need for human labeling is
desirable. Holt and Chung [11] presented the Multipass-Apriori and
Multipass-DHP algorithms to efficiently find association rules in text by
modifying the Apriori algorithm [2] and the DHP algorithm [18], respectively.
However, these methods did not consider the word distribution in a document,
that is, they do not identify the importance of a word in a document.
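The keyword-based rule mining discussed above can be illustrated with a minimal Apriori-style sketch over documents represented as keyword sets. This is a plain single-pass-per-level Apriori, not the Multipass variants of [11]; the function name, sample documents, and support threshold are all illustrative assumptions:

```python
def apriori_terms(docs, min_support):
    """Find frequent term sets in keyword-labeled documents (basic Apriori)."""
    doc_sets = [set(d) for d in docs]
    n = len(doc_sets)
    # level 1 candidates: every distinct keyword
    current = [frozenset([t]) for d in doc_sets for t in d]
    current = list(set(current))
    frequent = {}
    k = 1
    while current:
        # count how many documents contain each candidate set
        counts = {c: sum(1 for d in doc_sets if c <= d) for c in current}
        level = {c: cnt / n for c, cnt in counts.items()
                 if cnt / n >= min_support}
        frequent.update(level)
        # generate (k+1)-candidates by joining frequent k-sets
        keys = list(level)
        current = list({a | b for i, a in enumerate(keys) for b in keys[i + 1:]
                        if len(a | b) == k + 1})
        k += 1
    return frequent

docs = [["wall", "street", "stock"],
        ["wall", "street", "bank"],
        ["stock", "market"],
        ["wall", "street", "market"]]
freq = apriori_terms(docs, min_support=0.5)
# {"wall", "street"} is frequent: it co-occurs in 3 of 4 documents
```

Note that on this toy corpus the only frequent pair is exactly a compound word ("Wall Street"), illustrating the weakness observed in [19].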
Under the trivial definition of a distance measure in this space, no matter
what method is used, some common words are more frequent in a document than
other words. The simple frequency of occurrence of a word is not adequate,
since some documents are longer than others. Furthermore, some words may occur
frequently across many documents. In most cases, words that appear in only a
few documents tend to be the most “important.” Techniques such as TFIDF [21]
have been proposed to deal directly with some of these problems. The TFIDF
value is the weight of a term in each document. When ranking documents relevant
to a search query, a term with a large TFIDF value carries more weight than
terms with smaller TFIDF values.
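The TFIDF weighting just described can be sketched as follows. Several weighting variants exist; this one assumes raw term frequency normalized by document length, multiplied by the log inverse document frequency, and the sample documents are illustrative:

```python
import math
from collections import Counter

def tfidf(docs):
    """Compute per-document TF-IDF weights; docs is a list of token lists."""
    n = len(docs)
    # document frequency: number of documents containing each term
    df = Counter(t for d in docs for t in set(d))
    weights = []
    for d in docs:
        tf = Counter(d)
        total = len(d)
        # tf normalized by document length, times idf = log(n / df);
        # a term occurring in every document gets weight 0
        weights.append({t: (cnt / total) * math.log(n / df[t])
                        for t, cnt in tf.items()})
    return weights

docs = [["apple", "banana", "apple"],
        ["banana", "cherry"],
        ["apple", "cherry", "cherry"]]
w = tfidf(docs)
# in document 0, "apple" (tf = 2/3) outweighs "banana" (tf = 1/3)
```

The length normalization addresses the point above that raw counts favor longer documents, while the idf factor downweights words occurring across many documents.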
A general framework for text mining consists of two phases. The first phase,
feature extraction, extracts key terms from a collection of “indexed” docu-
ments; in the second phase, various methods such as association rule algorithms
may be applied to determine relations between the features.
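A toy sketch of this two-phase pipeline, under the simplifying assumption that a document's features are just its most frequent terms (real systems use far more elaborate extraction):

```python
from collections import Counter
from itertools import combinations

def extract_features(doc, k=3):
    """Phase 1: keep the k most frequent terms of a tokenized document."""
    return {t for t, _ in Counter(doc).most_common(k)}

def pair_counts(docs, k=3):
    """Phase 2: count co-occurring feature pairs across documents,
    as input for association rule mining."""
    counts = Counter()
    for d in docs:
        feats = sorted(extract_features(d, k))
        counts.update(combinations(feats, 2))
    return counts

docs = [["wall", "street", "wall", "bank"],
        ["wall", "street", "market"]]
counts = pair_counts(docs)
# ("street", "wall") co-occurs in both documents
```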
While performing association analyses on a collection of documents, all
documents should be indexed and stored in an intermediate form. Document
indexing originated from the task of assigning terms to documents for
retrieval or extraction purposes. In an early approach, an indexing model was
developed based on the assumption that a document should be assigned those
terms that are used by queries to retrieve the relevant document [10, 16].
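Assigning terms to documents for retrieval is commonly realized with an inverted index. The sketch below is a generic illustration of that data structure, not the specific indexing model of [10, 16]:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the set of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, terms in enumerate(docs):
        for t in terms:
            index[t].add(doc_id)
    return index

def retrieve(index, query_terms):
    """Return ids of documents containing all query terms (boolean AND)."""
    sets = [index.get(t, set()) for t in query_terms]
    return set.intersection(*sets) if sets else set()

docs = [["wall", "street", "stock"],
        ["stock", "market"],
        ["wall", "street", "journal"]]
idx = build_inverted_index(docs)
result = retrieve(idx, ["wall", "street"])  # -> {0, 2}
```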
Weighted indexing is the weighting of the index terms with respect to the
document; this model was given a theoretical justification in terms