associations. Documents are amorphous. An isomorphism essentially implies
identity. So finding associations in a collection of textual documents is an
important and challenging problem.
Traditional text mining generally analyzes a text document by extracting key
words, phrases, concepts, etc., representing them in an intermediate form
refined from the original text, and then subjecting that form to further
analysis with data mining techniques (e.g., to determine associations of
concepts, key phrases, names, addresses, product names, etc.). Feldman and his
colleagues [6, 7, 9] proposed the KDT and FACT systems to discover association
rules based on the keywords labeling the documents and on background knowledge
about those keywords and the relationships between them. This approach is not
very effective, because a substantially large amount of background knowledge is
required. An automated alternative was therefore adopted, in which documents
are labeled by rules learned from already-labeled documents [13]. However, some
of the resulting association rules merely reconstruct compound words (such as
“Wall” and “Street,” which often co-occur) [19]. Feldman et al. [6, 8] further
proposed term extraction modules that generate association rules from selected
key words. Nevertheless, a system without the need for human labeling is
desirable. Holt and Chung [11] presented the Multipass-Apriori and
Multipass-DHP algorithms to efficiently find association rules in text by
modifying the Apriori algorithm [2] and the DHP algorithm [18], respectively.
However, these methods did not consider the word distribution in a document,
that is, they do not identify the importance of a word in a document.
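The keyword-based rule mining discussed above can be illustrated with a minimal Apriori-style sketch over documents represented as keyword sets. This is a plain single-pass-per-level Apriori, not the Multipass variants of [11]; the function name, sample documents, and support threshold are all illustrative assumptions:

```python
def apriori_terms(docs, min_support):
    """Find frequent term sets in keyword-labeled documents (basic Apriori)."""
    doc_sets = [set(d) for d in docs]
    n = len(doc_sets)
    # level 1 candidates: every distinct keyword
    current = [frozenset([t]) for d in doc_sets for t in d]
    current = list(set(current))
    frequent = {}
    k = 1
    while current:
        # count how many documents contain each candidate set
        counts = {c: sum(1 for d in doc_sets if c <= d) for c in current}
        level = {c: cnt / n for c, cnt in counts.items()
                 if cnt / n >= min_support}
        frequent.update(level)
        # generate (k+1)-candidates by joining frequent k-sets
        keys = list(level)
        current = list({a | b for i, a in enumerate(keys) for b in keys[i + 1:]
                        if len(a | b) == k + 1})
        k += 1
    return frequent

docs = [["wall", "street", "stock"],
        ["wall", "street", "bank"],
        ["stock", "market"],
        ["wall", "street", "market"]]
freq = apriori_terms(docs, min_support=0.5)
# {"wall", "street"} is frequent: it co-occurs in 3 of 4 documents
```

Note that on this toy corpus the only frequent pair is exactly a compound word ("Wall Street"), illustrating the weakness observed in [19].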
Under the trivial definition of a distance measure in this space, no matter
what method is used, some common words are more frequent in a document than
other words. The simple frequency of occurrence of a word is not adequate,
since some documents are longer than others. Furthermore, some words may occur
frequently across many documents. In most cases, words that appear in only a
few documents tend to be the most “important.” Techniques such as TFIDF [21]
have been proposed to deal directly with some of these problems. The TFIDF
value is the weight of a term in each document. When ranking documents relevant
to a search query, a term with a large TFIDF value carries more weight than
terms with smaller TFIDF values.
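The TFIDF weighting just described can be sketched as follows. Several weighting variants exist; this one assumes raw term frequency normalized by document length, multiplied by the log inverse document frequency, and the sample documents are illustrative:

```python
import math
from collections import Counter

def tfidf(docs):
    """Compute per-document TF-IDF weights; docs is a list of token lists."""
    n = len(docs)
    # document frequency: number of documents containing each term
    df = Counter(t for d in docs for t in set(d))
    weights = []
    for d in docs:
        tf = Counter(d)
        total = len(d)
        # tf normalized by document length, times idf = log(n / df);
        # a term occurring in every document gets weight 0
        weights.append({t: (cnt / total) * math.log(n / df[t])
                        for t, cnt in tf.items()})
    return weights

docs = [["apple", "banana", "apple"],
        ["banana", "cherry"],
        ["apple", "cherry", "cherry"]]
w = tfidf(docs)
# in document 0, "apple" (tf = 2/3) outweighs "banana" (tf = 1/3)
```

The length normalization addresses the point above that raw counts favor longer documents, while the idf factor downweights words occurring across many documents.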
A general framework for text mining consists of two phases. The first phase,
feature extraction, extracts key terms from a collection of “indexed” docu-
ments; in the second phase, various methods such as association rule algorithms
may be applied to determine relations between the features.
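A toy sketch of this two-phase pipeline, under the simplifying assumption that a document's features are just its most frequent terms (real systems use far more elaborate extraction):

```python
from collections import Counter
from itertools import combinations

def extract_features(doc, k=3):
    """Phase 1: keep the k most frequent terms of a tokenized document."""
    return {t for t, _ in Counter(doc).most_common(k)}

def pair_counts(docs, k=3):
    """Phase 2: count co-occurring feature pairs across documents,
    as input for association rule mining."""
    counts = Counter()
    for d in docs:
        feats = sorted(extract_features(d, k))
        counts.update(combinations(feats, 2))
    return counts

docs = [["wall", "street", "wall", "bank"],
        ["wall", "street", "market"]]
counts = pair_counts(docs)
# ("street", "wall") co-occurs in both documents
```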
While performing association analyses on a collection of documents, all
documents should be indexed and stored in an intermediate form. Document
indexing originated from the task of assigning terms to documents for
retrieval or extraction purposes. In an early approach, an indexing model was
developed based on the assumption that a document should be assigned those
terms that are used by queries to retrieve the relevant document [10, 16].
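Assigning terms to documents for retrieval is commonly realized with an inverted index. The sketch below is a generic illustration of that data structure, not the specific indexing model of [10, 16]:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the set of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, terms in enumerate(docs):
        for t in terms:
            index[t].add(doc_id)
    return index

def retrieve(index, query_terms):
    """Return ids of documents containing all query terms (boolean AND)."""
    sets = [index.get(t, set()) for t in query_terms]
    return set.intersection(*sets) if sets else set()

docs = [["wall", "street", "stock"],
        ["stock", "market"],
        ["wall", "street", "journal"]]
idx = build_inverted_index(docs)
result = retrieve(idx, ["wall", "street"])  # -> {0, 2}
```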
Weighted indexing is the weighting of the index terms with respect to the
document; this model was given a theoretical justification in terms