Generating Concept Hierarchies from User Queries - Data Mining: Foundations and Practice

Databases Reference

In-Depth Information

1. select queries to cluster

2. select keywords associated with queries

3. reduce set of keywords to manageable length

4. generate feature vector for each query

5. use HAC algorithm to cluster queries

6. apply min-max partitioning algorithm to flatten dendrogram

Fig. 1. HAC + P + FSR algorithm

database is used to retrieve the list of queries to cluster, the list of documents

that form the context for each query, and the list of keywords from those

documents that are used to form feature vectors for the queries (steps one

through four of Fig. 1). Section 3.2 describes HAC, which is a well known

clustering algorithm. The next section describes how min-max partitioning is

used to flatten the cluster hierarchy into a shallow multi-way tree. The last

section describes an alternative mechanism for partitioning the tree; it uses

the same partitioning algorithm but applies different metrics to the generated

clusters while partitioning.

The problems encountered extracting keywords from text snippets re-

turned by a search engine are all avoided by the HAC + P + FSR algorithm. It

has access to the internal data of the RightNow system; one key advantage of

this is that the documents in the FAQ repository have already been processed

using stop word lists and stemming, and the resulting keyword phrases have

been extracted and stored in a document index. Each phrase includes a count

of the number of times it appeared in different sections of the document (the

FAQ documents are structured and contain sections such as the title, key-

words, question, and answer). Currently, the HAC+P+FSR algorithm only

uses single-word phrases in its feature vectors.

The user queries have also been filtered by the stop word list, stemmed,

and stored along with a count of the documents matching the queries and

a list of the most relevant documents' IDs. The algorithm groups the search

queries by stemmed, filtered phrase and selects the most frequently occurring

queries. Queries are only included for which at least one matching document

was found; queries with no matches have no context and therefore cannot be

clustered.

The feature selection phase of the algorithm is shown in Fig. 2. On com-

pletion there are N unique queries to be clustered and at most K distinct

keywords. K will be the length of the feature vector for each query; it is desir-

able to limit K to a manageable value. This is essential for a FAQ repository;

even if the value of M , the number of documents per query, is 20, if the system

stores 500 keywords per document, there could potentially be 10,000 keywords

per query.

There are a number of available techniques for reducing the dimensionality

of a data set; see the analysis by Forman for a comparison of several methods

specifically for text classification problems [7]. However, these techniques all

Data Mining: Foundations and Practice

Search WWH ::

Custom Search

Home