Databases Reference
In-Depth Information
1. select queries to cluster
2. select keywords associated with queries
3. reduce set of keywords to manageable length
4. generate feature vector for each query
5. use HAC algorithm to cluster queries
6. apply min-max partitioning algorithm to flatten dendrogram
Fig. 1. HAC + P + FSR algorithm
database is used to retrieve the list of queries to cluster, the list of documents
that form the context for each query, and the list of keywords from those
documents that are used to form feature vectors for the queries (steps one
through four of Fig. 1). Section 3.2 describes HAC, which is a well known
clustering algorithm. The next section describes how min-max partitioning is
used to flatten the cluster hierarchy into a shallow multi-way tree. The last
section describes an alternative mechanism for partitioning the tree; it uses
the same partitioning algorithm but applies different metrics to the generated
clusters while partitioning.
The problems encountered extracting keywords from text snippets re-
turned by a search engine are all avoided by the HAC + P + FSR algorithm. It
has access to the internal data of the RightNow system; one key advantage of
this is that the documents in the FAQ repository have already been processed
using stop word lists and stemming, and the resulting keyword phrases have
been extracted and stored in a document index. Each phrase includes a count
of the number of times it appeared in different sections of the document (the
FAQ documents are structured and contain sections such as the title, key-
words, question, and answer). Currently, the HAC+P+FSR algorithm only
uses single-word phrases in its feature vectors.
The user queries have also been filtered by the stop word list, stemmed,
and stored along with a count of the documents matching the queries and
a list of the most relevant documents' IDs. The algorithm groups the search
queries by stemmed, filtered phrase and selects the most frequently occurring
queries. Queries are only included for which at least one matching document
was found; queries with no matches have no context and therefore cannot be
clustered.
The feature selection phase of the algorithm is shown in Fig. 2. On com-
pletion there are N unique queries to be clustered and at most K distinct
keywords. K will be the length of the feature vector for each query; it is desir-
able to limit K to a manageable value. This is essential for a FAQ repository;
even if the value of M , the number of documents per query, is 20, if the system
stores 500 keywords per document, there could potentially be 10,000 keywords
per query.
There are a number of available techniques for reducing the dimensionality
of a data set; see the analysis by Forman for a comparison of several methods
specifically for text classification problems [7]. However, these techniques all
Search WWH ::




Custom Search