Table 12. The maximum frequent itemset that the user searches
No.   itemset
1     computer, programming language, algorithm, derivative
We propose a new point of view: “The maximum frequent itemsets are
considered as documents and the classes of such documents are considered as
user interests”. Such documents may be called interesting documents. The
classes to which these interesting documents belong are the user's interests.
This means that discovering a user's interests amounts to classifying
interesting documents. Suppose we have a set of classes
C = {computer science, math}, a set of terms
T = {computer, programming language, algorithm, derivative} and the set of
classification rules in table 6. Each maximum frequent itemset that the user
searches is modeled as a document vector (the so-called interesting document
vector or user interest vector) whose elements are the supports of its member
items. Note that the supports of these items are shown in table 8.
Table 13. Interesting document vector
No.   vector
1     (computer=4, programming language=2, algorithm=2, derivative=2)
Table 14. Normalized interesting document vector
No.   vector
1     (computer=0.4, programming language=0.2, algorithm=0.2, derivative=0.2)
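The step from Table 13 to Table 14 can be sketched in a few lines of Python. This is only an illustration: the values in the two tables are consistent with dividing each support by the sum of all supports, so the sketch assumes that form of normalization. The function name normalize is hypothetical and does not come from the text.

```python
# Supports of the items in the maximum frequent itemset (values from Table 13).
supports = {
    "computer": 4,
    "programming language": 2,
    "algorithm": 2,
    "derivative": 2,
}


def normalize(vector):
    """Divide each support by the total so that the elements sum to 1.

    Assumption: sum normalization, inferred from Tables 13 and 14.
    """
    total = sum(vector.values())
    if total == 0:
        raise ValueError("cannot normalize a vector whose supports sum to zero")
    return {term: value / total for term, value in vector.items()}


normalized = normalize(supports)
print(normalized)
```

With the supports of Table 13 the total is 10, so "computer" becomes 4/10 and each of the other three terms becomes 2/10, as in Table 14.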
Table 15. Nominal interesting document vector
No.   vector
1     (computer=medium, programming language=medium, algorithm=medium, derivative=medium)
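Mapping the normalized values of Table 14 to the nominal values of Table 15 might look like the following sketch. The cut points 0.1 and 0.5 and the name to_nominal are hypothetical: this section does not state the thresholds, and these were chosen only so that the result reproduces Table 15.

```python
# Hypothetical cut points; the source does not give them in this section.
LOW_UPPER = 0.1   # values below this are "low"
HIGH_LOWER = 0.5  # values at or above this are "high"


def to_nominal(value):
    """Map a normalized support to one of the labels low, medium, high."""
    if value < LOW_UPPER:
        return "low"
    if value < HIGH_LOWER:
        return "medium"
    return "high"


# Normalized interesting document vector (values from Table 14).
normalized = {
    "computer": 0.4,
    "programming language": 0.2,
    "algorithm": 0.2,
    "derivative": 0.2,
}

nominal = {term: to_nominal(value) for term, value in normalized.items()}
print(nominal)
```

Under these assumed thresholds both 0.4 and 0.2 fall in the medium band, which matches the four medium entries of Table 15.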
An SVM, a decision tree or a neural network can be used to classify
documents. Here we use a decision tree as the sample classifier for convenience
because we intend to re-use the classification rules in section III; with the
SVM approach we would instead have to determine the weight vector W*. However,
the SVM approach is more powerful than the decision tree for document
classification when the training data is huge.
Applying classification rule 2, the interesting document belongs to the class
computer science because the frequencies of “derivative” and “computer” are
medium and medium, respectively. So we can state that user U has only one
interest: computer science.
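Applying a classification rule to the nominal vector can be illustrated as follows. This is a minimal sketch, not the rule set of table 6: only rule 2 is included, and its conditions (derivative=medium and computer=medium imply computer science) are reconstructed from the sentence above. The names RULES and classify are hypothetical.

```python
# Each rule is a pair (conditions, class label).
# Only rule 2 is listed, reconstructed from the text; the other rules of
# table 6 are not reproduced in this section and are therefore omitted.
RULES = [
    ({"derivative": "medium", "computer": "medium"}, "computer science"),
]


def classify(document, rules):
    """Return the classes of all rules whose conditions the document satisfies."""
    matched = []
    for conditions, label in rules:
        satisfied = all(
            document.get(term) == level for term, level in conditions.items()
        )
        if satisfied and label not in matched:
            matched.append(label)
    return matched


# Nominal interesting document vector (values from Table 15).
nominal = {
    "computer": "medium",
    "programming language": "medium",
    "algorithm": "medium",
    "derivative": "medium",
}

interests = classify(nominal, RULES)
print(interests)
```

The list returned by classify plays the role of the user's interests; with rule 2 alone it contains the single class computer science.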
Note that when a neural network is used for document classification, the
interesting document vector is specified as a Boolean document vector (or Boolean