Information Technology Reference
In-Depth Information
remove tuples corresponding to bigrams. Next, we detect headwords of multi-
word tags and use this to infer additional
is-a
relationships. We then use lexico-
syntactic patterns to extract additional
relationships. Finally,
we leverage pairs of tags sharing a common child in the extracted ontology to
infer additional
is-a
and
has-a
is-a
relationships. The next three sections describe the phases.
4.1 Preprocessing
The preprocessing step is primarily a cleaning step. It takes as input a CTS and
performs the following tasks: (1) Any user information is projected away; we
found looking at transactions at the level of group of users was most effective in
ontology extraction. (2) Words with non-English characters are removed from
the input data using the same method as in [5]. This adequately removed non-
English words from all of our datasets. (3) Basic stemming: singular nouns are
substituted for their plural forms. (4) Since tags occurring very infrequently are
not statistically reliable, we removed tags or items that occurred fewer than 5
times. This threshold was determined empirically. (5) Verbs and verb phrases
are removed by applying the Stanford parser
3
to each tag. This prunes tags that
are used for organizing but convey no meaning about the item being tagged [7].
4.2. Detecting Potential Relationships Using Association Rules
Adapting tagged data to market basket analysis requires defining how to build
transactions from tags, which in turn requires defining “co-occurrence”. We ex-
plored three different definitions of co-occurrence. Empirically, we determined
that the most effective co-occurrence definition is the following: Tags
t
and
t
co-occur if both were used to tag the same item (by possibly different users). The
frequency of
t, t
}
equals the number of distinct items which were assigned both
tags
t
and
t
. Our careful study of the best definition of co-occurrence [20] allows
us to more optimally use association rules than previous approaches, e.g., [23].
We use the FP-tree association rule mining algorithm [8] to extract frequent
tag sets
4
and interesting rules from the set of transactions. The
support
of a tag
set
X
is the proportion of transactions containing tag set
X
and the
confidence
of a rule is defined as confidence(
X
{
Y
)
/
support(
X
) — i.e.,
the proportion of transactions in which
X
and
Y
occur together among those in
which
X
appears. In this paper, we refer to the well-known definition of confidence
as forward confidence (FC). We also introduce a new notion, reverse confidence
(RC) as follows: reverse confidence(
X
⇒
Y
) = support(
X
∪
Y
)
/
support(
Y
).
We assume tags assigned by users tend to contain both a term in the ontology
and another term that has relationship with it. Therefore, if two keywords co-
occur frequently, they are likely to be related. We use
support
to filter sets of tags
with a cardinality of two. However, popular unrelated terms may occur together
frequently, so we use
confidence
to remove tuples containing unrelated tags.
Because terms which co-occur with high confidence are sometimes synonyms
⇒
Y
) := support(
X
∪
3
http://nlp.stanford.edu/software/lex-parser.shtml
4
Tag sets correspond to itemsets in the context of frequent itemset mining.