remove tuples corresponding to bigrams. Next, we detect headwords of multi-word tags and use this to infer additional is-a relationships. We then use lexico-syntactic patterns to extract additional is-a relationships. Finally, we leverage pairs of tags sharing a common child in the extracted ontology to infer additional is-a and has-a relationships. The next three sections describe the phases.
4.1 Preprocessing
The preprocessing step is primarily a cleaning step. It takes as input a CTS and
performs the following tasks: (1) Any user information is projected away; we found that looking at transactions at the level of groups of users was most effective in ontology extraction. (2) Words with non-English characters are removed from
the input data using the same method as in [5]. This adequately removed non-
English words from all of our datasets. (3) Basic stemming: singular nouns are
substituted for their plural forms. (4) Since tags occurring very infrequently are
not statistically reliable, we removed tags or items that occurred fewer than 5
times. This threshold was determined empirically. (5) Verbs and verb phrases
are removed by applying the Stanford parser 3 to each tag. This prunes tags that
are used for organizing but convey no meaning about the item being tagged [7].
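The cleaning steps above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the input is assumed to be (user, item, tag) triples, a trivial suffix rule stands in for real stemming, and the non-English filtering and verb removal of steps (2) and (5) are omitted.

```python
from collections import Counter

def preprocess(assignments, min_count=5):
    """Sketch of preprocessing for a collaborative tagging system (CTS).

    `assignments` is assumed to be an iterable of (user, item, tag)
    triples; steps (2) and (5) from the text are omitted for brevity.
    """
    # (1) Project away user information: keep distinct (item, tag) pairs.
    pairs = {(item, tag) for _user, item, tag in assignments}

    # (3) Basic stemming: substitute singular nouns for plural forms.
    # A crude suffix rule stands in for a real stemmer here.
    def singularize(tag):
        if tag.endswith("s") and not tag.endswith("ss"):
            return tag[:-1]
        return tag

    pairs = {(item, singularize(tag)) for item, tag in pairs}

    # (4) Drop infrequent tags; the threshold of 5 was determined
    # empirically in the text.
    tag_counts = Counter(tag for _item, tag in pairs)
    return {(item, tag) for item, tag in pairs if tag_counts[tag] >= min_count}
```

A tag such as "books" would be mapped to "book", and any tag occurring on fewer than five items after projection would be discarded.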
4.2 Detecting Potential Relationships Using Association Rules
Adapting tagged data to market basket analysis requires defining how to build
transactions from tags, which in turn requires defining “co-occurrence”. We ex-
plored three different definitions of co-occurrence. Empirically, we determined
that the most effective co-occurrence definition is the following: tags t and t′ co-occur if both were used to tag the same item (by possibly different users). The frequency of {t, t′} equals the number of distinct items which were assigned both tags t and t′. Our careful study of the best definition of co-occurrence [20] allows us to make better use of association rules than previous approaches, e.g., [23].
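This item-level definition of co-occurrence can be computed directly from the raw assignments. A minimal sketch, with hypothetical (user, item, tag) triples as input:

```python
from collections import defaultdict

# Hypothetical CTS assignments: (user, item, tag) triples.
assignments = [
    ("u1", "item1", "python"),
    ("u2", "item1", "programming"),
    ("u1", "item2", "python"),
    ("u3", "item2", "programming"),
    ("u2", "item3", "python"),
]

# Two tags co-occur if both were applied to the same item,
# possibly by different users.
tags_per_item = defaultdict(set)
for _user, item, tag in assignments:
    tags_per_item[item].add(tag)

# Frequency of {t, t'} = number of distinct items assigned both tags.
pair_freq = defaultdict(int)
for tags in tags_per_item.values():
    for a in tags:
        for b in tags:
            if a < b:  # count each unordered pair once
                pair_freq[(a, b)] += 1

print(pair_freq[("programming", "python")])  # 2 items carry both tags
```

Note that the users are discarded: item1 gets the pair {python, programming} even though the two tags came from different users.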
We use the FP-tree association rule mining algorithm [8] to extract frequent tag sets 4 and interesting rules from the set of transactions. The support of a tag set X is the proportion of transactions containing tag set X, and the confidence of a rule is defined as confidence(X → Y) := support(X ∪ Y) / support(X); i.e., the proportion of transactions in which X and Y occur together among those in which X appears. In this paper, we refer to this well-known definition of confidence as forward confidence (FC). We also introduce a new notion, reverse confidence (RC), as follows: reverse confidence(X → Y) := support(X ∪ Y) / support(Y).
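The two confidence measures can be written out directly from their definitions. A small self-contained sketch, with made-up transactions (each transaction being the set of tags applied to one item):

```python
# Hypothetical transactions: the set of tags applied to each item.
transactions = [
    {"python", "programming"},
    {"python", "programming", "tutorial"},
    {"python"},
    {"programming"},
]

def support(tagset, transactions):
    """Proportion of transactions containing every tag in `tagset`."""
    hits = sum(1 for t in transactions if tagset <= t)
    return hits / len(transactions)

def forward_confidence(x, y, transactions):
    # FC(X -> Y) = support(X ∪ Y) / support(X)
    return support(x | y, transactions) / support(x, transactions)

def reverse_confidence(x, y, transactions):
    # RC(X -> Y) = support(X ∪ Y) / support(Y)
    return support(x | y, transactions) / support(y, transactions)

x, y = {"python"}, {"programming"}
print(forward_confidence(x, y, transactions))  # 0.5 / 0.75 ≈ 0.667
print(reverse_confidence(x, y, transactions))  # 0.5 / 0.75 ≈ 0.667
```

In practice the support and confidence values come out of the FP-tree mining pass rather than this brute-force scan; the sketch only makes the two ratios concrete.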
We assume tags assigned by users tend to contain both a term in the ontology and another term that has a relationship with it. Therefore, if two keywords co-occur frequently, they are likely to be related. We use support to filter sets of tags with a cardinality of two. However, popular unrelated terms may occur together frequently, so we use confidence to remove tuples containing unrelated tags.
Because terms which co-occur with high confidence are sometimes synonyms
3 http://nlp.stanford.edu/software/lex-parser.shtml
4 Tag sets correspond to itemsets in the context of frequent itemset mining.