ONTECTAS: Bridging the Gap between Collaborative Tagging Systems and Structured Data - Advanced Information Systems Engineering

Information Technology Reference

In-Depth Information

remove tuples corresponding to bigrams. Next, we detect headwords of multi-

word tags and use this to infer additional

is-a

relationships. We then use lexico-

syntactic patterns to extract additional

relationships. Finally,

we leverage pairs of tags sharing a common child in the extracted ontology to

infer additional

is-a

and

has-a

is-a

relationships. The next three sections describe the phases.

4.1 Preprocessing

The preprocessing step is primarily a cleaning step. It takes as input a CTS and

performs the following tasks: (1) Any user information is projected away; we

found looking at transactions at the level of group of users was most effective in

ontology extraction. (2) Words with non-English characters are removed from

the input data using the same method as in [5]. This adequately removed non-

English words from all of our datasets. (3) Basic stemming: singular nouns are

substituted for their plural forms. (4) Since tags occurring very infrequently are

not statistically reliable, we removed tags or items that occurred fewer than 5

times. This threshold was determined empirically. (5) Verbs and verb phrases

are removed by applying the Stanford parser 3 to each tag. This prunes tags that

are used for organizing but convey no meaning about the item being tagged [7].

4.2. Detecting Potential Relationships Using Association Rules

Adapting tagged data to market basket analysis requires defining how to build

transactions from tags, which in turn requires defining “co-occurrence”. We ex-

plored three different definitions of co-occurrence. Empirically, we determined

that the most effective co-occurrence definition is the following: Tags t and t

co-occur if both were used to tag the same item (by possibly different users). The

frequency of

t, t }

equals the number of distinct items which were assigned both

tags t and t . Our careful study of the best definition of co-occurrence [20] allows

us to more optimally use association rules than previous approaches, e.g., [23].

We use the FP-tree association rule mining algorithm [8] to extract frequent

tag sets 4 and interesting rules from the set of transactions. The support of a tag

set X is the proportion of transactions containing tag set X and the confidence

of a rule is defined as confidence( X

{

Y ) / support( X ) — i.e.,

the proportion of transactions in which X and Y occur together among those in

which X appears. In this paper, we refer to the well-known definition of confidence

as forward confidence (FC). We also introduce a new notion, reverse confidence

(RC) as follows: reverse confidence( X

⇒

Y ) = support( X

∪

Y ) / support( Y ).

We assume tags assigned by users tend to contain both a term in the ontology

and another term that has relationship with it. Therefore, if two keywords co-

occur frequently, they are likely to be related. We use support to filter sets of tags

with a cardinality of two. However, popular unrelated terms may occur together

frequently, so we use confidence to remove tuples containing unrelated tags.

Because terms which co-occur with high confidence are sometimes synonyms

⇒

Y ) := support( X

∪

3 http://nlp.stanford.edu/software/lex-parser.shtml

4 Tag sets correspond to itemsets in the context of frequent itemset mining.

Advanced Information Systems Engineering

Search WWH ::

Custom Search

Home