Algorithm 2. Association Rule Tuple Detection
Input: (D) A set of 2-tuples in the form of ⟨item, tag⟩
Output: (T) Preliminary tag tuples, (F) Set of frequent itemsets
1: Group D by item. /* create: ⟨item, {tag_1, ..., tag_k}⟩ */
2: S ← Union of tags associated with each item (i.e., S is the set of transactions)
3: F ← Frequent itemsets of size two from S where support > min_support
/* FC_i and RC_i are the forward and reverse confidence, respectively */
4: for all F_i ∈ F do
5:   if ((FC_i ≥ min_conf) and (RC_i ≤ 1 − min_conf)) or ((RC_i ≥ min_conf) and (FC_i ≤ 1 − min_conf)) then
6:     Add F_i to T
7:   end if
8: end for
9: Return T, F
Because synonymous tags have high confidence in both directions (e.g., “os” and “operating system”), we use confidence in the reverse direction to ensure that terms are related with is-a or has-a relationships. Different values for min_support and min_conf can drastically change the size of the ontology; in our experiments these values were chosen empirically. At the end of this step, we have not yet classified the relationships into is-a and has-a.
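For concreteness, the sketch below renders Algorithm 2 in Python with explicit counters; the data layout and function name are illustrative assumptions rather than our actual implementation, and an off-the-shelf frequent-itemset miner could replace the pair counting.

from collections import defaultdict
from itertools import combinations

def association_rule_tuples(D, min_support, min_conf):
    """D is a list of (item, tag) pairs; returns (T, F) as in Algorithm 2."""
    # Steps 1-2: group tags by item; each item's tag set is one transaction.
    transactions = defaultdict(set)
    for item, tag in D:
        transactions[item].add(tag)
    S = list(transactions.values())

    # Step 3: count individual tags and unordered tag pairs over all transactions.
    tag_count = defaultdict(int)
    pair_count = defaultdict(int)
    for tags in S:
        for t in tags:
            tag_count[t] += 1
        for a, b in combinations(sorted(tags), 2):
            pair_count[(a, b)] += 1
    n = len(S)
    F = {pair for pair, c in pair_count.items() if c / n > min_support}

    # Steps 4-8: keep pairs whose confidence is strongly asymmetric, i.e. high
    # in one direction and low in the other, which filters out synonym pairs.
    T = []
    for a, b in F:
        fc = pair_count[(a, b)] / tag_count[a]  # forward confidence: a -> b
        rc = pair_count[(a, b)] / tag_count[b]  # reverse confidence: b -> a
        if (fc >= min_conf and rc <= 1 - min_conf) or \
           (rc >= min_conf and fc <= 1 - min_conf):
            T.append((a, b))
    return T, F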
4.3 Pruning Edges between Bi-gram Elements
In this phase, bi-gram tuples that are merely common phrases are automatically pruned using a search engine. Bi-grams are usually compound nouns in the form of “adjective + noun” (e.g., free software) or “noun + noun” (e.g., web browser). Bi-grams do not contain is-a or has-a relationships, but they are sometimes incorrectly detected as edges of an ontology because they co-occur frequently.
Finding bi-grams using a search engine [26,12,17] has not previously been
applied to extracting relationships between CTS tags. ONTECTAS sends two
keyword queries to a search engine for each relationship tuple (Algorithm 3).
The queries are the quoted permutations of the terms in the tuple. If the ratio of
the number of results returned for the two queries is larger than a threshold, the
terms in the relationship tuple are regarded as a bi-gram. For example, if the relationship tuple is ⟨free, software⟩ or ⟨software, free⟩, the queries are “free software” and “software free”.
Since the ratio is higher than the threshold for this tuple, it is detected as a
bi-gram and pruned. We experimentally found that the optimal threshold for
detecting bi-grams is between 50 and 100. Because words in text documents
have a Zipfian distribution, [12] suggests using a logarithmic transformation of
returned result counts. We found that the logarithmic transformation is also
more accurate in detecting bi-grams.
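As an illustration, this check can be sketched as follows; get_result_count is a hypothetical stand-in for the search-engine API call of Algorithm 3, and the default threshold corresponds to the raw-ratio range reported above (a threshold on the log scale would have to be re-tuned).

import math

def is_bigram(term_a, term_b, get_result_count, threshold=50.0, log_counts=False):
    """Return True if the tuple (term_a, term_b) appears to be a common bi-gram.

    get_result_count(query) is a hypothetical helper that returns the number of
    search-engine hits for a quoted phrase query such as '"free software"'.
    """
    counts = sorted([get_result_count(f'"{term_a} {term_b}"'),
                     get_result_count(f'"{term_b} {term_a}"')])
    low, high = counts
    if low == 0:
        # One word order never occurs together; treat as a bi-gram if the other does.
        return high > 0
    if log_counts:
        # Log-transform the counts to compensate for the Zipfian distribution of
        # word frequencies [12]; the threshold must then be chosen on this scale.
        low, high = math.log(low + 1), math.log(high + 1)
    return high / low > threshold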
4.4 Detecting Headwords in Multi-word Tags
Since many CTS tags are multi-word tags in the form of compound phrases, such as “science-fiction” and “object-oriented-data-model”, we use headword detection
 