Algorithm 2. Association Rule Tuple Detection
Input: (D) A set of 2-tuples in the form of ⟨item, tag⟩
Output: (T) Preliminary tag tuples, (F) Set of frequent itemsets
1: Group D by item. /* create: ⟨item, {tag_1, ..., tag_k}⟩ */
2: S ← Union of tags associated with each item (i.e., S is the set of transactions)
3: F ← Frequent itemsets of size two from S where support > min_support
/* FC_i and RC_i are the forward and reverse confidence, respectively */
4: for all F_i ∈ F do
5:   if ((FC_i ≥ min_conf) and (RC_i ≤ 1 − min_conf)) or ((RC_i ≥ min_conf) and (FC_i ≤ 1 − min_conf)) then
6:     Add F_i to T
7:   end if
8: end for
9: Return T, F
Because synonymous tags have high confidence in both directions (e.g., “os” and “operating system”), we use confidence in the reverse direction to ensure that terms are related with is-a or has-a relationships. Different values for min_support and min_conf can drastically change the size of the ontology; in our experiments these values were chosen empirically. At the end of this step, we have not yet classified the relationships into is-a and has-a.
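For concreteness, the sketch below renders Algorithm 2 in Python with explicit counters; the data layout and function name are illustrative assumptions rather than our actual implementation, and an off-the-shelf frequent-itemset miner could replace the pair counting.

from collections import defaultdict
from itertools import combinations

def association_rule_tuples(D, min_support, min_conf):
    """D is a list of (item, tag) pairs; returns (T, F) as in Algorithm 2."""
    # Steps 1-2: group tags by item; each item's tag set is one transaction.
    transactions = defaultdict(set)
    for item, tag in D:
        transactions[item].add(tag)
    S = list(transactions.values())

    # Step 3: count individual tags and unordered tag pairs over all transactions.
    tag_count = defaultdict(int)
    pair_count = defaultdict(int)
    for tags in S:
        for t in tags:
            tag_count[t] += 1
        for a, b in combinations(sorted(tags), 2):
            pair_count[(a, b)] += 1
    n = len(S)
    F = {pair for pair, c in pair_count.items() if c / n > min_support}

    # Steps 4-8: keep pairs whose confidence is strongly asymmetric, i.e. high
    # in one direction and low in the other, which filters out synonym pairs.
    T = []
    for a, b in F:
        fc = pair_count[(a, b)] / tag_count[a]  # forward confidence: a -> b
        rc = pair_count[(a, b)] / tag_count[b]  # reverse confidence: b -> a
        if (fc >= min_conf and rc <= 1 - min_conf) or \
           (rc >= min_conf and fc <= 1 - min_conf):
            T.append((a, b))
    return T, F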
4.3 Pruning Edges between Bi-gram Elements
In this phase, bi-gram tuples that are merely common phrases are automatically pruned using a search engine. Bi-grams are usually compound nouns in the form of “adjective + noun” (e.g., free software) or “noun + noun” (e.g., web browser). Bi-grams do not contain is-a or has-a relationships, but they are sometimes incorrectly detected as edges of an ontology because they co-occur frequently.
Finding bi-grams using a search engine [26,12,17] has not previously been
applied to extracting relationships between CTS tags. ONTECTAS sends two
keyword queries to a search engine for each relationship tuple (Algorithm 3).
The queries are the quoted permutations of the terms in the tuple. If the ratio of
the number of results returned for the two queries is larger than a threshold, the
terms in the relationship tuple are regarded as a bi-gram. For example, if the relationship tuple is ⟨free, software⟩ or ⟨software, free⟩, the queries are “free software” and “software free”.
Since the ratio is higher than the threshold for this tuple, it is detected as a
bi-gram and pruned. We experimentally found that the optimal threshold for
detecting bi-grams is between 50 and 100. Because words in text documents
have a Zipfian distribution, [12] suggests using a logarithmic transformation of
returned result counts. We found that the logarithmic transformation is also
more accurate in detecting bi-grams.
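As an illustration, this check can be sketched as follows; get_result_count is a hypothetical stand-in for the search-engine API call of Algorithm 3, and the default threshold corresponds to the raw-ratio range reported above (a threshold on the log scale would have to be re-tuned).

import math

def is_bigram(term_a, term_b, get_result_count, threshold=50.0, log_counts=False):
    """Return True if the tuple (term_a, term_b) appears to be a common bi-gram.

    get_result_count(query) is a hypothetical helper that returns the number of
    search-engine hits for a quoted phrase query such as '"free software"'.
    """
    counts = sorted([get_result_count(f'"{term_a} {term_b}"'),
                     get_result_count(f'"{term_b} {term_a}"')])
    low, high = counts
    if low == 0:
        # One word order never occurs together; treat as a bi-gram if the other does.
        return high > 0
    if log_counts:
        # Log-transform the counts to compensate for the Zipfian distribution of
        # word frequencies [12]; the threshold must then be chosen on this scale.
        low, high = math.log(low + 1), math.log(high + 1)
    return high / low > threshold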
4.4 Detecting Headwords in Multi-word Tags
Since many CTS tags are multi-word tags in the form of compound phrases, such as “science-fiction” and “object-oriented-data-model”, we use headword detection
 