Database Reference
In-Depth Information
Item
a
Expected support
( a :0.2):1
( a :0.6):2
( a :0.9):1
2.3
b
2.2
c
c
1 8
.
( b :0.9):1
( b :0.6):1
( b :0.5):1
( b :0.2):1
d
1.4
e
1.0
( c :0.4):1
( c :0.6):1
( d :0.5):1
( c :0.8):1
( d :0.9):1
( e :0.7):1
7):1
(( e :0.3):1
Fig. 14.2 The UF-tree for the probabilistic dataset D 2 of uncertain data
each frequent item—i.e., each singleton pattern) and to enumerate frequent patterns
for higher levels (by traversing the tree paths and decrementing the occurrence counts)
during the mining process.
On the one hand, as paths in a UF-tree are shared only if they have the same item
and the same existential probability, the UF-tree accurately captures the contents
(especially, the existential probabilities) of the probabilistic datasets of uncertain
data so that frequent patterns can be mined without producing false positives or false
negatives. On the other hand, the UF-tree may be large and may not be as compact
as its counterpart for precise data (i.e., FP-tree).
5.2
UFP-growth
To make the tree more compact by reducing the tree size (via a reduction in the
number of tree nodes), Aggarwal et al. [ 13 ] proposed the UFP-growth algorithm .
Like UF-growth, the UFP-growth algorithm also scans the probabilistic dataset of
uncertain data twice and builds a UFP-tree . As nodes for item x having similar
existential probability values are clustered into a mega-node, the resulting mega-
node in the UFP-tree captures (i) an item x , (ii) the maximum existential probability
value (among all nodes within the cluster), and (iii) its occurrence count (i.e., the
number of nodes within the cluster). Tree paths are shared if the nodes on these paths
share the same item but similar existential probability values. In other words, the
path sharing condition is less restrictive than that of the UF-tree. See Fig. 14.3 .By
extracting appropriate tree paths and constructing UFP-trees for subsequent projected
databases, UFP-growth finds all truly frequent patterns at the end of the second
scan of the probabilistic dataset of uncertain data. At the same time, due to the
approximate nature (e.g., caused by the use of the maximum existential probability
value among all the nodes clustered into a mega-node) of UFP-growth, UFP-growth
also finds some infrequent patterns (i.e., some false positives ) in addition to those
truly frequent patterns (i.e., true positives). Hence, a third scan of the probabilistic
dataset of uncertain data is then required to remove these false positives.
Search WWH ::




Custom Search