Discussion on Uncertain Frequent Pattern Mining
So far, we have described various uncertain frequent pattern mining algorithms. Tong
et al. [50] compared some of these algorithms.
In terms of functionality, the U-Apriori, UH-mine, UF-growth, UFP-growth,
CUF-growth, PUF-growth, MR-growth, U-Eclat, UV-Eclat and U-VIPER algo-
rithms all mine static datasets of uncertain data. Among them, the first seven mine the
datasets horizontally, whereas the remaining three algorithms mine the datasets verti-
cally. In contrast, the SUF-growth, UF-streaming, TUF-streaming, LUF-streaming,
UHS-Stream and TFUHS-Stream algorithms mine dynamic streaming uncertain
data. Unlike these 16 algorithms, which find all frequent patterns, the U-FPS and U-FIC
algorithms find only those frequent patterns that satisfy the user-specified constraints.
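To illustrate the horizontal versus vertical distinction, the following sketch (toy data and function names are assumptions for illustration, not code from any of the cited algorithms) converts a horizontal uncertain database, where each transaction lists its items with their existential probabilities, into a vertical layout, where each item maps to the transactions containing it:

```python
# Hypothetical toy uncertain database: tid -> {item: existential probability}.
# Horizontal miners (e.g., U-Apriori, UF-growth) scan these transactions;
# vertical miners (e.g., U-Eclat, U-VIPER) work on per-item lists instead.
horizontal_db = {
    1: {"a": 0.9, "b": 0.4},
    2: {"a": 0.7, "c": 1.0},
    3: {"b": 0.4, "c": 0.6},
}

def to_vertical(db):
    """Build item -> {tid: probability} from a horizontal uncertain database."""
    vertical = {}
    for tid, items in db.items():
        for item, prob in items.items():
            vertical.setdefault(item, {})[tid] = prob
    return vertical

vertical_db = to_vertical(horizontal_db)
# vertical_db["a"] == {1: 0.9, 2: 0.7}
```

Both layouts carry exactly the same information; they differ only in which axis (transaction or item) the mining algorithm iterates over.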
In terms of accuracy, most algorithms return all the patterns whose expected sup-
port (over all “possible worlds”) meets or exceeds the user-specified threshold
minsup. In contrast, U-Eclat returns patterns whose estimated support (over only the
sampled “possible worlds”) meets or exceeds minsup. Hence, U-Eclat may intro-
duce false positives (when the support is overestimated) or false negatives (when the
support is underestimated). More instantiations (i.e., more samples) help improve
accuracy.
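The difference between the two support notions can be sketched as follows (a toy Python illustration assuming independent item probabilities; the sampling routine only mimics the U-Eclat idea and is not that algorithm's implementation):

```python
import random

# Toy uncertain database: each transaction maps item -> existential probability.
db = [
    {"a": 0.9, "b": 0.8},
    {"a": 0.5, "b": 0.5},
    {"a": 1.0},
]

def expected_support(db, pattern):
    """Exact expected support over all possible worlds: for each transaction
    containing the pattern, add the product of its items' probabilities."""
    total = 0.0
    for txn in db:
        if all(item in txn for item in pattern):
            p = 1.0
            for item in pattern:
                p *= txn[item]
            total += p
    return total

def estimated_support(db, pattern, n_samples, rng):
    """Average support over n_samples randomly instantiated possible worlds;
    any single run may over- or underestimate the exact value."""
    count = 0
    for _ in range(n_samples):
        for txn in db:
            if all(item in txn and rng.random() < txn[item] for item in pattern):
                count += 1
    return count / n_samples

exact = expected_support(db, {"a", "b"})   # 0.9*0.8 + 0.5*0.5 = 0.97
approx = estimated_support(db, {"a", "b"}, 10000, random.Random(42))
# approx converges toward exact as the number of sampled worlds grows
```

With few samples the estimate can cross the minsup threshold in either direction, which is exactly how the false positives and false negatives mentioned above arise.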
In terms of memory consumption, U-Apriori keeps a list of candidate patterns,
whereas the tree-based and hyperlinked-structure-based algorithms construct in-
memory structures (e.g., the UF-tree and its variants, the extended H-struct). On the one
hand, a UF-tree is more compact (i.e., requires less space) than the extended H-struct.
On the other hand, UH-mine keeps only one extended H-struct, whereas tree-based
algorithms usually construct more than one tree. Sizes of the trees may also vary. For
instance, when U-FPS handles a succinct and anti-monotone constraint C_SAM, the
tree size depends on the selectivity of C_SAM because only those items that individually
satisfy C_SAM are stored in the UF-tree. Both the CUF-tree and the PUF-tree (for uncertain
data) can be as compact as an FP-tree (for precise data). When SUF-growth handles
streams, the tree size depends on the size of the sliding window (e.g., a window of w
batches) because a list of w occurrence counts is captured in each node of a SUF-tree
(cf. only one occurrence count is captured in each node of a UF-tree). Moreover, when
items in probabilistic datasets take on only a few distinct existential probability values,
the trees contain fewer nodes (cf. the number of distinct existential probability values
does not affect the size of candidate lists or the extended H-struct). Furthermore,
minsup and density also affect memory consumption. For instance, for a sparse
dataset called kosarak, different winners (requiring the least space) have been shown
for different minsup: U-Apriori when minsup < 0.15 %, UH-mine when
0.15 % ≤ minsup < 0.5 %, and UF-growth when 0.5 % ≤ minsup; for a dense dataset called
connect4, UH-mine was the winner for 0.2 % ≤ minsup < 0.8 %.
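The effect of distinct existential probability values on tree size can be sketched with a toy path-sharing count (a simplification under my own assumption that nodes are keyed by the pair of item and probability, so a transaction item shares an existing node only when both match; real UF-tree variants refine this):

```python
class Node:
    """Minimal prefix-tree node; children keyed by (item, probability)."""
    def __init__(self):
        self.children = {}

def count_tree_nodes(db):
    """Insert each transaction as a sorted path of (item, probability) keys
    and return the number of nodes created (excluding the root)."""
    root = Node()
    n_nodes = 0
    for txn in db:
        cur = root
        for item in sorted(txn):
            key = (item, txn[item])
            if key not in cur.children:
                cur.children[key] = Node()
                n_nodes += 1
            cur = cur.children[key]
    return n_nodes

# Toy datasets: same items, but many vs. few distinct probability values.
many = [{"a": 0.91, "b": 0.42}, {"a": 0.93, "b": 0.44}, {"a": 0.95, "b": 0.46}]
few  = [{"a": 0.9,  "b": 0.4},  {"a": 0.9,  "b": 0.4},  {"a": 0.9,  "b": 0.4}]
# count_tree_nodes(many) == 6, count_tree_nodes(few) == 2
```

With only one distinct probability per item, all three transactions collapse onto a single shared path, which is why fewer distinct values yield smaller trees, while candidate lists and the extended H-struct are unaffected.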
In terms of performance, most algorithms perform well when items in probabilistic
datasets take on low existential probability values because such datasets do not lead
to long frequent patterns. When items in probabilistic datasets take on high existential
probability values, more candidates are generated and tested by U-Apriori, more and