Discussion on Uncertain Frequent Pattern Mining
So far, we have described various uncertain frequent pattern mining algorithms. Tong
et al. [50] compared some of these algorithms.
In terms of functionality, the U-Apriori, UH-mine, UF-growth, UFP-growth,
CUF-growth, PUF-growth, MR-growth, U-Eclat, UV-Eclat and U-VIPER algo-
rithms all mine static datasets of uncertain data. Among them, the first seven mine the
datasets horizontally, whereas the remaining three algorithms mine the datasets verti-
cally. In contrast, the SUF-growth, UF-streaming, TUF-streaming, LUF-streaming,
UHS-Stream and TFUHS-Stream algorithms mine dynamic streaming uncertain
data. Unlike these 16 algorithms, which find all frequent patterns, the U-FPS and U-FIC
algorithms find only those frequent patterns that satisfy the user-specified constraints.
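To illustrate the horizontal versus vertical distinction, the following sketch (toy data and function names are assumptions for illustration, not code from any of the cited algorithms) converts a horizontal uncertain database, where each transaction lists its items with their existential probabilities, into a vertical layout, where each item maps to the transactions containing it:

```python
# Hypothetical toy uncertain database: tid -> {item: existential probability}.
# Horizontal miners (e.g., U-Apriori, UF-growth) scan these transactions;
# vertical miners (e.g., U-Eclat, U-VIPER) work on per-item lists instead.
horizontal_db = {
    1: {"a": 0.9, "b": 0.4},
    2: {"a": 0.7, "c": 1.0},
    3: {"b": 0.4, "c": 0.6},
}

def to_vertical(db):
    """Build item -> {tid: probability} from a horizontal uncertain database."""
    vertical = {}
    for tid, items in db.items():
        for item, prob in items.items():
            vertical.setdefault(item, {})[tid] = prob
    return vertical

vertical_db = to_vertical(horizontal_db)
# vertical_db["a"] == {1: 0.9, 2: 0.7}
```

Both layouts carry exactly the same information; they differ only in which axis (transaction or item) the mining algorithm iterates over.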
In terms of accuracy, most algorithms return all the patterns whose expected sup-
port (over all “possible worlds”) meets or exceeds the user-specified threshold
minsup. In contrast, U-Eclat returns patterns whose estimated support (over only the
sampled “possible worlds”) meets or exceeds minsup. Hence, U-Eclat may intro-
duce false positives (when the support is overestimated) or false negatives (when the
support is underestimated). More instantiations (i.e., more samples) help improve
accuracy.
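The difference between the two support notions can be sketched as follows (a toy Python illustration assuming independent item probabilities; the sampling routine only mimics the U-Eclat idea and is not that algorithm's implementation):

```python
import random

# Toy uncertain database: each transaction maps item -> existential probability.
db = [
    {"a": 0.9, "b": 0.8},
    {"a": 0.5, "b": 0.5},
    {"a": 1.0},
]

def expected_support(db, pattern):
    """Exact expected support over all possible worlds: for each transaction
    containing the pattern, add the product of its items' probabilities."""
    total = 0.0
    for txn in db:
        if all(item in txn for item in pattern):
            p = 1.0
            for item in pattern:
                p *= txn[item]
            total += p
    return total

def estimated_support(db, pattern, n_samples, rng):
    """Average support over n_samples randomly instantiated possible worlds;
    any single run may over- or underestimate the exact value."""
    count = 0
    for _ in range(n_samples):
        for txn in db:
            if all(item in txn and rng.random() < txn[item] for item in pattern):
                count += 1
    return count / n_samples

exact = expected_support(db, {"a", "b"})   # 0.9*0.8 + 0.5*0.5 = 0.97
approx = estimated_support(db, {"a", "b"}, 10000, random.Random(42))
# approx converges toward exact as the number of sampled worlds grows
```

With few samples the estimate can cross the minsup threshold in either direction, which is exactly how the false positives and false negatives mentioned above arise.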
In terms of memory consumption, U-Apriori keeps a list of candidate patterns,
whereas the tree-based and hyperlinked-structure-based algorithms construct in-
memory structures (e.g., the UF-tree and its variants, the extended H-struct). On the one
hand, a UF-tree is more compact (i.e., requires less space) than the extended H-struct.
On the other hand, UH-mine keeps only one extended H-struct, whereas tree-based
algorithms usually construct more than one tree. Sizes of the trees may also vary. For
instance, when U-FPS handles a succinct and anti-monotone constraint C_SAM, the
tree size depends on the selectivity of C_SAM because only those items that individually
satisfy C_SAM are stored in the UF-tree. Both the CUF-tree and the PUF-tree (for uncertain
data) can be as compact as an FP-tree (for precise data). When SUF-growth handles
streams, the tree size depends on the size of the sliding window (e.g., a window of w
batches) because a list of w occurrence counts is captured in each node of a SUF-tree
(cf. only one occurrence count is captured in each node of a UF-tree). Moreover, when
items in probabilistic datasets take on only a few distinct existential probability values,
the trees contain fewer nodes (cf. the number of distinct existential probability values
does not affect the size of candidate lists or the extended H-struct). Furthermore,
minsup and density also affect memory consumption. For instance, for a sparse
dataset called kosarak, different winners (requiring the least space) have been shown
for different minsup: U-Apriori when minsup < 0.15 %, UH-mine when
0.15 % ≤ minsup < 0.5 %, and UF-growth when 0.5 % ≤ minsup; for a dense dataset called
connect4, UH-mine was the winner for 0.2 % ≤ minsup < 0.8 %.
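The effect of distinct existential probability values on tree size can be sketched with a toy path-sharing count (a simplification under my own assumption that nodes are keyed by the pair of item and probability, so a transaction item shares an existing node only when both match; real UF-tree variants refine this):

```python
class Node:
    """Minimal prefix-tree node; children keyed by (item, probability)."""
    def __init__(self):
        self.children = {}

def count_tree_nodes(db):
    """Insert each transaction as a sorted path of (item, probability) keys
    and return the number of nodes created (excluding the root)."""
    root = Node()
    n_nodes = 0
    for txn in db:
        cur = root
        for item in sorted(txn):
            key = (item, txn[item])
            if key not in cur.children:
                cur.children[key] = Node()
                n_nodes += 1
            cur = cur.children[key]
    return n_nodes

# Toy datasets: same items, but many vs. few distinct probability values.
many = [{"a": 0.91, "b": 0.42}, {"a": 0.93, "b": 0.44}, {"a": 0.95, "b": 0.46}]
few  = [{"a": 0.9,  "b": 0.4},  {"a": 0.9,  "b": 0.4},  {"a": 0.9,  "b": 0.4}]
# count_tree_nodes(many) == 6, count_tree_nodes(few) == 2
```

With only one distinct probability per item, all three transactions collapse onto a single shared path, which is why fewer distinct values yield smaller trees, while candidate lists and the extended H-struct are unaffected.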
In terms of performance, most algorithms perform well when items in probabilistic
datasets take on low existential probability values because such datasets do not lead
to long frequent patterns. When items in probabilistic datasets take on high existential
probability values, more candidates are generated and tested by U-Apriori, more and