Database Reference
In-Depth Information
bigger UF-trees are constructed by UF-growth, more hyperlinks are adjusted by UH-
mine, and more estimated supports are computed by U-Eclat. Hence, longer runtimes
are required. Similarly, when minsup decreases, more frequent patterns are returned
and longer runtimes are also required. Moreover, the density of datasets also affects
runtimes. For instance, when datasets are dense (e.g., connect4 augmented with
existential probability), UF-trees obtain higher compression ratios and thus require
less time to traverse than sparse datasets (e.g., kosarak augmented with existential
probability). Some experimental results showed the following: (i) datasets with a
low number of distinct existential probabilities led to smaller UF-trees and shorter
runtimes for UF-growth (than U-Apriori); (ii) U-Apriori requires shorter runtimes
than UH-mine when minsup was low (e.g., minsup < 0.3 % for kosarak, minsup <
0.6 % for connect4) but vice versa when minsup was high; (iii) depending on the
number of samples, U-Eclat could take longer or shorter to run than U-Apriori.
11
Extension: Probabilistic Frequent Pattern Mining
The aforementioned algorithms all find (expected support-based) frequent patterns
from uncertain data. These are patterns with expected support meeting or exceed-
ing the user-specified threshold minsup . Note that expected support of a pattern
X provides users with frequency information of X summarized over all “possible
worlds”, but it does not reveal the confidence on the likelihood of X being frequent
(i.e., percentage of “possible worlds” in which X is frequent). However, know-
ing the confidence can be helpful in some applications. Hence, in recent years,
there is also algorithmic development on extending the notion of frequent patterns
based on expected support to useful patterns—such as probabilistic heavy hitters and
probabilistic frequent patterns as described below.
11.1
Mining Probabilistic Heavy Hitters
Although the expected support of an item x (i.e., a singleton pattern
) provides
users with an estimate of the frequency of x in many real-life applications, it is also
helpful to know the confidence on the likelihood of x being frequent in the uncertain
data in some other applications. Hence, Zhang et al. [ 56 ] formalized the notion of
probabilistic heavy hitters (i.e., probabilistic frequent items , which are also known as
probabilistic frequent singleton patterns) following the “possible world” semantics
[ 21 ] for probabilistic datasets of uncertain data.
{
x
}
Definition 14.4 Given (i) a probabilistic dataset D of uncertain data, (ii) a user-
specified support threshold φ , and (iii) a user-specified frequentness probability
threshold τ , the problem of mining probabilistic heavy hitters from uncertain
data is to find all ( φ , τ )-probabilistic heavy hitters (PHHs). An item x isa( φ , τ )-
probabilistic heavy hitter (PHH) if P ( sup ( x , W j )
|
W j |
) (where sup ( x , W j )
is the support of x in a random possible world W j and
| W j |
is the number of items
Search WWH ::




Custom Search