records and reducing the total number of records. The sampling algorithm relies on the concept of the negative boundary, which exploits the downward closure of frequent itemsets. Suppose S is a downward closed subset of the power set of an itemset I. If every subset of an itemset s is subsumed by S while s itself is not subsumed by S, then s is an element of the negative boundary; all itemsets with this property form the negative boundary, marked as Bd(S). For example, if I = {A,B,C,D,E} and S = {{A},{B},{C},{E},{A,B},{A,C},{A,E},{C,E},{A,C,E}}, then Bd(S) = {{B,C},{B,E},{D}}. S is a superset of the high-frequency itemsets if none of the sets in Bd(S) has high frequency.
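To make the definition concrete, the following sketch (not part of the original text) enumerates Bd(S) for the example above by checking, for every itemset over I, whether it lies outside S while all of its immediate subsets lie inside S; the function and variable names are our own.

```python
from itertools import combinations

def negative_border(items, S):
    """Negative boundary Bd(S): itemsets not in S all of whose
    immediate subsets are in S (S is assumed downward closed)."""
    S = {frozenset(x) for x in S}
    S.add(frozenset())          # the empty set counts as trivially frequent
    border = set()
    for k in range(1, len(items) + 1):
        for cand in combinations(sorted(items), k):
            cand = frozenset(cand)
            if cand not in S and all(frozenset(sub) in S
                                     for sub in combinations(cand, k - 1)):
                border.add(cand)
    return border

# The example from the text: I = {A,B,C,D,E} with the given S.
I = {"A", "B", "C", "D", "E"}
S = [{"A"}, {"B"}, {"C"}, {"E"},
     {"A", "B"}, {"A", "C"}, {"A", "E"}, {"C", "E"}, {"A", "C", "E"}]
print(sorted(sorted(x) for x in negative_border(I, S)))
# prints [['B', 'C'], ['B', 'E'], ['D']]
```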
For the database DB in Figure 12.1, we can get the candidate itemsets C1 by counting the number of occurrences of every item during the first scan of the whole set of transactions. If the minimum support is 2, the one-dimension large itemsets L1 are produced. Because every subset of a large itemset must itself satisfy the minimum support, Apriori calculates C2 using L1 * L1, where the operation * is defined as

Lk * Lk = { X ∪ Y | X, Y ∈ Lk, |X ∩ Y| = k − 1 }

The candidate itemsets C2 are thus produced, and L2 is obtained by applying the minimum support. From L2 to C3, the two itemsets with the same head item, {B,C} and {B,E}, are joined, and the rear itemset {C,E} is tested against the minimum support. Because {C,E} satisfies the minimum support, all two-dimension subsets of {B,C,E} are large itemsets, so {B,C,E} is a candidate itemset; C3 is confirmed and L3 is obtained. Since no higher-dimension candidates can be generated from L3, there are no higher-dimension itemsets any more, and the whole set of large itemsets is confirmed.
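The candidate generation just described can be sketched as follows. Since the transactions of Figure 12.1 are not reproduced in this excerpt, the small database below is an assumed stand-in chosen so that, as in the text, {B,C}, {B,E} and {C,E} are among the frequent pairs and {B,C,E} is the only three-dimension large itemset; the join over Lk and the support filtering follow the * operation defined above.

```python
from itertools import combinations

def apriori_gen(Lk, k):
    """Join Lk with itself (sets sharing k-1 items), then prune any
    candidate that has an infrequent k-subset."""
    Lk = set(Lk)
    candidates = set()
    for X in Lk:
        for Y in Lk:
            if X != Y and len(X & Y) == k - 1:       # |X ∩ Y| = k-1
                cand = X | Y                          # X ∪ Y, size k+1
                if all(frozenset(sub) in Lk
                       for sub in combinations(cand, k)):
                    candidates.add(cand)
    return candidates

def support_filter(candidates, transactions, minsup):
    """Keep the candidates whose support reaches minsup."""
    return {c for c in candidates
            if sum(c <= t for t in transactions) >= minsup}

# Assumed stand-in for the transactions of Figure 12.1.
transactions = [frozenset(t) for t in
                ({"A", "C", "D"}, {"B", "C", "E"},
                 {"A", "B", "C", "E"}, {"B", "E"})]
minsup = 2

items = {i for t in transactions for i in t}
Lk = support_filter({frozenset({i}) for i in items}, transactions, minsup)
k = 1
while Lk:
    print(f"L{k}:", sorted(sorted(x) for x in Lk))
    Lk = support_filter(apriori_gen(Lk, k), transactions, minsup)  # Lk+1
    k += 1
```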
Besides the Apriori algorithm, the sampling algorithm and the DIC algorithm are also classical association rule mining algorithms. The main idea of the sampling algorithm is to define a value lowsup that is less than minsup and, with the help of sampling, to calculate on the sampled data db the itemsets X with Support_db(X) ≥ lowsup. This collection of itemsets is marked as S and is supposed to be a superset of the frequent itemsets. The negative boundary Bd(S) is then calculated, DB is scanned, and the supports of S and Bd(S) are computed. If no set in Bd(S) is a frequent itemset, then S is indeed a superset of the frequent itemsets. Otherwise failure is reported: the frequent itemsets found in Bd(S) are added into S, the negative boundary of the enlarged S is calculated, DB is scanned again, and so on, until no frequent itemsets can be added into S.
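A minimal sketch of this loop is given below; it reuses the negative_border helper from the first sketch, and the brute-force sample miner, the list representation of DB, and the parameters lowsup and sample_size are illustrative assumptions rather than the prescribed implementation.

```python
import random
from itertools import combinations

def support(itemset, transactions):
    """Number of transactions containing `itemset`."""
    return sum(itemset <= t for t in transactions)

def mine_sample(sample, items, lowsup):
    """Brute-force pass over the small sample: every itemset whose
    support in the sample reaches the lowered threshold lowsup."""
    return {frozenset(c)
            for k in range(1, len(items) + 1)
            for c in combinations(sorted(items), k)
            if support(frozenset(c), sample) >= lowsup}

def sampling_algorithm(DB, items, minsup, lowsup, sample_size):
    """Mine a random sample with lowsup, then verify S and Bd(S) against
    DB (a list of frozenset transactions), expanding S until no set in
    the negative boundary is frequent."""
    S = mine_sample(random.sample(DB, sample_size), items, lowsup)
    while True:
        # One scan of DB gives the supports of Bd(S) (negative_border is
        # the helper defined in the first sketch above).
        missed = {c for c in negative_border(items, S)
                  if support(c, DB) >= minsup}
        if not missed:
            # Nothing frequent was overlooked: the truly frequent members
            # of S are the answer.
            return {c for c in S if support(c, DB) >= minsup}
        # Otherwise failure is reported; the missed frequent itemsets are
        # added to S, the boundary recomputed, and DB scanned again.
        S |= missed
```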