Input: i = 1, j = 1, sequence x, pattern t
Output: sum, the number of occurrences of t included in x

number(String[] x, int i, String[] t, int j)
{
    if (i = 0 AND j = 0) sum = 0;
    if (x[i] = ("*")) i++; // If we have a star, skip it; it was already used
    // If the star was the last character, we found another match.
    if (i = m AND x[i] = ("*")) return ++sum;
    if (j = n) return sum;
    // The "i > 0" test simulates a starting star.
    if (i > 0 AND x[i - 1] = ("*"))
    {
        if (x[i] = (t[j]) AND i = (m - 1))
        {
            sum++;
        }
        else if (x[i] = (t[j]))
        {
            number(x, i + 1, t, j + 1);
        }
    }
    else
    {
        for (int p = j; p < n; p++)
        {
            if (x[i] = (t[p]) AND i = (m - 1))
            {
                sum++;
            }
            else if (x[i] = (t[p]))
            {
                number(x, i + 1, t, p + 1);
            }
        }
    }
    return sum;
}
Fig. 4. An algorithm to find how frequently each pattern occurs within a certain number of records
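Because Fig. 4 is reproduced from a fragmentary source, the sketch below gives a simplified, self-contained point of comparison: it counts how many times a pattern t occurs as a (not necessarily contiguous) subsequence of a record x. The class and method names are hypothetical, and the star-skipping bookkeeping of Fig. 4 is deliberately omitted.

```java
// Hypothetical sketch: count occurrences of pattern t inside record x.
public class PatternCount {
    // Returns the number of ways t[j..] can be matched within x[i..].
    public static int count(String[] x, String[] t, int i, int j) {
        if (j == t.length) return 1;   // pattern fully matched: one occurrence
        if (i == x.length) return 0;   // record exhausted before the pattern
        int c = count(x, t, i + 1, j);              // skip x[i]
        if (x[i].equals(t[j]))
            c += count(x, t, i + 1, j + 1);         // consume x[i] as t[j]
        return c;
    }

    public static void main(String[] args) {
        String[] record  = {"a", "b", "a", "b"};
        String[] pattern = {"a", "b"};
        System.out.println(count(record, pattern, 0, 0)); // prints 3
    }
}
```

The recursion branches at every position: either x[i] is skipped, or, when it equals t[j], it is consumed as the next pattern element, so every distinct embedding of t in x is counted exactly once.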
first (k-2) items are similar. This set of candidates is denoted C(k,0). Second, the prune step: C(k,0) is a superset of L(k,0); that is, its elements may or may not be frequent, but all of the frequent k-itemsets are included in C(k,0), even if C(k,0) is very large. In fact, any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset. Hence, if any (k-1)-subset of a candidate k-itemset is not in L(k-1,0), then the candidate cannot be frequent either, and so can be removed from C(k,0). Suppose there are n records in the original data set; to find all n large sequences, the number of connections will be 2^n. To build the signature of an attack with around 100 records, this structure is not suitable.
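The join and prune steps described above can be sketched as follows. This is a minimal illustration with hypothetical names, not the chapter's implementation: join merges frequent (k-1)-itemsets that agree on their first k-2 items (assuming itemsets are kept in lexicographic order), and prune drops any candidate that has an infrequent (k-1)-subset.

```java
import java.util.*;

public class AprioriPrune {
    // Join step: combine frequent (k-1)-itemsets whose first k-2 items agree.
    public static List<List<String>> join(List<List<String>> Lprev) {
        List<List<String>> candidates = new ArrayList<>();
        for (int a = 0; a < Lprev.size(); a++)
            for (int b = a + 1; b < Lprev.size(); b++) {
                List<String> p = Lprev.get(a), q = Lprev.get(b);
                int k1 = p.size();
                if (p.subList(0, k1 - 1).equals(q.subList(0, k1 - 1))) {
                    List<String> c = new ArrayList<>(p);
                    c.add(q.get(k1 - 1));  // extend p by q's last item
                    candidates.add(c);
                }
            }
        return candidates;
    }

    // Prune step: drop any candidate with a (k-1)-subset not in Lprev.
    public static List<List<String>> prune(List<List<String>> cands,
                                           List<List<String>> Lprev) {
        List<List<String>> kept = new ArrayList<>();
        for (List<String> c : cands) {
            boolean ok = true;
            for (int drop = 0; drop < c.size(); drop++) {
                List<String> sub = new ArrayList<>(c);
                sub.remove(drop);                  // one (k-1)-subset of c
                if (!Lprev.contains(sub)) { ok = false; break; }
            }
            if (ok) kept.add(c);
        }
        return kept;
    }

    public static void main(String[] args) {
        List<List<String>> L2 = List.of(
            List.of("A", "B"), List.of("A", "C"),
            List.of("B", "C"), List.of("B", "D"));
        List<List<String>> C3 = join(L2);   // {A,B,C} and {B,C,D}
        // {B,C,D} is pruned because its subset {C,D} is not in L2.
        System.out.println(prune(C3, L2));  // prints [[A, B, C]]
    }
}
```

Note how the prune step never touches the records themselves: infrequent candidates are rejected purely from L(k-1,0), which is exactly the downward-closure argument made in the text.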
In contrast, when we search for the candidate 1-itemsets, C(1,0), with our proposed algorithm, we need to scan the original records once and count all items, the same as the Apriori algorithm. When searching for the frequent 1-itemsets, L(1,0), instead of scanning the original records, we only need to scan C(1,0), which is composed of original records and is much smaller than the original data. After generating all L(k,0), the original records have been scanned once and every C(k,0) has also been scanned once; in total, k scans are performed. Since any L(k,l) is extracted from the corresponding L(k,0), we only need to scan the data stored in the temporary database instead of the corresponding C(k,0) or the original records, so the data quantity is evidently reduced. Most importantly, when C(2,0) is taken out, we only scan the corresponding L(1,0) in the temporary database that may compose it; then the other C(2,0) candidates are taken out in turn.
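The single scan underlying these counts can be sketched as follows (a minimal illustration with hypothetical names): one pass over the stored records increments a support counter for every candidate itemset contained in each record, after which the frequent sets are read off the counter map.

```java
import java.util.*;

public class SupportCount {
    // One pass over the records: count every candidate's support.
    public static Map<List<String>, Integer> countSupports(
            List<List<String>> records, List<List<String>> candidates) {
        Map<List<String>, Integer> support = new HashMap<>();
        for (List<String> c : candidates) support.put(c, 0);
        for (List<String> r : records)               // single scan of the data
            for (List<String> c : candidates)
                if (r.containsAll(c))                // record contains candidate
                    support.merge(c, 1, Integer::sum);
        return support;
    }

    public static void main(String[] args) {
        List<List<String>> records = List.of(
            List.of("A", "B", "C"), List.of("A", "C"), List.of("B", "C"));
        List<List<String>> C1 = List.of(
            List.of("A"), List.of("B"), List.of("C"));
        // Supports after one scan: A=2, B=2, C=3.
        System.out.println(countSupports(records, C1));
    }
}
```

Whether the scan runs over the original records, over C(k,0), or over the temporary database changes only the input passed to `records`; the counting itself is identical, which is why shrinking that input reduces the work directly.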