Information Technology Reference
In-Depth Information
different classes. Especially, as subsequences of frequent patterns are all frequents,
their similarity is high. Therefore,
models
are restricted to maximal frequent patterns.
The second step assigns the sequences of the database to the classes that are repre-
sented by those
models.
This is performed based on similarity computation between
the database sequences and those
models
. However, most algorithms rather compute
the
dissimilarity distance
. Depending on the selected similarity algorithm, the more
the
dissimilarity distance
is small, the more the candidates are similar i.e. have com-
mon ordered items. We adopt LCSS algorithm because the presence of each item in
the sequence is here more relevant than its occurrence probability (see section 2). A
dissimilarity
threshold
allows the user to adjust the
overlapping
between classes as
well as the class coverage. The more this threshold will be high, the more the
over-
lapping
probability will be high. Inversely, if this threshold is too small, too many
database sequences may not be clustered.
The final step is to discover the main characteristics of individuals (profiles) charac-
terizing each
class
i.e. the dimension values specific to a given
class
. In order to extract
such knowledge, we propose an algorithm based on the definition of the characteriza-
tion proposed by Han and. Kamber [9]. The algorithm is detailed in Fig. 15 above.
8.5.3 Experimentation
This experimentation has been done using the real dataset provided by
household
activity survey
. The dataset contains 10840 individuals, with the dimensions: gender,
age (10 classes of age) and work types (about 9). The total number of distinct se-
quence activities is about 3429. The tests have been realized using the following
threshold values:
support
=0.1 ;
dissimilarity
=1 ;
characterization
= 8.
The algorithm has found 17 classes. For instance, the sequential pattern HMRH has
been mined with a support equal to 0.14. Its similar sequences are: HWMRH, HMHR,
HMRH, HMMRH, HRH, HLMRH, HMH, and HMRRH. This set of sequences com-
poses the class of HMRH. The following rules have been mined for this class:
(gender = F)
∧
(30 < age < 40)
∧
(job category = TE8)
Î
HMRH [9.064].
(gender = F)
∧
(50 < age < 60)
∧
(job category = TE7)
Î
HMRH [10.413].
(gender = F)
∧
(50 < age < 60)
∧
(job category = TE5)
Î
HMRH [12.466].
(job category = TE6)
Î
HMRH [10.407].
The three first rules mean that individuals having an activity sequence similar to
HMRH standing for (Home, Market, Restaurant, Home) are frequently (significance
between 9.064 and 12.466) women between 30 and 40 years old working as liberal
profession (here TE8), or women working as clerk (TE5) or without profession (TE7)
between 50 and 60 years old.
(gender = H)
∧
(60 < age < 70)
∧
8.6 Conclusion and Future Work
This chapter has proposed novel approaches for mining sequential patterns and multi-
dimensional characterization. The first part has presented a new algorithm
IBM
and its
variants
IBM2
,
IBM_OPT
and
IBM2_OPT
. The aim of this algorithm is to mine
frequent sequences in item sequences.
IBM
only makes one scan of the database and
provides efficient data structure that optimizes memory space as well as access costs.