Information Technology Reference
In-Depth Information
different classes. Especially, as subsequences of frequent patterns are all frequents,
their similarity is high. Therefore, models are restricted to maximal frequent patterns.
The second step assigns the sequences of the database to the classes that are repre-
sented by those models. This is performed based on similarity computation between
the database sequences and those models . However, most algorithms rather compute
the dissimilarity distance . Depending on the selected similarity algorithm, the more
the dissimilarity distance is small, the more the candidates are similar i.e. have com-
mon ordered items. We adopt LCSS algorithm because the presence of each item in
the sequence is here more relevant than its occurrence probability (see section 2). A
dissimilarity threshold allows the user to adjust the overlapping between classes as
well as the class coverage. The more this threshold will be high, the more the over-
lapping probability will be high. Inversely, if this threshold is too small, too many
database sequences may not be clustered.
The final step is to discover the main characteristics of individuals (profiles) charac-
terizing each class i.e. the dimension values specific to a given class . In order to extract
such knowledge, we propose an algorithm based on the definition of the characteriza-
tion proposed by Han and. Kamber [9]. The algorithm is detailed in Fig. 15 above.
8.5.3 Experimentation
This experimentation has been done using the real dataset provided by household
activity survey . The dataset contains 10840 individuals, with the dimensions: gender,
age (10 classes of age) and work types (about 9). The total number of distinct se-
quence activities is about 3429. The tests have been realized using the following
threshold values: support =0.1 ; dissimilarity =1 ; characterization = 8.
The algorithm has found 17 classes. For instance, the sequential pattern HMRH has
been mined with a support equal to 0.14. Its similar sequences are: HWMRH, HMHR,
HMRH, HMMRH, HRH, HLMRH, HMH, and HMRRH. This set of sequences com-
poses the class of HMRH. The following rules have been mined for this class:
(gender = F)
(30 < age < 40)
(job category = TE8) Î HMRH [9.064].
(gender = F)
(50 < age < 60)
(job category = TE7) Î HMRH [10.413].
(gender = F)
(50 < age < 60)
(job category = TE5) Î HMRH [12.466].
(job category = TE6) Î HMRH [10.407].
The three first rules mean that individuals having an activity sequence similar to
HMRH standing for (Home, Market, Restaurant, Home) are frequently (significance
between 9.064 and 12.466) women between 30 and 40 years old working as liberal
profession (here TE8), or women working as clerk (TE5) or without profession (TE7)
between 50 and 60 years old.
(gender = H)
(60 < age < 70)
8.6 Conclusion and Future Work
This chapter has proposed novel approaches for mining sequential patterns and multi-
dimensional characterization. The first part has presented a new algorithm IBM and its
variants IBM2 , IBM_OPT and IBM2_OPT . The aim of this algorithm is to mine
frequent sequences in item sequences. IBM only makes one scan of the database and
provides efficient data structure that optimizes memory space as well as access costs.
Search WWH ::




Custom Search