From Sequence Mining to Multidimensional Sequence Mining - Mining Complex Data

Information Technology Reference

In-Depth Information

different classes. Especially, as subsequences of frequent patterns are all frequents,

their similarity is high. Therefore, models are restricted to maximal frequent patterns.

The second step assigns the sequences of the database to the classes that are repre-

sented by those models. This is performed based on similarity computation between

the database sequences and those models . However, most algorithms rather compute

the dissimilarity distance . Depending on the selected similarity algorithm, the more

the dissimilarity distance is small, the more the candidates are similar i.e. have com-

mon ordered items. We adopt LCSS algorithm because the presence of each item in

the sequence is here more relevant than its occurrence probability (see section 2). A

dissimilarity threshold allows the user to adjust the overlapping between classes as

well as the class coverage. The more this threshold will be high, the more the over-

lapping probability will be high. Inversely, if this threshold is too small, too many

database sequences may not be clustered.

The final step is to discover the main characteristics of individuals (profiles) charac-

terizing each class i.e. the dimension values specific to a given class . In order to extract

such knowledge, we propose an algorithm based on the definition of the characteriza-

tion proposed by Han and. Kamber [9]. The algorithm is detailed in Fig. 15 above.

8.5.3 Experimentation

This experimentation has been done using the real dataset provided by household

activity survey . The dataset contains 10840 individuals, with the dimensions: gender,

age (10 classes of age) and work types (about 9). The total number of distinct se-

quence activities is about 3429. The tests have been realized using the following

threshold values: support =0.1 ; dissimilarity =1 ; characterization = 8.

The algorithm has found 17 classes. For instance, the sequential pattern HMRH has

been mined with a support equal to 0.14. Its similar sequences are: HWMRH, HMHR,

HMRH, HMMRH, HRH, HLMRH, HMH, and HMRRH. This set of sequences com-

poses the class of HMRH. The following rules have been mined for this class:

(gender = F)

∧

(30 < age < 40)

∧

(job category = TE8) Î HMRH [9.064].

(gender = F)

∧

(50 < age < 60)

∧

(job category = TE7) Î HMRH [10.413].

(gender = F)

∧

(50 < age < 60)

∧

(job category = TE5) Î HMRH [12.466].

(job category = TE6) Î HMRH [10.407].

The three first rules mean that individuals having an activity sequence similar to

HMRH standing for (Home, Market, Restaurant, Home) are frequently (significance

between 9.064 and 12.466) women between 30 and 40 years old working as liberal

profession (here TE8), or women working as clerk (TE5) or without profession (TE7)

between 50 and 60 years old.

(gender = H)

∧

(60 < age < 70)

∧

8.6 Conclusion and Future Work

This chapter has proposed novel approaches for mining sequential patterns and multi-

dimensional characterization. The first part has presented a new algorithm IBM and its

variants IBM2 , IBM_OPT and IBM2_OPT . The aim of this algorithm is to mine

frequent sequences in item sequences. IBM only makes one scan of the database and

provides efficient data structure that optimizes memory space as well as access costs.

Mining Complex Data

Search WWH ::

Custom Search

Home