Information Technology Reference
In-Depth Information
Activity programs of most individuals may be the same or be similar. Each activity
program could be seen as a sequence of single value, making it possible to discover
frequent activity sequences that characterise groups of the surveyed individuals. This
allows analyzing the mobility of this urban population. Likewise, when considering
transport mode, schedules or duration sequences, it would be possible to determine a
typology of used transport modes, schedules, and so on. Besides, the survey holds
other information about individuals as their age, gender, profession, and so on. Then,
each group having similar activity pattern is likely to have some characteristic
attribute values. Hence, it is very relevant to characterize those groups and their corre-
sponding sequential patterns. Those data have been used in the second part yielding
characteristic rules for the groups of the surveyed individuals that share approxi-
mately those activity patterns. Moreover, other datasets have been used mainly to test
the scalability and to compare our algorithms in the most widely used contexts, such
as public datasets 1 .
8.3 A Sequential Pattern Mining Algorithm
This section proposes an algorithm of sequential pattern mining. We focus on the
specific case where the considered sequences composed of single items instead of
item-sets. We argue that this is the case if the most popular sequences, such as DNA
[9], documents, web-logs, or activity program sequences. Our algorithm is compared
to PrefixSPAN , and SPAN, ones of the most efficient mining algorithms.
8.3.1 Algorithm Overview
The proposed algorithm is two phases. The first stage is the data encoding into a
memory resident data structures. The second one is the frequent generation that in
turn is composed of candidate generation, and candidate support checking. This algo-
rithm only makes one scan of the database during which the total number of distinct
sequences, the frequency of these sequences and the number of sequences by size are
computed. This allows computing the support of each generated sequence.
The backbone of our approach is its main memory data structure, called IBM as
Indexed Bit Map. It is composed of four elements: (i) a Bitmap: a matrix that repre-
sents the distinct sequences of the database, (ii) SV : a sequence vector that encodes all
the ordered combinations of sequences, (iii) INDEX: an index on the Bitmap that
allows a direct access to sequences according to their size, (iv) NB: a table associated
to the Bitmap which informs about the frequency of each distinct sequence (Fig. 8.1).
Only distinct sequences are stored in the Bitmap, they are classified by decreasing
size. An index by size allows a direct access to sequences according to their size,
which optimizes the candidate generation and counting phase of the algorithm. In the
example of Fig. 8.1, IBM encodes the whole distinct sequences of the database.
Notice that Index and the Bitmap are numbered down-up. Here, there are 6 distinct
sequences of size 1 to 5: H, W, HSR, HSH, HSMH and HSRWH. Each cell of the
INDEX indicates the first line where the corresponding size of sequence is stored. For
1 http://kdd.ics.uci.edu
Search WWH ::




Custom Search