ii) Number of Rules: Association rule systems that support itemset and sequence
mining usually generate a huge number of rules, so it is difficult for the user to
decide which rules to use.
iii) Scalability: Since most of the existing algorithms use a lattice structure in the
search space and need to scan the database more than once, they do not scale to
very large databases.
In order to create holistic environments in data mining, micro-patterns and
macro-patterns must be differentiated [7]. Micro-patterns correspond to small
percentages of the data; for instance, in association rules it is usual to require a
support ≥ 5% and to choose the rules with high confidence. Macro-patterns, on the
other hand, involve a large percentage of the data; for example, a regression model
uses all data elements. Micro-patterns are characterized by high confidence, while
macro-patterns are characterized by high support. There are other examples of
micro-patterns: in sequence mining, a support of ≥ 1% is commonly used; in
classification with decision trees, each branch of the tree corresponds to a small
percentage of the data; in classification with k-nearest neighbors, comparisons are
made against only a small number, k, of elements. Finally, macro-patterns arise in
techniques such as regression, hypothesis testing, clustering or attribute reduction,
where all data are taken into account.
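
To make the support/confidence contrast concrete, the following is a minimal Python sketch on a toy transaction set. The data, the item names and the helper names `support` and `confidence` are illustrative assumptions and do not come from the paper; the measures follow the standard association-rule definitions.

```python
from typing import FrozenSet, Iterable, List


def support(transactions: List[FrozenSet[str]], itemset: Iterable[str]) -> float:
    """Fraction of transactions that contain every item in `itemset`."""
    items = frozenset(itemset)
    hits = sum(1 for t in transactions if items <= t)
    return hits / len(transactions)


def confidence(transactions: List[FrozenSet[str]],
               antecedent: Iterable[str],
               consequent: Iterable[str]) -> float:
    """support(antecedent and consequent) / support(antecedent)."""
    antecedent = frozenset(antecedent)
    base = support(transactions, antecedent)
    if base == 0:
        return 0.0
    joint = support(transactions, antecedent | frozenset(consequent))
    return joint / base


# Hypothetical toy data: 20 baskets, invented for illustration only.
transactions = (
    [frozenset({"caviar", "champagne"})]
    + [frozenset({"bread", "milk"})] * 12
    + [frozenset({"bread"})] * 7
)

# A rule covering only 5% of the data, yet with confidence 1.0
# (the micro-pattern profile described in the text).
micro_support = support(transactions, {"caviar", "champagne"})
micro_confidence = confidence(transactions, {"caviar"}, {"champagne"})

# An itemset present in almost every basket (high support).
bread_support = support(transactions, {"bread"})
```

In this toy set the rule caviar → champagne holds in a single basket out of twenty, so it barely meets a 5% support threshold while being fully confident, whereas "bread" appears in 19 of the 20 baskets.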
The proposed algorithm was given the Latin name Ramex, meaning “branch of a
tree”. Ramex introduces a new vision for classic problems of sequence mining: it
considers the accumulation of events and allows the search for macro-patterns
instead of searching only for micro-patterns.
Ramex provides a comprehensive view of the sequences. It lets the user visualize
the data sequences as a special kind of tree, a poly-tree, which shows all the items
but only the most relevant sequences, giving an x-ray of the dataset.
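
As a rough illustration of what accumulating events over a set of sequences could look like, the sketch below counts how often one item immediately follows another across all sequences. This is only one possible reading, stated as an assumption: the algorithm itself is defined in Section 3, the sketch does not build a poly-tree, and the function name `count_transitions` and the sample sequences are hypothetical.

```python
from collections import Counter
from typing import Hashable, Iterable, Sequence, Tuple

Transition = Tuple[Hashable, Hashable]


def count_transitions(sequences: Iterable[Sequence[Hashable]]) -> Counter:
    """Accumulate counts of consecutive item pairs (a, b) over all sequences."""
    counts: Counter = Counter()
    for seq in sequences:
        # zip(seq, seq[1:]) yields each adjacent pair in order.
        for a, b in zip(seq, seq[1:]):
            counts[(a, b)] += 1
    return counts


# Hypothetical click sequences, invented for illustration only.
sequences = [
    ["home", "products", "cart"],
    ["home", "products", "help"],
    ["home", "cart"],
    ["products", "cart"],
]

counts = count_transitions(sequences)

# The heaviest transitions are candidates for the "most relevant" links.
heaviest = counts.most_common(2)
```

Every item still appears in at least one counted pair, while ranking the pairs by accumulated weight gives a simple way to decide which links to emphasize.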
Ramex has been applied in different scenarios: web mining [6] and financial
studies [16], [20]. This paper also includes part of the work of [8].
In Section 2, related work on sequence and process mining is presented. In
Section 3, the proposed algorithm is described and a numeric example is given. In
Section 4, computational results are reported using the IBM Quest synthetic dataset
generator. Finally, in Section 5, we draw some conclusions.
2 Related Work
In this section, concepts are presented and definitions are established so that they
can be reused in the following sections. Sequence mining is reviewed, process mining
is introduced and the definition of a poly-tree is established.
2.1 Sequence Mining
The problem of pattern discovery is to extract interesting patterns from the data. It is
difficult to define what makes a pattern interesting. Temporal data mining can be
divided into four different approaches: Periodic Patterns [22], Sequential Discovery [4],
Frequent Episodes [15] and Markov Chain Models [5].