the first table with one or more matching records in the second table. The
OCCT algorithm characterizes the entities that should be linked together.
The tree is built such that it is easy to understand and transform into
association rules, i.e. the inner nodes consist only of features describing the
first set of entities, while the leaves of the tree represent features of their
matching entities from the second dataset. OCCT can be applied with four
different splitting criteria:
•
Coarse-grained Jaccard (CGJ) coefficient — It is based on the Jaccard
similarity coefficient described above. It aims to choose the splitting
attribute which leads to the smallest possible similarity between the
subsets (i.e. an attribute that generates subsets that are different from
each other as much as possible). In order to do so, we need to examine each
of the possible splitting attributes and measure the similarity between
the subsets.
•
Fine-grained Jaccard (FGJ) coefficient — The fine-grained Jaccard
coefficient is capable of identifying partial record matches, as opposed
to the coarse-grained method, which identifies exact matches only. It not
only considers records which are exactly identical, but also checks to
what extent each possible pair of records is similar.
•
Least probable intersections (LPI) — In this measure the optimal
splitting attribute is the attribute that leads to the minimum number
of instances shared between the two item-sets. The criterion relies
on the cumulative distribution function (CDF) of the Poisson distribution
and is described in detail in the next chapter.
•
Maximum likelihood estimation (MLE) — Given a candidate split, a
probabilistic model (such as a decision tree) is trained for each of the
split's subsets. The idea is to choose the split that achieves the maximal
likelihood.
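To make the first criterion concrete, the following is a minimal sketch of coarse-grained Jaccard split selection. It assumes (as an illustration, not the source's actual data model) that linked records are given as pairs of a first-table record and an identifier of its matching second-table record; the function names and data layout are hypothetical. For each candidate attribute, the records are partitioned by attribute value, and the attribute whose subsets have the least similar sets of matching records (lowest average pairwise Jaccard) is chosen.

```python
def jaccard(a, b):
    """Coarse-grained Jaccard coefficient: |A ∩ B| / |A ∪ B| over exact matches."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def best_cgj_split(pairs, attributes):
    """Pick the splitting attribute that minimizes similarity between subsets.

    pairs: list of (record, match) where record is a dict of first-table
    attribute values and match identifies the linked second-table record.
    attributes: candidate splitting attributes (hypothetical names).
    """
    best_attr, best_score = None, float("inf")
    for attr in attributes:
        # Group the matched second-table records by the attribute's value.
        groups = {}
        for rec, match in pairs:
            groups.setdefault(rec[attr], set()).add(match)
        if len(groups) < 2:
            continue  # a degenerate split yields no information
        subsets = list(groups.values())
        # Average pairwise similarity between the subsets' matched-record sets;
        # lower means the subsets differ from each other as much as possible.
        sims = [jaccard(subsets[i], subsets[j])
                for i in range(len(subsets))
                for j in range(i + 1, len(subsets))]
        score = sum(sims) / len(sims)
        if score < best_score:
            best_attr, best_score = attr, score
    return best_attr, best_score
```

For example, an attribute that cleanly separates the records matched to "x" from those matched to "y" scores 0 and is preferred over one that mixes them. The fine-grained variant would replace the exact-match Jaccard with a graded per-pair similarity.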
8.5 Hidden Markov Model Trees
A Hidden Markov Model (HMM) is a sequence-learning method that
estimates the probability of sequences by training on data. An HMM is
a type of dynamic Bayesian network (DBN). It is a stochastic process
with an underlying unobservable (hidden) stochastic process that can only
be observed through another set of stochastic processes that produce the
sequence of observed symbols. HMM can be viewed as a specific instance of
a state-space model in which the latent variables are discrete. In an HMM,
the probability distribution of z_n depends on the state of the previous latent