the first table with one or more matching records in the second table. The
OCCT algorithm characterizes the entities that should be linked together.
The tree is built such that it is easy to understand and transform into
association rules, i.e. the inner nodes consist only of features describing the
first set of entities, while the leaves of the tree represent features of their
matching entities from the second dataset. OCCT can be applied with four
different splitting criteria:
•
Coarse-grained Jaccard (CGJ) coefficient — It is based on the Jaccard
similarity coefficient described above. It aims to choose the splitting
attribute which leads to the smallest possible similarity between the
subsets (i.e. an attribute that generates subsets that are different from
each other as much as possible). In order to do so, we need to examine each
of the possible splitting attributes and measure the similarity between
the subsets.
•
Fine-grained Jaccard (FGJ) coefficient — The fine-grained Jaccard
coefficient is capable of identifying partial record matches, as opposed
to the coarse-grained method, which identifies exact matches only. It not
only considers records which are exactly identical, but also checks to
what extent each possible pair of records is similar.
•
Least probable intersections (LPI) — In this measure the optimal
splitting attribute is the attribute that leads to the minimum number
of instances shared between the two item-sets. The criterion relies
on the cumulative distribution function (CDF) of the Poisson distribution
and is described in detail in the next chapter.
•
Maximum likelihood estimation (MLE) — Given a candidate split, a
probabilistic model (such as a decision tree) is trained for each of the
split's subsets. The idea is to choose the split that achieves the maximal
likelihood.
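To make the first criterion concrete, the following is a minimal sketch of coarse-grained Jaccard split selection. It assumes (as an illustration, not the source's actual data model) that linked records are given as pairs of a first-table record and an identifier of its matching second-table record; the function names and data layout are hypothetical. For each candidate attribute, the records are partitioned by attribute value, and the attribute whose subsets have the least similar sets of matching records (lowest average pairwise Jaccard) is chosen.

```python
def jaccard(a, b):
    """Coarse-grained Jaccard coefficient: |A ∩ B| / |A ∪ B| over exact matches."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def best_cgj_split(pairs, attributes):
    """Pick the splitting attribute that minimizes similarity between subsets.

    pairs: list of (record, match) where record is a dict of first-table
    attribute values and match identifies the linked second-table record.
    attributes: candidate splitting attributes (hypothetical names).
    """
    best_attr, best_score = None, float("inf")
    for attr in attributes:
        # Group the matched second-table records by the attribute's value.
        groups = {}
        for rec, match in pairs:
            groups.setdefault(rec[attr], set()).add(match)
        if len(groups) < 2:
            continue  # a degenerate split yields no information
        subsets = list(groups.values())
        # Average pairwise similarity between the subsets' matched-record sets;
        # lower means the subsets differ from each other as much as possible.
        sims = [jaccard(subsets[i], subsets[j])
                for i in range(len(subsets))
                for j in range(i + 1, len(subsets))]
        score = sum(sims) / len(sims)
        if score < best_score:
            best_attr, best_score = attr, score
    return best_attr, best_score
```

For example, an attribute that cleanly separates the records matched to "x" from those matched to "y" scores 0 and is preferred over one that mixes them. The fine-grained variant would replace the exact-match Jaccard with a graded per-pair similarity.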
8.5 Hidden Markov Model Trees
A Hidden Markov Model (HMM) is a sequence-learning method that
estimates the probability of sequences by training on data. An HMM is
a type of dynamic Bayesian network (DBN). It is a stochastic process
with an underlying unobservable (hidden) stochastic process that can only
be observed through another set of stochastic processes that produce the
sequence of observed symbols. HMM can be viewed as a specific instance of
a state-space model in which the latent variables are discrete. In an HMM,
the probability distribution of z_n depends on the state of the previous latent