1.4 Example
At the end of this chapter, we illustrate the main characteristics of kernel
methods with an example. Figure 1.1 in Section 1.2 introduced concepts such
as the kernel function, the kernel matrix, and the pattern analysis algorithm;
we now see how they work in practice.
In this example (see (5) for more details), we model the evolution of
linguistic sequences by comparing their statistical properties. We will see
that languages belonging to the same linguistic family have very similar
statistical properties. We use these properties to embed the sequences into a
vector space, obtain their pairwise distances, and hypothesize an evolutionary
tree. The languages are compared using the p-spectrum kernel and the mismatch
kernel. Both approaches are based on computing the distance between documents
in feature space, as defined in equation (1.1) in Section 1.2.1.
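
As a minimal sketch of the comparison described above, the following Python code computes a p-spectrum kernel by counting contiguous substrings of length p, and then derives a distance from it. Equation (1.1) is not reproduced in this section, so the distance used here is an assumption: the standard kernel-induced distance, whose square is k(s, s) + k(t, t) - 2 k(s, t). The function names are illustrative and do not come from the original text.

```python
from collections import Counter
import math


def spectrum_features(text, p):
    """Count every contiguous substring of length p in text."""
    return Counter(text[i:i + p] for i in range(len(text) - p + 1))


def spectrum_kernel(s, t, p):
    """Inner product of the p-gram count vectors of s and t."""
    fs, ft = spectrum_features(s, p), spectrum_features(t, p)
    if len(fs) > len(ft):
        fs, ft = ft, fs  # iterate over the smaller feature set
    # A Counter returns 0 for p-grams it has never seen.
    return sum(count * ft[gram] for gram, count in fs.items())


def kernel_distance(s, t, p):
    """Distance in feature space (assumed form of equation (1.1))."""
    squared = (spectrum_kernel(s, s, p)
               + spectrum_kernel(t, t, p)
               - 2 * spectrum_kernel(s, t, p))
    # Guard against tiny negative values before taking the square root.
    return math.sqrt(max(squared, 0.0))
```

Two documents that share many p-grams have a large kernel value and therefore a small distance; identical documents are at distance zero.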
We have used the language dataset introduced by (2). It consists of
translations of the “Universal Declaration of Human Rights” (20) into
languages from the most important language branches: Romance, Celtic,
Germanic, Slavic, Ugrofinnic, Altaic, Baltic, and the Basque language. Our
dataset contains 42 languages from this collection. Each document has been
preprocessed and transformed into a string over the letters of the English
alphabet plus the space.
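
The text does not say how the documents were reduced to this 27-symbol alphabet, so the sketch below is only one plausible way to do it. It assumes that text is lowercased, that diacritics are stripped from accented letters, that every other character is replaced by a space, and that runs of spaces are collapsed. The function name `normalize_document` is hypothetical.

```python
import re
import unicodedata


def normalize_document(text):
    """Map text to lowercase English letters plus single spaces.

    Assumed rules, not taken from the original study: strip diacritics,
    lowercase, turn anything that is not a-z into a space, collapse spaces.
    """
    # Split accented letters into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks, keeping the base letters.
    no_marks = "".join(ch for ch in decomposed
                       if not unicodedata.combining(ch))
    lowered = no_marks.lower()
    # Replace digits, punctuation and non-English letters with spaces.
    letters_only = re.sub(r"[^a-z\s]", " ", lowered)
    # Collapse whitespace runs and trim the ends.
    return re.sub(r"\s+", " ", letters_only).strip()
```

For instance, `normalize_document("Déclaration  universelle, 1948!")` gives `"declaration universelle"`.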
The experiments have been performed with p = 4, allowing one mismatch for the
mismatch kernel. With each kernel, we have obtained a kernel matrix of size
42 × 42. From the kernel matrix we have computed the distance matrix using
equation (1.1). To the distance matrix we have applied two different pattern
analysis algorithms: the neighbor joining (NJ) algorithm (22; 28) and the
multidimensional scaling (MDS) algorithm (14). NJ is a standard method in
computational biology for reconstructing phylogenetic trees from pairwise
distances between the leaf taxa. MDS is a visualization tool for the
exploratory analysis of high-dimensional data.
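
The pipeline from documents to kernel matrix to distance matrix can be sketched as follows for the mismatch kernel with one mismatch. This is a straightforward, unoptimized illustration rather than the implementation used in the experiments: each substring of length p contributes to every p-gram that differs from it in at most one position. The names `mismatch_features`, `kernel_matrix`, and `distance_matrix` are hypothetical, and the distance again assumes that equation (1.1) is the usual kernel-induced distance.

```python
from collections import Counter
import math
import string

# Assumed alphabet: 26 lowercase English letters plus the space.
ALPHABET = string.ascii_lowercase + " "


def mismatch_features(text, p, alphabet=ALPHABET):
    """Count, for each p-gram, the substrings within one mismatch of it."""
    feats = Counter()
    for i in range(len(text) - p + 1):
        gram = text[i:i + p]
        feats[gram] += 1  # zero mismatches
        for pos in range(p):  # exactly one mismatch at position pos
            for ch in alphabet:
                if ch != gram[pos]:
                    feats[gram[:pos] + ch + gram[pos + 1:]] += 1
    return feats


def kernel_matrix(docs, p):
    """Symmetric matrix of mismatch-kernel values for all document pairs."""
    feats = [mismatch_features(d, p) for d in docs]
    n = len(docs)
    K = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            small, large = sorted((feats[i], feats[j]), key=len)
            value = sum(c * large[g] for g, c in small.items())
            K[i][j] = K[j][i] = value
    return K


def distance_matrix(K):
    """Pairwise feature-space distances derived from a kernel matrix."""
    n = len(K)
    return [[math.sqrt(max(K[i][i] + K[j][j] - 2 * K[i][j], 0.0))
             for j in range(n)] for i in range(n)]
```

With 42 preprocessed documents and p = 4, `kernel_matrix` would return a 42 × 42 matrix and `distance_matrix` the matrix that a tree-building or visualization algorithm takes as input.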
Here we present the results for the p-spectrum kernel with p = 4. There are
various elements of interest, both where the results match the accepted
taxonomy and where they (apparently) violate it. The neighbor joining tree
(see Figure 1.2) correctly recovers most of the families and subfamilies
known from linguistics. An analysis of the branching order of the various
subfamilies shows that our statistical analysis can capture interesting
relations, e.g., the recent split of the Slavic languages in the Balkans; the
existence of a Scandinavian subfamily within the Germanic family; the
relation between Afrikaans and Dutch; the Celtic cluster; and the very
structured Romance family. The MDS plot (Figure 1.3) shows that English ends
up halfway between the Romance and Germanic clusters, and that Romanian is
close to both the Slavic and Turkic clusters.
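
A two-dimensional layout like the MDS plot discussed above could be obtained from a distance matrix along the following lines. The text does not state which MDS variant was used, so this sketch assumes classical (Torgerson) MDS: double-center the squared distances and keep the leading eigenvectors. To stay dependency-free it finds them by power iteration with deflation; in practice a linear-algebra library's eigendecomposition would be the usual choice. The function name and its parameters are hypothetical.

```python
import math
import random


def classical_mds(dist, dims=2, iters=500, seed=0):
    """Embed n points in `dims` dimensions from an n x n distance matrix."""
    n = len(dist)
    sq = [[d * d for d in row] for row in dist]
    row_mean = [sum(row) / n for row in sq]
    total_mean = sum(row_mean) / n
    # Double centering gives the Gram matrix of the centered points.
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + total_mean)
          for j in range(n)] for i in range(n)]
    rng = random.Random(seed)
    coords = [[0.0] * dims for _ in range(n)]
    for k in range(dims):
        # Power iteration converges to the dominant eigenvector of b.
        v = [rng.random() for _ in range(n)]
        for _ in range(iters):
            w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:
                break
            v = [x / norm for x in w]
        # Rayleigh quotient: the eigenvalue that belongs to v.
        eig = sum(v[i] * sum(b[i][j] * v[j] for j in range(n))
                  for i in range(n))
        if eig <= 1e-12:
            break  # no further positive directions to extract
        for i in range(n):
            coords[i][k] = v[i] * math.sqrt(eig)
        # Deflation: remove the component just found from b.
        for i in range(n):
            for j in range(n):
                b[i][j] -= eig * v[i] * v[j]
    return coords
```

For distances that come from a kernel, the centered matrix is positive semidefinite, so the leading eigenvalues are nonnegative and the embedding preserves as much of the pairwise distance structure as two dimensions allow.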