Analysis of Text Patterns Using Kernel Methods - Text Mining: Classification, Clustering, and Applications

1.4 Example

At the end of this chapter, we illustrate the main characteristics of kernel methods with an example. In Figure 1.1 in Section 1.2, we introduced concepts such as kernel function, kernel matrix, and pattern analysis algorithm; now we see how they work in practice.

In this example (see (5) for more details), we model the evolution of linguistic sequences by comparing their statistical properties. We will see that languages belonging to the same linguistic family have very similar statistical properties. We use these properties to embed the sequences into a vector space, obtain their pairwise distances, and hypothesize an evolutionary tree. The comparison among languages is performed with the p-spectrum kernel and the mismatch kernel. Both approaches are based on computing the distance between documents in feature space, as defined in equation (1.1) in Section 1.2.1.
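The following is a minimal Python sketch of a p-spectrum kernel and of the distance it induces in feature space. Equation (1.1) is not reproduced in this section, so the sketch assumes it is the usual kernel-induced distance, d(s, t)² = k(s, s) − 2k(s, t) + k(t, t). The function names `spectrum_features`, `spectrum_kernel`, and `kernel_distance` are illustrative and do not come from the original experiments.

```python
from collections import Counter
from math import sqrt


def spectrum_features(s, p=4):
    """Count every contiguous substring of length p occurring in s."""
    return Counter(s[i:i + p] for i in range(len(s) - p + 1))


def spectrum_kernel(s, t, p=4):
    """p-spectrum kernel: inner product of the two substring-count vectors."""
    fs, ft = spectrum_features(s, p), spectrum_features(t, p)
    # Iterate over the smaller feature map; a Counter returns 0 for missing keys.
    if len(fs) > len(ft):
        fs, ft = ft, fs
    return sum(count * ft[kmer] for kmer, count in fs.items())


def kernel_distance(s, t, kernel=spectrum_kernel, **kwargs):
    """Distance in feature space, assuming the standard kernel-induced form."""
    squared = (kernel(s, s, **kwargs)
               - 2 * kernel(s, t, **kwargs)
               + kernel(t, t, **kwargs))
    # Clamp at zero in case of a slightly negative value with float-valued kernels.
    return sqrt(max(squared, 0.0))
```

For instance, `kernel_distance("abab", "baba", p=2)` compares the count vectors {ab: 2, ba: 1} and {ba: 2, ab: 1}. Two strings with identical p-spectra are at distance zero even if the strings themselves differ.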

We have used the language dataset introduced by (2). It consists of translations of the “Universal Declaration of Human Rights” (20) into languages from the most important language branches: Romance, Celtic, Germanic, Slavic, Ugrofinnic, Altaic, Baltic, and the Basque language. We use 42 languages from this dataset. Each document has been preprocessed and transformed into a string over the letters of the English alphabet plus the space.
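The exact preprocessing rules are not given here, so the following Python sketch is only one plausible way to reduce a document to the 26 English letters plus the space. It assumes that accented letters are mapped to their base letter and that every other character acts as a word separator; the function name `normalize` is hypothetical.

```python
import re
import unicodedata


def normalize(text):
    """Reduce text to lower-case a-z and single spaces (assumed preprocessing)."""
    # Decompose accented characters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Replace each run of characters outside a-z with one space.
    letters_only = re.sub(r"[^a-z]+", " ", stripped.lower())
    return letters_only.strip()
```

Characters without a canonical decomposition (for example "ß" or "ø") and letters from non-Latin scripts are treated as separators by this sketch; a real pipeline would need an explicit transliteration step for them.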

The experiments were performed with p = 4, allowing one mismatch in the case of the mismatch kernel. With each kernel, we obtained a kernel matrix of size 42 × 42. From the kernel matrix we computed the distance matrix using equation (1.1). On the distance matrix, we applied two different pattern analysis algorithms: the neighbor joining (NJ) algorithm (22; 28) and the multidimensional scaling (MDS) algorithm (14). NJ is a standard method in computational biology for reconstructing phylogenetic trees based on pairwise distances between the leaf taxa. MDS is a visualization tool for the exploratory analysis of high-dimensional data.
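The pipeline from documents to a distance matrix can be sketched in Python as follows. The sketch uses a mismatch kernel with one allowed mismatch, in which each length-p substring contributes to every string over the alphabet within Hamming distance 1 of it. It again assumes that equation (1.1) is the standard kernel-induced distance. All function names are illustrative, and the code is written for clarity, not for the efficiency a real 42-document experiment would call for.

```python
from collections import Counter
from math import sqrt
import string

ALPHABET = string.ascii_lowercase + " "  # 26 letters plus the space


def neighbours(kmer, alphabet=ALPHABET):
    """All strings over `alphabet` at Hamming distance at most 1 from kmer."""
    out = {kmer}
    for i, ch in enumerate(kmer):
        for a in alphabet:
            if a != ch:
                out.add(kmer[:i] + a + kmer[i + 1:])
    return out


def mismatch_features(s, p=4):
    """Feature map of the mismatch kernel with one allowed mismatch."""
    feats = Counter()
    for i in range(len(s) - p + 1):
        feats.update(neighbours(s[i:i + p]))
    return feats


def kernel_matrix(docs, p=4):
    """Symmetric matrix of pairwise kernel values."""
    feats = [mismatch_features(d, p) for d in docs]
    n = len(docs)
    K = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            value = sum(c * feats[j][k] for k, c in feats[i].items())
            K[i][j] = K[j][i] = value
    return K


def distance_matrix(K):
    """Pairwise distances in feature space derived from a kernel matrix."""
    n = len(K)
    return [[sqrt(max(K[i][i] - 2 * K[i][j] + K[j][j], 0.0))
             for j in range(n)]
            for i in range(n)]
```

Calling `distance_matrix(kernel_matrix(docs))` on a list of 42 preprocessed documents would yield a 42 × 42 matrix of the kind that NJ or MDS takes as input.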

Here we present results for the p-spectrum kernel with p = 4; there are various elements of interest, both where they match the accepted taxonomy and where they (apparently) violate it. The neighbor joining tree (see Figure 1.2) correctly recovers most of the families and subfamilies that are known from linguistics. An analysis of the order of branching of various subfamilies shows that our statistical analysis can capture interesting relations, e.g., the recent split of the Slavic languages in the Balkans; the existence of a Scandinavian subfamily within the Germanic family; the relation between Afrikaans and Dutch; the Celtic cluster; and the very structured Romance family. A look at the MDS plot (Figure 1.3) shows that English ends up halfway between the Romance and Germanic clusters, and that Romanian is close to both the Slavic and Turkic clusters.
