- Extract morphosyntactic descriptors (automatically computed numeric values, such as
the number of vowels) for each word processed. Words were previously represented by
their Porter stems, but stemming alone does not have enough classification power to serve
as the sole instrument. Morphosyntactic descriptors are required to process text with
sufficient confidence levels (López De Luise, 2007d).
- Collapse syntagmas into a condensed internal representation (usually, selected
morphemes [6]). The resulting representation is called EBH (Estructura Básica
Homogénea, uniform basic structure). EBHs are linked with specific connectors.
- Calculate and set the morphosyntactic weighting p_o for E_ci.
Further details of these steps are outside the scope of this chapter; see López De Luise
(2008c) and López De Luise (2008).
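The descriptor-extraction step above can be illustrated with a minimal Python sketch. It is not the author's implementation: the chapter names only "number of vowels" as an example descriptor, so the other fields (length, consonant count), the vowel set, and the names `Descriptors`, `describe` and `describe_text` are assumptions made for illustration. The weighting p_o is not computed here because its definition is outside this section.

```python
from dataclasses import dataclass

# Assumption: vowel inventory covering Spanish and English text.
VOWELS = set("aeiouáéíóúü")


@dataclass(frozen=True)
class Descriptors:
    """Hypothetical numeric descriptors for one word."""
    word: str
    length: int      # number of alphabetic characters
    vowels: int      # number of vowels (the example named in the text)
    consonants: int  # remaining alphabetic characters


def describe(word: str) -> Descriptors:
    """Compute the illustrative descriptors for a single word."""
    letters = [c for c in word.lower() if c.isalpha()]
    vowels = sum(1 for c in letters if c in VOWELS)
    return Descriptors(word, len(letters), vowels, len(letters) - vowels)


def describe_text(text: str) -> list[Descriptors]:
    """Describe every whitespace-separated token that contains a letter."""
    return [describe(tok) for tok in text.split()
            if any(c.isalpha() for c in tok)]
```

For instance, `describe("Estructura")` yields a length of 10 with 4 vowels and 6 consonants. A real system would add many more descriptors; the point is only that each word becomes a fixed-size numeric record that a clustering filter can consume.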
3.2.3 Apply filtering using the most suitable approach
Since knowledge management depends on previous language experiences, filtering is a
dynamic process that adapts itself to current cognitive capabilities. Furthermore, as shown
in the Case Study section, filtering is a very sensitive step in the MLW transformation.
Filtering is composed of several filters. This chapter uses the following three
clustering algorithms: Simple K-means, Farthest First and Expectation Maximization
(Witten, 2005). They are applied sequentially for each new E_ce. When an E_ce is “mature”, the
filter no longer changes.
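To make one of the three filters concrete, the following is a minimal Python sketch of farthest-first traversal, the idea behind the Farthest First clusterer. It is an illustration under stated assumptions, not the chapter's code: points are plain numeric tuples, the distance is Euclidean rather than the p_o-based metric, and the first center is simply the first point (implementations commonly pick it at random). The function names `farthest_first` and `assign` are hypothetical.

```python
import math


def farthest_first(points, k):
    """Choose up to k centers by farthest-first traversal.

    Start from the first point (assumption: deterministic seed), then
    repeatedly add the point whose distance to its nearest chosen
    center is largest.
    """
    centers = [points[0]]
    while len(centers) < min(k, len(points)):
        nxt = max(points,
                  key=lambda p: min(math.dist(p, c) for c in centers))
        centers.append(nxt)
    return centers


def assign(points, centers):
    """Return, for each point, the index of its nearest center."""
    return [min(range(len(centers)),
                key=lambda i: math.dist(p, centers[i]))
            for p in points]
```

With the points `(0, 0)`, `(1, 0)`, `(10, 0)`, `(10, 1)` and `k = 2`, the centers are `(0, 0)` and `(10, 1)`, and `assign` groups the first two points in one partition and the last two in the other.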
The distance used to evaluate clustering is based on the similarity between the descriptor
values and the internal morphosyntactic metric, p_o, that weights the EBHs (which represent
morphemes). It has been shown that clusters generated with p_o represent consistent word
agglomerations (López De Luise, 2008, 2008b). Although this chapter does not use fuzzy
clustering algorithms, it is important to note that such filters require a specific adaptation for
distance using the categorical metrics defined in (López De Luise, 2007e).
3.2.4 If “Abstraction” granularity and details are inadequate for the current problem
Granularity is determined by the ability to discriminate the topic and by the degree of detail
required to represent the E_ci. In the MLW context it is the logical distance between the current
E_ci and the E_ce partitions [7] (see Figure 5). This distance depends on the desired learning
approach. In the example included herein (Section 4), it is the number of elements in the E_ci
that fall within each E_ce partition. The distribution of EBHs determines whether a new E_ce is
necessary. When the EBHs are too irregular, a new E_ce is built per step 3.2.4.1. Otherwise
the new E_ci is added to the partition that is the best match.
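The decision just described can be sketched in Python as follows. The chapter does not define when a distribution is "too irregular", so this sketch assumes one possible reading: the E_ci fits if at least a given share of its EBHs falls in a single partition. The threshold `min_share`, its default value, and the function name `place_eci` are hypothetical.

```python
from collections import Counter


def place_eci(partition_of_each_ebh, min_share=0.5):
    """Decide where a new E_ci goes, given the partition hit by each EBH.

    Returns (best_partition, counts). best_partition is None when the
    EBHs are spread too thinly (assumption: largest share < min_share),
    which signals that a new E_ce should be built.
    """
    if not partition_of_each_ebh:
        raise ValueError("an E_ci must contain at least one EBH")
    counts = Counter(partition_of_each_ebh)
    best, hits = counts.most_common(1)[0]
    share = hits / len(partition_of_each_ebh)
    if share < min_share:
        return None, counts  # too irregular: build a new E_ce
    return best, counts      # add the E_ci to the best-matching partition
```

For example, if the EBHs of an E_ci land in partitions `[0, 0, 1, 0]`, partition 0 holds three of four elements and is returned as the best match. If they land in `[0, 1, 2, 3]`, no partition dominates and the function returns `None`.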
3.2.4.1 Insert a new filter, E_ce, in the knowledge organization
The current E_ce is cleaned so that it keeps all the E_ci instances that best match its
partitions, and a new E_ce that includes all the E_ci instances that are not well represented
is created and linked.
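A minimal Python sketch of this split is shown below. The chapter does not specify how E_ce filters are stored or linked, so the `Ece` class, its `linked` field (a simple forward link), and the caller-supplied `fits` predicate are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Ece:
    """Hypothetical container for one filter and the E_ci it holds."""
    members: List[str] = field(default_factory=list)
    linked: Optional["Ece"] = None  # assumption: filters form a chain


def split_ece(ece: Ece, fits: Callable[[str], bool]) -> Ece:
    """Keep well-matched E_ci in `ece`; move the rest to a new, linked E_ce."""
    kept = [m for m in ece.members if fits(m)]
    rest = [m for m in ece.members if not fits(m)]
    new_ece = Ece(members=rest, linked=ece.linked)  # preserve any existing link
    ece.members = kept
    ece.linked = new_ece
    return new_ece
```

Here `fits` stands in for whatever test decides that an E_ci is well represented by the current partitions, for example the share-based check on partition counts.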
[6] A meaningful linguistic unit that cannot be divided into smaller meaningful parts.
[7] Partition in this context is a cluster obtained after the filtering process.