automated, as concepts identified in a small set of narratives may be used to assess
the similarity among the full set of narratives. This results in an iterative process of
coding and visualizing the insights obtained.
Next, we describe the application of a fully automated procedure, Latent Semantic
Analysis (LSA), which relies on a vector-space model (Salton et al., 1975), and
motivate the proposed adaptations of this approach towards a semi-automated
one.
6.2 Automated Approaches to Semantic Classification
A number of automated approaches exist for the assessment of semantic similarity
between documents (for an extensive review see Kaur and Hornof, 2005; Cohen and
Widdows, 2009). These approaches rely on the principle that the semantic similarity
between two documents relates to the degree of term co-occurrence in these docu-
ments (Deerwester et al., 1990). In this sense, every document may be characterized
as an n-dimensional vector where each element of the vector depicts the number of
times that a given term appears in the document. The similarity between documents
may then be computed in a high-dimensional geometrical space defined by these
vectors.
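This vector-space representation can be illustrated with a minimal sketch, assuming two toy "documents" and a simple whitespace tokenizer (both introduced here purely for illustration): each document becomes a vector of term counts over a shared vocabulary, and similarity is computed as the cosine of the angle between the vectors.

```python
import numpy as np

# Toy example: two short "documents" (illustrative, not from the chapter's corpus).
docs = ["the user clicks the link", "the user reads the page"]

# Shared vocabulary over both documents (bag of words, order discarded).
vocab = sorted({w for d in docs for w in d.split()})

# Characterize each document as an n-dimensional vector of term counts.
def term_vector(doc):
    words = doc.split()
    return np.array([words.count(t) for t in vocab], dtype=float)

v1, v2 = term_vector(docs[0]), term_vector(docs[1])

# Cosine similarity in the geometric space defined by these vectors.
similarity = v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2))
print(round(similarity, 3))  # → 0.714
```

The two documents share the high-count terms "the" and "user", which is what drives the relatively high similarity despite their differing content words.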
Latent Semantic Analysis (LSA) (Deerwester et al., 1990), also known as Latent
Semantic Indexing within the field of Information Retrieval, is one of the most pop-
ular vector-space approaches to semantic similarity measurement. It has been shown
to reflect human semantic similarity judgments quite accurately (Landauer and Du-
mais, 1997) and has been successfully applied in a number of contexts such as that
of identifying navigation problems in web sites (Katsanos et al., 2008) and structur-
ing and identifying trends in academic communities (Larsen et al., 2008a).
LSA starts by indexing all n terms that appear in a pool of m documents, and
constructs an n × m matrix A, where each element a_ij gives the number of times
term i appears in document j. As matrix A is high-dimensional and sparse, LSA
employs Singular Value Decomposition (SVD) to reduce the dimensionality of
the matrix and thus identify the principal latent dimensions in the data. Semantic
similarity can then be computed in this reduced-dimensionality space, which constitutes
a latent semantic space. Below, we describe in detail the procedure as applied in this
chapter.
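The matrix construction and SVD-based reduction described above can be sketched as follows; this is a minimal illustration assuming a toy 5 × 4 term-document count matrix (invented here for demonstration), not the chapter's actual data.

```python
import numpy as np

# Toy term-document count matrix A: 5 terms (rows) x 4 documents (columns).
A = np.array([
    [2, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 2, 1, 1],
    [0, 0, 1, 2],
    [1, 0, 0, 1],
], dtype=float)

# Singular Value Decomposition: A = U @ diag(s) @ Vt.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Keep only the k largest singular values: a rank-k latent semantic space.
k = 2
doc_vectors = (np.diag(s[:k]) @ Vt[:k]).T  # one row per document, k latent dims

# Semantic similarity between documents, computed in the latent space.
def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cos(doc_vectors[0], doc_vectors[2]))
```

Truncating to k dimensions is what makes the space "latent": terms that never co-occur directly can still end up close together if they co-occur with the same other terms.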
6.2.1 The Latent Semantic Analysis Procedure
6.2.1.1 Term Indexing
Term-indexing techniques vary from simple "bag-of-words" approaches, which
discard the syntactic structure of the document and merely index the full list of words
that appear in it, to natural-language algorithms that identify the part of
speech (e.g., the probability that a term is a noun or a verb) to infer the sense
of a word (Berry et al., 1999). LSA typically discards syntactic information and