Textual Signatures: Identifying Text-Types Using Latent Semantic Analysis to Measure the Cohesion of Text Structures - Natural Language Processing and Text Mining

the parts of the text are inter-dependent and, therefore, are likely to be structurally

inter-related. In addition, as we know that cohesion exists in texts at the clausal,

sentential, and paragraph level [22], it would be no surprise to find that cohesion

also existed across the parts of the text that constitute the whole of the text. If this

were not the case, parts of text would have to exist that bore no reference to the

text as a whole. Therefore, if we measure the cohesion that exists across identifiable

parts of the text, we can predict that the degree to which the parts co-refer will be

indicative of the kind of text being analyzed. In Labov's [31] narrative model, for

example, we might expect a high degree of coreference between the second section

(the orientation) and the sixth section (the coda): Although the two sections are

textually distant, they are semantically related in terms of the textual elements with

both sections likely to feature the characters, the motive of the story, and the scene

in which the story takes place. In contrast, we might expect less coreference between

the fourth and fifth sections (evaluation and resolution): While the evaluation and

resolution are textually juxtaposed, the evaluation section is likely to offer a more

global, moral and/or abstracted perspective of the story. The resolution, however, is

almost bound to be local to the story and feature the characters, the scene, and the

outcome. Consequently, semantic relations between these two elements are likely to

be less marked.
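
The pairwise comparison described above can be sketched in a few lines. This is a minimal illustration only, not the method used in the chapter: it substitutes plain word-count cosine similarity for a proper semantic measure, and the function names (term_vector, cosine, pairwise_cohesion) and the four-section toy story are hypothetical, invented for the example.

```python
import math
import re
from collections import Counter
from itertools import combinations


def term_vector(text):
    """Lower-case the text and count its word tokens."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def pairwise_cohesion(sections):
    """Map each pair of section names to the similarity of their texts."""
    vectors = {name: term_vector(text) for name, text in sections.items()}
    return {
        (first, second): cosine(vectors[first], vectors[second])
        for first, second in combinations(vectors, 2)
    }


# Hypothetical toy narrative, invented for illustration only.
toy_sections = {
    "orientation": "Anna and her dog lived in a small village by the sea.",
    "evaluation": "Losing something you love teaches patience and hope.",
    "resolution": "Anna found her dog on the beach and walked home.",
    "coda": "Anna and her dog still live in the village by the sea.",
}

cohesion = pairwise_cohesion(toy_sections)
# List section pairs from most to least similar.
for pair, score in sorted(cohesion.items(), key=lambda item: -item[1]):
    print(pair, round(score, 2))
```

In this toy case the textually distant orientation and coda share characters and setting and so score higher than the adjacent evaluation and resolution, which is the pattern the narrative example anticipates. Raw word overlap is a crude proxy; it misses related words that are not identical.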

By forming a picture of the degree to which textual parts inter-relate, we can

build a representation of the structure of the texts, a prototypical model that we

call the textual signature. Such a signature stands to serve students and researchers

alike. For students, their work can be analyzed to see the extent to which their paper

reflects a prototypical model. Specifically, a parts analysis may help students to see

that sections of their papers are under- or over-represented in terms of the global

cohesion. For researchers, a text-type signature should help significantly in mining

for appropriate texts. For example, the first ten web sites from a Google search for a

text about cohesion (featuring the combined keywords of comprehension, cohesion,

coherence, and referential) yielded papers from the fields of composition theory,

English as a foreign language, and cognitive science, not to mention a disparate array

of far less academic sources. While the specified keywords that were entered may

have occurred in each of the retrieved items, the organization of the parts of the

retrieved papers (and their inter-relatedness) would differ. Knowing the signatures

that distinguish the text types would help researchers to locate more effectively

the kind of resources that they require. A further possible benefit of textual

signatures involves Question Answering (QA) systems [45, 52]. Given a question and a

large collection of texts (often in gigabytes), the task in QA is to draw a list of short

answers (the length of a sentence) to the question from the collection. The typical

architecture of a modern QA system includes three subsystems: question processing,

paragraph retrieval, and answer processing. Textual signatures may be able to

reduce the search space in the paragraph retrieval stage by identifying more likely

candidates.

7.5 Latent Semantic Analysis

To assess the inter-relatedness of text sections we used latent semantic analysis

(hereafter, LSA). An extensive review of the procedures and computations involved

in LSA is available in Landauer and Dumais [32] and Landauer et al. [33]. For this
