different kinds of texts feature different quantities of features. For example, Biber
[2] identified if clauses and singular personal pronoun use as key predictors in
distinguishing British from American English. Louwerse et al. [37] used cohesion
scores generated from Coh-Metrix to distinguish both spoken from written texts and
narratives from non-narratives. And Stamatatos, Fakotakis, and Kokkinakis [50] used
a number of style markers, including punctuation features and frequencies of verb
and noun phrases, to distinguish between the authors of a variety of newspaper
columns.
Clearly, discriminating texts by treating them as homogeneous wholes has a good
track record. However, texts tend to be heterogeneous, and treating them as such
may substantially increase the power of corpus analyses.
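To make the whole-text approach concrete, the sketch below counts a handful of simple style markers over an entire text and returns them as a single feature vector. The particular markers (punctuation rates, singular personal pronouns, if clauses) are illustrative stand-ins for the richer feature sets used in the studies cited above, not a reconstruction of any one of them.

```python
# A minimal sketch of treating a text as a homogeneous whole: count a few
# style markers over the entire text and return one feature vector.
# The markers chosen here are illustrative assumptions, not the exact
# feature sets of Biber [2], Louwerse et al. [37], or Stamatatos et al. [50].
import re

SINGULAR_PRONOUNS = {"i", "me", "my", "mine", "myself"}

def style_features(text: str) -> dict[str, float]:
    """Per-1000-word rates of a few simple style markers."""
    words = re.findall(r"[A-Za-z']+", text)
    n = max(len(words), 1)
    return {
        "commas_per_1000": 1000 * text.count(",") / n,
        "semicolons_per_1000": 1000 * text.count(";") / n,
        "singular_pronouns_per_1000":
            1000 * sum(w.lower() in SINGULAR_PRONOUNS for w in words) / n,
        "if_clauses_per_1000":
            1000 * len(re.findall(r"\bif\b", text.lower())) / n,
    }
```

Feature vectors of this kind, computed once per text, are what allow a classifier to separate authors, genres, or dialects when the text is treated as an undifferentiated whole.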
The parts of a text serve the textual whole either by function or by form. In
terms of function, Propp [48] identified that texts can be composed of fundamental
components, fulfilled by various characters performing set functions. In terms of
form, numerous theories of text structure have demonstrated how textual elements
are inter-related [24, 25, 31, 40]. Labov's narrative theory, to take one example,
featured six key components: the abstract (a summary), the orientation (the cast of
characters, the scene, and the setting), the action (the problem, issue, or action), the
evaluation (the story's significance), the resolution (what happens, the denouement),
and the coda (tying up loose ends, moving to the present time and situation).
Unfortunately for text researchers, the identification of the kinds of discourse
markers described above has proven problematic: such markers are often absent or
ambiguous, and attempts to exploit them have met with limited success [39]. This is
not to say that there has been no success at all. Morris and Hirst [46], for example,
developed an algorithm that attempted to uncover a hierarchical structure of
discourse based on lexical chains. Although their algorithm was only manually
tested, the evidence from their study, suggesting that text is structurally identifiable
through themes marked by chains of similar words, supports the view that the
elements of heterogeneous texts are identifiable. Hearst [23] developed this idea
further by attempting to segment expository texts into topically related parts. Like
Morris and Hirst [46], Hearst used term repetition as an indicator of topically
related parts. The output of her method is a linear succession of topics, with topics
able to extend over more than one paragraph. Hearst's algorithm is fully
implementable and was tested on magazine articles against human judgments, with
reported precision and recall of around 60%, meaning that about 60% of the topic
boundaries identified in the text are correct (precision) and about 60% of the true
boundaries are identified (recall).
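A minimal sketch of this term-repetition approach is given below. Note that Hearst's actual TextTiling compares fixed-size token blocks and places boundaries at valleys in a smoothed depth score; the adjacent-paragraph cosine comparison and fixed threshold here are simplifying assumptions, as are the function names and the toy evaluation.

```python
# A minimal sketch in the spirit of Hearst's [23] term-repetition-based
# segmentation. Simplified: real TextTiling uses fixed-size token blocks
# and depth-score valleys rather than a per-paragraph threshold.
import re
from collections import Counter
from math import sqrt

def tokenize(block: str) -> Counter:
    """Lowercase term frequencies for one block of text."""
    return Counter(re.findall(r"[a-z']+", block.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm = (sqrt(sum(v * v for v in a.values()))
            * sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def boundaries(paragraphs: list[str], threshold: float = 0.1) -> list[int]:
    """Propose a topic boundary wherever adjacent paragraphs share few terms."""
    vectors = [tokenize(p) for p in paragraphs]
    return [i + 1 for i in range(len(vectors) - 1)
            if cosine(vectors[i], vectors[i + 1]) < threshold]

def precision_recall(found: set[int], true: set[int]) -> tuple[float, float]:
    """Precision: fraction of proposed boundaries that are correct.
    Recall: fraction of true boundaries that were proposed."""
    hits = len(found & true)
    return (hits / len(found) if found else 0.0,
            hits / len(true) if true else 0.0)
```

The evaluation arithmetic works as reported above: if the method proposes ten boundaries, six of which coincide with the ten boundaries marked by human judges, then both precision and recall are 6/10 = 0.6, matching the roughly 60% figures reported for Hearst's evaluation.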
The limited success in identifying textual segments may be the result of searching
for a reliable fine-grained analysis before a coarser grain has first been established.
For example, a coarser approach acknowledges that texts have easily identifiable
beginnings, middles, and ends, and these parts of a text, or at least a sample from
them, are not at all difficult to locate. Indeed, textual analysis using such parts
has proved quite productive. For example, Burrows [6] found that the introduction
sections of texts, rather than the texts as wholes, allowed authorship to be
significantly distinguished. And McCarthy, Lightman, et al. [41] divided high-school
science and history textbook chapters into sections of beginnings, middles, and ends,
finding that reading difficulty scores rose with significant regularity across these
sections as a chapter progressed.
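The coarse-grained division itself is trivial to implement, which is part of its appeal. The sketch below splits a chapter into equal thirds by word count and scores each third; the equal-thirds split and the crude difficulty proxy (mean sentence length times mean word length) are assumptions for illustration only, since McCarthy, Lightman, et al. [41] used Coh-Metrix-style readability measures rather than this proxy.

```python
# A minimal sketch of the coarse-grained approach: split a chapter into
# beginning, middle, and end thirds and score each. The equal-thirds split
# and the crude difficulty proxy are illustrative assumptions, not the
# Coh-Metrix measures used by McCarthy, Lightman, et al. [41].
import re

def split_thirds(text: str) -> dict[str, str]:
    """Divide a text into beginning, middle, and end by word count."""
    words = text.split()
    third = len(words) // 3
    return {"beginning": " ".join(words[:third]),
            "middle": " ".join(words[third:2 * third]),
            "end": " ".join(words[2 * third:])}

def difficulty_proxy(section: str) -> float:
    """Crude readability proxy: mean sentence length times mean word length."""
    sentences = [s for s in re.split(r"[.!?]+", section) if s.strip()]
    words = section.split()
    if not sentences or not words:
        return 0.0
    mean_sentence_len = len(words) / len(sentences)
    mean_word_len = sum(len(w) for w in words) / len(words)
    return mean_sentence_len * mean_word_len

# Example use: score each third of a chapter held in chapter_text.
# scores = {name: difficulty_proxy(part)
#           for name, part in split_thirds(chapter_text).items()}
```

Under the finding reported above, scores computed this way would be expected to rise from the beginning section to the end section of a chapter.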
If we accept that texts are composed of parts, and that the text (as a whole)
is dependent upon the presence of each part, then we can form the hypothesis that