/__[!?.]__$/ { print "\n" }' | ...
Explanation: The program merges lines that are not marked as sentence endings
by setting the output record separator ORS to a blank. If a line-end is marked as
sentence-end, then an extra newline character is printed.
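As a self-contained illustration of this merging step (the `__X__` marker convention is taken from the program fragment above; the sample input is invented):

```shell
# Demo of the merging step: ORS=" " joins lines with a blank, and a
# marked sentence-end forces an extra newline. Sample input is invented.
printf 'This line continues\nhere__.__\nA second sentence__?__\n' |
awk 'BEGIN { ORS = " " }            # join lines with a blank, not "\n"
     { print }                      # every line is printed, then ORS
     /__[!?.]__$/ { print "\n" }'   # marked sentence-end: line break
```

Each marked line now ends a physical output line, so later stages of the pipe can treat one line as one sentence.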
Next, we merge every line that starts, e.g., with a lower-case word into its predecessor, since such a start indicates that we have identified a sentence within a sentence. Finally, the markers are removed and the “hidden” things are restored in the pipe. By convention, we deliberately accept that an abbreviation does not terminate a sentence.
Overall, our procedure places two sentences on a single line in rare cases. Nevertheless, this program is sufficiently accurate for the objectives outlined above in (1) and (2). Note that it is easy to scan the output for lines that possibly contain two sentences and to inspect them subsequently in a “diagnostic” file.
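The lower-case merge can be sketched as follows; this is a simplified stand-in for the corresponding step of the pipe, with invented sample input:

```shell
# Hedged sketch of the second merging step: a line beginning with a
# lower-case letter is appended to its predecessor. Input is invented.
printf 'He said hello.\nand so it went.\nA new sentence.\n' |
awk '
  NR == 1  { buf = $0; next }
  /^[a-z]/ { buf = buf " " $0; next }   # continuation: append to buffer
           { print buf; buf = $0 }      # new sentence: flush the buffer
  END      { if (buf != "") print buf }'
```

The buffer is flushed only when a line starts in a way that looks like a genuine sentence beginning.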
Application: The string “and so on” is extremely common in the writing of Japanese learners of English, and most teachers object to it. From the examples listed above, such as printPredecessorBecause, it is clear how to connect the output of the sentence finder with a program that searches for and so on.
In [46], 121 very common mistakes made by Japanese students of English are documented. We point out to the reader that a full 75 of these can be located in student writing using the simplest of string-search programs, such as those introduced above.
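Once each sentence occupies one line, the search itself is trivial; a minimal sketch, with invented sample input standing in for the sentence finder's output:

```shell
# Minimal sketch: with one sentence per line, flagging a suspect phrase
# is a single grep. The input here is invented sample data.
printf 'I study math and so on.\nThis sentence is fine.\n' |
grep -n 'and so on'      # print matching sentences with line numbers
```

The line numbers let a teacher locate each offending sentence in the student's text.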
12.5.3 Readability of Texts
Hoey [22, pp. 35-48, 231-235] points out that the more cohesive a foreign-language text is, the easier it is for learners of the language to read. One method Hoey proposes for judging relative cohesion, and thus readability, is simply to count the number of repeated content words in the text (repetition being one of the main cohesive elements of texts in many languages). Hoey concedes, though, that doing this “rough and ready analysis” [22, p. 235] by hand is tedious work, and impractical for texts of more than 25 sentences.
An analysis like this is perfectly suited for the computer, however. In principle, any on-line text could be analyzed in terms of readability based on repetition. One can use countWordFrequencies or a similar program to determine word frequencies over an entire text or “locally.” The entities searched through “locally” could be paragraphs or all blocks of, e.g., 20 lines of text. The latter procedure would define a flow-like concept that could be called “local context.” Words that appear at least once with high local frequency are understood to be important. A possible extension of countWordFrequencies is to use spell -x to identify derived words such as Japanese from Japan. Such a procedure aids teachers in deciding which vocabulary items to focus on when assigning students to read the text, i.e., the most frequently occurring ones, ordered by their appearance in the text.
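The whole-text variant can be sketched with standard tools; this is a stand-in for countWordFrequencies, whose exact form is assumed rather than quoted:

```shell
# Hedged sketch of a whole-text word-frequency count in the spirit of
# countWordFrequencies; the original program's details are assumed, and
# the sample input is invented.
printf 'Tea for two, and two for tea\n' |
tr -cs 'A-Za-z' '\n' |     # one word per line
tr 'A-Z' 'a-z' |           # fold case
sort | uniq -c | sort -rn  # most frequent words first
```

Replacing the printf with input redirection from a file turns this into the whole-text analysis described above.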
Example: The next program implements a search for words that are locally repeated (i.e., within a string of 200 words) in a text. More precisely, we determine the frequencies of the words in a file $1 that occur and are then repeated at least three times within some string of 200 consecutive words. The value 200 is an upper bound for the analysis performed in [22, pp. 35-48].
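The local-repetition search described above can be sketched as a sliding window over the word stream; the window size W and threshold K are the parameters just named, while the sample input and the output format are assumptions:

```shell
# Hedged sketch: report every word that occurs at least K=3 times within
# some window of W=200 consecutive words. The book's version reads the
# file $1; here an invented sample is piped in instead.
printf 'One fish two fish red fish blue bird\n' |
tr -cs 'A-Za-z' '\n' |               # one word per line
tr 'A-Z' 'a-z' |                     # fold case
awk -v W=200 -v K=3 '
  { win[NR] = $0; cnt[$0]++          # add word to the current window
    if (NR > W) { cnt[win[NR-W]]--; delete win[NR-W] }   # drop oldest
    if (cnt[$0] >= K && !seen[$0]) { seen[$0] = 1; print }
  }'
```

On the sample input only fish reaches the threshold, so it is the only word reported; each qualifying word is printed once, at the moment its count first reaches K within a window.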