/__[!?.]__$/ { print "\n" }' | ...
Explanation: The program merges lines that are not marked as sentence endings
by setting the output record separator ORS to a blank. If a line-end is marked as
sentence-end, then an extra newline character is printed.
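As a self-contained illustration of this merging step (the `__X__` marker convention is taken from the program fragment above; the sample input is invented):

```shell
# Demo of the merging step: ORS=" " joins lines with a blank, and a
# marked sentence-end forces an extra newline. Sample input is invented.
printf 'This line continues\nhere__.__\nA second sentence__?__\n' |
awk 'BEGIN { ORS = " " }            # join lines with a blank, not "\n"
     { print }                      # every line is printed, then ORS
     /__[!?.]__$/ { print "\n" }'   # marked sentence-end: line break
```

Each marked line now ends a physical output line, so later stages of the pipe can treat one line as one sentence.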
Next, we merge every line that starts, e.g., with a lower-case word into its predecessor, since such a start indicates that we have identified a sentence within a sentence. Finally, the markers are removed and the “hidden” things are restored in the pipe. By convention, we deliberately accept that an abbreviation does not terminate a sentence.
Overall, our procedure places two sentences on a single line in rare cases. Nevertheless, this program is sufficiently accurate for the objectives outlined above in (1) and (2). Note that it is easy to scan the output for lines that possibly contain two sentences and to inspect them subsequently in a “diagnostic” file.
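The lower-case merge can be sketched as follows; this is a simplified stand-in for the corresponding step of the pipe, with invented sample input:

```shell
# Hedged sketch of the second merging step: a line beginning with a
# lower-case letter is appended to its predecessor. Input is invented.
printf 'He said hello.\nand so it went.\nA new sentence.\n' |
awk '
  NR == 1  { buf = $0; next }
  /^[a-z]/ { buf = buf " " $0; next }   # continuation: append to buffer
           { print buf; buf = $0 }      # new sentence: flush the buffer
  END      { if (buf != "") print buf }'
```

The buffer is flushed only when a line starts in a way that looks like a genuine sentence beginning.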
Application: The string “and so on” is extremely common in the writing of Japanese learners of English, and most teachers object to it. From the examples listed above, such as printPredecessorBecause, it is clear how to connect the output of the sentence finder with a program that searches for and so on.
In [46], 121 very common mistakes made by Japanese students of English are documented. We point out to the reader that a full 75 of these can be located in student writing using the simplest of string-search programs, such as those introduced above.
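Once each sentence occupies one line, the search itself is trivial; a minimal sketch, with invented sample input standing in for the sentence finder's output:

```shell
# Minimal sketch: with one sentence per line, flagging a suspect phrase
# is a single grep. The input here is invented sample data.
printf 'I study math and so on.\nThis sentence is fine.\n' |
grep -n 'and so on'      # print matching sentences with line numbers
```

The line numbers let a teacher locate each offending sentence in the student's text.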
12.5.3 Readability of Texts
Hoey [22, pp. 35-48, 231-235] points out that the more cohesive a foreign-language text is, the easier it is for learners of the language to read. One method Hoey proposes for judging relative cohesion, and thus readability, is simply to count the number of repeated content words in the text (repetition being one of the main cohesive elements of texts in many languages). Hoey concedes, though, that doing this “rough and ready analysis” [22, p. 235] by hand is tedious work, and impractical for texts of more than 25 sentences.
An analysis like this is perfectly suited for the computer, however. In principle, any on-line text could be analyzed in terms of readability based on repetition. One can use countWordFrequencies or a similar program to determine word frequencies over an entire text or “locally.” The entities searched through “locally” could be paragraphs or all blocks of, e.g., 20 lines of text. The latter procedure would define a flow-like concept that could be called “local context.” Words that appear at least once with high local frequency are understood to be important. A possible extension of countWordFrequencies is to use spell -x to identify derived words such as Japanese from Japan. Such a procedure aids teachers in deciding which vocabulary items to focus on when assigning students to read the text, i.e., the most frequently occurring ones, ordered by their appearance in the text.
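The whole-text variant can be sketched with standard tools; this is a stand-in for countWordFrequencies, whose exact form is assumed rather than quoted:

```shell
# Hedged sketch of a whole-text word-frequency count in the spirit of
# countWordFrequencies; the original program's details are assumed, and
# the sample input is invented.
printf 'Tea for two, and two for tea\n' |
tr -cs 'A-Za-z' '\n' |     # one word per line
tr 'A-Z' 'a-z' |           # fold case
sort | uniq -c | sort -rn  # most frequent words first
```

Replacing the printf with input redirection from a file turns this into the whole-text analysis described above.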
Example: The next program implements a search for words that are locally repeated (i.e., within a string of 200 words) in a text. More precisely, we determine the frequencies of the words in a file $1 that occur and are then repeated at least three times within some string of 200 consecutive words. The value 200 is an upper bound for the analysis performed in [22, pp. 35-48].
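The local-repetition search described above can be sketched as a sliding window over the word stream; the window size W and threshold K are the parameters just named, while the sample input and the output format are assumptions:

```shell
# Hedged sketch: report every word that occurs at least K=3 times within
# some window of W=200 consecutive words. The book's version reads the
# file $1; here an invented sample is piped in instead.
printf 'One fish two fish red fish blue bird\n' |
tr -cs 'A-Za-z' '\n' |               # one word per line
tr 'A-Z' 'a-z' |                     # fold case
awk -v W=200 -v K=3 '
  { win[NR] = $0; cnt[$0]++          # add word to the current window
    if (NR > W) { cnt[win[NR-W]]--; delete win[NR-W] }   # drop oldest
    if (cnt[$0] >= K && !seen[$0]) { seen[$0] = 1; print }
  }'
```

On the sample input only fish reaches the threshold, so it is the only word reported; each qualifying word is printed once, at the moment its count first reaches K within a window.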