Linguistic Computing with UNIX Tools - Natural Language Processing and Text Mining


government agency that typesets and prints the entrance examinations for Japanese

universities. Clearly, if English punctuation rules (i.e., spacing rules) are not taught

explicitly, they will not be learned.

A teacher using an automatic punctuation-correction program such as the one

in [39] described below is able to correct nearly all of the students' punctuation

problems, thus presenting the spacing rules in an inductive, interactive way. A

punctuation-correcting program is one of several tools described in [35].

As a database, we have defined a list of forbidden pairs of characters. This

is achieved by listing the matrix M pertaining to the relation R which is given

by char_1 R char_2 ⇔ “The character sequence char_1 char_2 is forbidden.” During

the setup phase of the system used in [39], the matrix M is translated by an sed

program into a new sed program which scans the essays submitted by students

via electronic mail for mistakes. Examples of forbidden sequences are blank , or ' ? .

These mistakes are marked, and the marked essays are sent back to the individual

students automatically. The translation into a sed program during setup works in

the same way as the generation of an elimination program shown above in Section

12.3.4. The resulting marking program is very similar to markDeterminers. Suffice

it to say that this automated, persistent approach to correcting punctuation has

been an immediate and dramatic success [39].
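
The setup step described above can be sketched roughly as follows. This is a minimal illustration, not the program used in [39]: the file names forbidden.txt and markPunctuation.sed, the one-pair-per-line list (standing in for the matrix M), the three sample pairs, and the << >> marking convention are all assumptions made for this example.

```sh
#!/bin/sh
# Hypothetical working directory and file names (not from the text).
workdir=$(mktemp -d)

# Assumed stand-in for the matrix M: one forbidden two-character
# sequence per line (blank+comma, blank+period, blank+question mark).
printf '%s\n' ' ,' ' .' ' ?' > "$workdir/forbidden.txt"

# Setup phase: one sed program writes another sed program.
#  1st expression: escape characters that are special in a basic regex.
#  2nd expression: wrap each pair into a substitution that brackets it.
sed -e 's|[][\\.*^$/]|\\&|g' \
    -e 's|.*|s/&/<<\&>>/g|' \
    "$workdir/forbidden.txt" > "$workdir/markPunctuation.sed"

# Marking phase: run the generated program over a line of an essay.
printf 'Hello , world .\n' | sed -f "$workdir/markPunctuation.sed"
# prints: Hello<< ,>> world<< .>>
```

Generating the marking program from a list keeps the forbidden pairs in one editable place; adding a rule means adding a line to the list and rerunning the setup step.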

Finally, let us remark that our procedure for identifying mistakes in punctuation

can also be used in analyses of punctuation patterns, frequency, and use, as in [36].

12.5.2 Extracting Sentences

In [39], one of the tools reformats student essays in such a way that entire sentences

are on single lines. Such a format is very useful in two ways:

Goal 1: To select actual student sentences which match certain patterns. The teacher

can then write any number of programs that search for strings identified as

particularly problematic for a given group of students. For example, the words “because”

and “too” are frequently used incorrectly by Japanese speakers of English.

Furthermore, once those strings have been identified, the sentences containing them can

be saved in separate files according to the strings and printed as lists of individual

sentences. Such lists can then be given to students in subsequent lessons dealing

with the problem areas for the students to read and determine whether they are

correct or incorrect, and if incorrect, how to fix them.
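
As an illustration of Goal 1, the following is a minimal sketch that assumes the essays have already been reformatted to one sentence per line. The file names and sample sentences are invented for the example, and grep -w (whole-word matching) is a widely available extension rather than part of the tools described in the text.

```sh
#!/bin/sh
# Hypothetical input: one sentence per line (names and data are made up).
workdir=$(mktemp -d)
printf '%s\n' \
    'I was late because of rain.' \
    'The test was too much difficult.' \
    'He likes tools.' > "$workdir/sentences.txt"

# Save the sentences containing each problem word in a file of their own.
# -i ignores case; -w matches whole words only, so "tools" does not
# count as an occurrence of "too".
for word in because too; do
    grep -i -w -e "$word" "$workdir/sentences.txt" > "$workdir/$word.txt"
done

cat "$workdir/too.txt"
```

Each resulting file can then be printed as a list of individual sentences for use in class.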

Goal 2: To analyze example sentences. One example is to measure the complexity of

grammatical patterns used by students using components such as markDeterminers.

This can be used to show the decrease or increase of certain patterns over time

using special sed-based search programs and, e.g., countFrequencies as well as

Mathematica for display.
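
A rough sketch of how such a change over time might be tabulated is shown here. It is a simple stand-in and not the countFrequencies program referred to above: the function name countPattern, the file names week1.txt and week2.txt, and the sample sentences are assumptions, and the input is again taken to hold one sentence per line.

```sh
#!/bin/sh
# Hypothetical essay files from two different weeks (made-up data).
workdir=$(mktemp -d)
printf '%s\n' 'I stayed home because it rained.' \
    'Because I was tired.' 'I slept.' > "$workdir/week1.txt"
printf '%s\n' 'I stayed home because it rained.' \
    'I was tired, so I slept.' > "$workdir/week2.txt"

# countPattern PATTERN FILE...
# Prints, for each file, its name and the number of sentences (lines)
# that contain PATTERN as a whole word, ignoring case.
countPattern() {
    pattern=$1
    shift
    for f in "$@"; do
        printf '%s %s\n' "$(basename "$f")" \
            "$(grep -c -i -w -e "$pattern" "$f")"
    done
}

countPattern because "$workdir"/week*.txt
```

The two-column output (file name, count) is in a form that can be passed on to a plotting or display program.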

Our procedure for identifying sentences achieves a high level of accuracy without

relying on proper spacing as a cue for sentence division, as does another highly

accurate divider [26].

The following shows part of the implementation of sentence identification in [39]:

#!/bin/sh
hideUnderscore $1 | hideAbbreviations - |
hideNumbers - | adjustBlankTabs - |
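
The listing above shows only part of the pipeline, and the helper programs it calls are not reproduced here. The general idea of hiding periods that do not end a sentence, splitting, and then restoring them can be sketched as below. This is a simplified stand-in under stated assumptions, not the implementation in [39]: the function name splitSentences, the short abbreviation list, and the use of _ and | as temporary placeholders (so the input must not contain those characters) are all choices made for this example.

```sh
#!/bin/sh
# splitSentences: read text on standard input and write one sentence
# per line. Assumes the input contains no "_" or "|" characters.
splitSentences() {
    tr '\n' ' ' |                              # join all input lines
    sed -e 's/Mr\./Mr_/g' \
        -e 's/Dr\./Dr_/g' \
        -e 's/e\.g\./e_g_/g' \
        -e 's/\([0-9]\)\.\([0-9]\)/\1_\2/g' \
        -e 's/\([.?!]\) */\1|/g' |             # mark each sentence end
    tr '|' '\n' |                              # one sentence per line
    sed -e 's/_/./g' -e '/^$/d'                # restore hidden periods
}

printf 'Dr. Smith paid 3.5 yen.\nWas it too\nmuch? Yes!\n' | splitSentences
# prints:
# Dr. Smith paid 3.5 yen.
# Was it too much?
# Yes!
```

Because the split is made at the punctuation mark itself and any following blanks are optional, a sentence boundary such as "yen.Was" with a missing space is still detected.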
