Our first objective was to merge both texts line-by-line and to extract from
every pair of lines the relevant information. Merging was achieved through pattern
matching, observing that most (though not all) lines correspond one-to-one in both
sources. Kanji were identified through use of the sed operator l. 19 As outlined in
Section 12.3.5, control sequences were eliminated from the RTF file, but the
information that some of them represent was retained.
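To illustrate the idea (a sketch only, not the actual script used in this work), the
l command can be applied to every line so that non-printing or multibyte kanji bytes
become visible as escape codes; the file name kanji.txt is merely a placeholder:

    # Sketch: print each input line in unambiguous form so that
    # non-printing and multibyte (kanji) bytes appear as escape codes.
    sed -n 'l' kanji.txt

    # Lines containing bytes outside the printable ASCII range can then
    # be selected with a negated character class (encoding-dependent):
    sed -n '/[^ -~]/p' kanji.txt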
After the source files had been properly cleaned by sed and the different pieces from
the two sources identified (tagged), awk was used to generate a format from which all
sorts of applications are now possible. The source file of [37] is typeset regularly enough
that the three categories of entry, radical, kanji and compound, can be identified
by pattern matching. In fact, a small grammar was defined for the structure of the
source file of [37] and verified with awk. By simply counting all units, an index for the
dictionary, which does not exist in [37], can now be generated. This makes it possible
to locate compounds by searching the database, something that was previously impossible.
In addition, all relevant pieces of data in the generated format can be picked out by awk
as fields and framed with, e.g., Prolog syntax. It is also easy to generate, e.g.,
English→kanji or English→kun dictionaries from this kanji→on/kun→English dictionary,
using the UNIX command sort and a rearrangement of fields. In addition, it is easy to
reformat [37] into proper jlatex format. This could be used to re-typeset the entire
dictionary.
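As an illustration of the kind of awk program involved (a sketch under assumed
conventions, not the program actually used), suppose the cleaned material has been
tagged into one tab-separated entry per line, with a category code in the first field.
A short awk script can then number the units hierarchically to produce the index
mentioned above; the file names and field layout are hypothetical:

    # Sketch only: assume 'dict.tagged' holds one entry per line in the form
    #     CATEGORY <TAB> KANJI <TAB> ON/KUN <TAB> ENGLISH
    # where CATEGORY is R (radical), K (kanji) or C (compound).
    awk -F'\t' '
        $1 == "R" { r++; n = 0 }     # new radical: reset kanji counter
        $1 == "K" { n++; c = 0 }     # new kanji under the current radical
        $1 == "C" { c++ }            # compound under the current kanji
        { printf "%d.%d.%d\t%s\t%s\t%s\n", r, n, c, $2, $3, $4 }
    ' dict.tagged > dict.indexed

Once such an indexed, field-oriented file exists, a reversed English→kanji listing is,
for example, a matter of rearranging fields and sorting:

    awk -F'\t' '{ print $4 "\t" $2 }' dict.indexed | sort

(all field positions and file names above are illustrative assumptions).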
12.6 Conclusion
In the preceding exposition, we have given a short but detailed introduction to sed
and awk and their applications to language analysis. We have shown that developing
sophisticated tools with sed and awk is easy even for the computer novice. In addition,
we have demonstrated how to write customized filters in very few lines of code that
can be combined in the UNIX environment to create powerful processing devices,
especially useful in language research.
Applications include searching any machine-readable text for words, phrases, and
sentences that contain interesting or critical grammatical patterns, for research
and teaching purposes. We have also shown how certain search or tagging programs
can be generated automatically from simple word lists. Part of the search routines
outlined above can be used to assist the instructor of English as a second language
through automated management of homework submitted by students via electronic
mail [39]. This management includes partial evaluation, correction, and answering of
the homework by machine, using programs written in sed and/or awk. In that regard,
we have also shown how to implement a punctuation checker.
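To give the flavour of such a checker (a sketch only, not the version developed
earlier in this chapter; the marker text, the rules, and the file name homework.txt
are illustrative choices), two sed substitutions suffice to flag the most common
spacing errors around punctuation marks:

    # Sketch: mark a space before punctuation and a missing space after it.
    sed -e 's/ \([,.;:?!]\)/<<ERR: space before punctuation>>\1/g' \
        -e 's/\([,.;:?!]\)\([A-Za-z]\)/\1<<ERR: missing space>> \2/g' \
        homework.txt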
Another class of applications is the use of sed and awk in concordancing. A
few lines of code can substitute for an entire commercial programming package. We
have shown how to duplicate in a simple way searches performed by large third-party
packages. Our examples include concordancing for pairs of words and other, more
general patterns, as well as judging the readability of a text. The results of such
searches can be sorted and displayed by machine for subsequent human analysis. Another possibility
19 The sed operator l lists the pattern space on the output in an unambiguous form.
In particular, non-printing characters are spelled in two-digit ascii and long lines
are folded.