Our first objective was to merge both texts line-by-line and to extract from
every pair of lines the relevant information. Merging was achieved through pattern
matching, observing that most (though not all) lines correspond one-to-one in both
sources. Kanji were identified through use of the sed operator l. 19 As outlined in
Section 12.3.5, control sequences were eliminated from the RTF file, but the
information that some of them represent was retained.
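To illustrate the idea (a sketch only, not the actual script used in this work), the
l command can be applied to every line so that non-printing or multibyte kanji bytes
become visible as escape codes; the file name kanji.txt is merely a placeholder:

    # Sketch: print each input line in unambiguous form so that
    # non-printing and multibyte (kanji) bytes appear as escape codes.
    sed -n 'l' kanji.txt

    # Lines containing bytes outside the printable ASCII range can then
    # be selected with a negated character class (encoding-dependent):
    sed -n '/[^ -~]/p' kanji.txt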
After the source files had been properly cleaned by sed and the different pieces from
the two sources identified (tagged), awk was used to generate a format from which all
sorts of applications are now possible. The source file of [37] is typeset regularly enough
that the three categories of entry, radical, kanji and compound, can be identified
by pattern matching. In fact, a small grammar was defined for the structure of the
source file of [37] and verified with awk. By simply counting all units, an index for the
dictionary, which does not exist in [37], can now be generated. This makes it possible
to locate compounds by searching the database, something that was previously impossible.
In addition, all relevant pieces of data in the generated format can be picked out by awk
as fields and framed with, e.g., Prolog syntax. It is also easy to generate, e.g.,
English→kanji or English→kun dictionaries from this kanji→on/kun→English dictionary,
using the UNIX command sort and a rearrangement of fields. In addition, it is easy to
reformat [37] into proper jlatex format. This could be used to re-typeset the entire
dictionary.
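As an illustration of the kind of awk program involved (a sketch under assumed
conventions, not the program actually used), suppose the cleaned material has been
tagged into one tab-separated entry per line, with a category code in the first field.
A short awk script can then number the units hierarchically to produce the index
mentioned above; the file names and field layout are hypothetical:

    # Sketch only: assume 'dict.tagged' holds one entry per line in the form
    #     CATEGORY <TAB> KANJI <TAB> ON/KUN <TAB> ENGLISH
    # where CATEGORY is R (radical), K (kanji) or C (compound).
    awk -F'\t' '
        $1 == "R" { r++; n = 0 }     # new radical: reset kanji counter
        $1 == "K" { n++; c = 0 }     # new kanji under the current radical
        $1 == "C" { c++ }            # compound under the current kanji
        { printf "%d.%d.%d\t%s\t%s\t%s\n", r, n, c, $2, $3, $4 }
    ' dict.tagged > dict.indexed

Once such an indexed, field-oriented file exists, a reversed English→kanji listing is,
for example, a matter of rearranging fields and sorting:

    awk -F'\t' '{ print $4 "\t" $2 }' dict.indexed | sort

(all field positions and file names above are illustrative assumptions).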
12.6 Conclusion
In the preceding exposition, we have given a short but detailed introduction to sed
and awk and their applications to language analysis. We have shown that developing
sophisticated tools with sed and awk is easy even for the computer novice. In addition,
we have demonstrated how to write customized filters in very few lines of code that
can be combined in the UNIX environment to create powerful processing devices,
especially useful in language research.
Applications include searching any machine-readable text for words, phrases, and
sentences that contain interesting or critical grammatical patterns, for research
and teaching purposes. We have also shown how certain search or tagging programs
can be generated automatically from simple word lists. Part of the search routines
outlined above can be used to assist the instructor of English as a second language
through automated management of homework submitted by students via electronic
mail [39]. This management includes partial evaluation, correction, and answering of
the homework by machine, using programs written in sed and/or awk. In that regard,
we have also shown how to implement a punctuation checker.
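To give the flavour of such a checker (a sketch only, not the version developed
earlier in this chapter; the marker text, the rules, and the file name homework.txt
are illustrative choices), two sed substitutions suffice to flag the most common
spacing errors around punctuation marks:

    # Sketch: mark a space before punctuation and a missing space after it.
    sed -e 's/ \([,.;:?!]\)/<<ERR: space before punctuation>>\1/g' \
        -e 's/\([,.;:?!]\)\([A-Za-z]\)/\1<<ERR: missing space>> \2/g' \
        homework.txt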
Another class of applications is the use of sed and awk in concordancing. A
few lines of code can substitute for an entire commercial programming package. We
have shown how to duplicate in a simple way searches performed by large third-party
packages. Our examples include concordancing for pairs of words and other, more
general patterns, as well as judging the readability of a text. The results of such
searches can be sorted and displayed by machine for subsequent human analysis. Another possibility
19 The sed operator l lists the pattern space on the output in an unambiguous form.
In particular, non-printing characters are spelled in two-digit ascii and long lines
are folded.