12
Linguistic Computing with UNIX Tools
Lothar M. Schmitt, Kiel Christianson, and Renu Gupta
12.1 Introduction
This chapter outlines applications to language analysis that open up
through the combined use of two simple yet powerful programming languages with
particularly short descriptions: sed and awk. We shall demonstrate how these two
UNIX¹ tools can be used to implement small, useful, and customized applications
ranging from text formatting and text transformation to sophisticated linguistic
computing. The user thus becomes independent of sometimes bulky software packages
that may be difficult to customize for particular purposes.
To demonstrate the point, let us list two lines of code which rival an application
of “The Oxford Concordance Program OCP2” [21]. Using OCP2, [28] conducted
an analysis of collocations occurring with “between” and “through.” The following
simple UNIX pipe (cf. section 12.2.3) performs a similar analysis:
#!/bin/sh
leaveOnlyWords "$1" | oneItemPerLine - | mapToLowerCase - | context - 20 |
awk '(($1~/^between$/)||($(NF)~/^between$/))&&($0~/ through /)' -
Each of the programs in the above pipe is explained in detail in this chapter
and contains 3 to 12 essential² lines of code.
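To make the data flow of the pipe concrete before the individual programs are introduced, the following is a minimal sketch in which each stage is replaced by a hypothetical stand-in written as a shell function. The function bodies are assumptions made for illustration only and are not the implementations developed later in this chapter. In particular, this leaveOnlyWords ignores abbreviations, and this context is assumed to print every window of n consecutive words on one line. The wrapper name collocations is also invented here, and the stand-ins read standard input instead of taking a file name or "-" as an argument.

```shell
#!/bin/sh
# Hypothetical stand-ins for the helper programs named in the pipe above.
# They only illustrate the data flow; the chapter's own versions differ.

# Replace every non-letter by a blank (crude: no handling of abbreviations).
leaveOnlyWords() { sed 's/[^A-Za-z]/ /g'; }

# Print each blank-separated item on a line of its own.
oneItemPerLine() { awk '{ for (i = 1; i <= NF; i++) print $i }'; }

# Fold upper-case letters to lower case.
mapToLowerCase() { tr '[:upper:]' '[:lower:]'; }

# Assumed behaviour: read one word per line and print every window of
# n consecutive words as a single blank-separated line.
context() {
  awk -v n="$1" '
    { w[NR] = $1 }                       # remember all words in order
    END {
      for (i = 1; i + n - 1 <= NR; i++) {
        line = w[i]
        for (j = i + 1; j < i + n; j++) line = line " " w[j]
        print line
      }
    }'
}

# Keep windows that start or end with "between" and contain "through"
# as an interior word (the same awk condition as in the pipe above).
collocations() {
  leaveOnlyWords | oneItemPerLine | mapToLowerCase | context "$1" |
  awk '(($1~/^between$/)||($(NF)~/^between$/))&&($0~/ through /)'
}

# Demo with a window of 6 words; prints: between the trees and through the
printf 'We walked between the trees and through the field.\n' | collocations 6
```

Because " through " must be surrounded by blanks, a window in which "through" is the first or the last word is not reported; with a window of 20 words this matters only for occurrences at the very edge of the window.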
This chapter is a continuation of [40], where a short, more programming-oriented
tutorial introduction to the use of sed and awk for language analysis can be found,
including a detailed listing of all operators. A large number of references to [40],
including mirror listings, can be found on the internet. A recommended alternative
to consulting [40] as a supplement to this chapter is reading the manual pages³ for
sed and awk. In addition, it is recommended (but not necessary) to read [4] and the
introductions to sed and awk in [30].
¹ The term UNIX shall stand in the remainder of this chapter for “UNIX or
LINUX.”
² The procedure leaveOnlyWords is lengthy because it contains one trivial line of
code per single-period abbreviation such as “Am.”. It can be computer-generated
using sed from a list of such abbreviations (cf. section 12.3.4).
³ Type man sed and man awk in a UNIX terminal window under any shell.
 