12.3 Linguistic Processing with sed
The stream editor sed is the ideal tool to make replacements in texts. This can be
used to mark, isolate, rearrange and replace strings and string patterns in texts. In
this section, we shall exploit sed's capabilities to present a number of small useful
processing devices for linguistic computing which range from preprocessing devices
to grammatical analyzers. All of these applications are essentially based upon simple
substitution rules.
In our philosophy of text processing, the text itself becomes a program, e.g.,
through the introduction of certain markers, and this program directs the actions of
the UNIX programs that act on it in a pipe.
12.3.1 Overview of sed Programming
A sed program operates on a file line by line. Roughly speaking, every line of the
input file is stored in a buffer called the pattern space and is worked on therein by
every command line of the entire sed program from top to bottom. This is called
a cycle. Each sed operator that is applied to the content of the pattern space may
alter it. In that case, the previous version of the content of the pattern space is lost.
Subsequent sed operators are always applied to the current content of the pattern
space and not the original input line. After the cycle is over, the resulting pattern
space is printed to output, i.e., the output file or the next process in the
UNIX pipe mechanism. Lines that no command alters are consequently copied to
output unchanged by sed.
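The cycle described above can be illustrated with a minimal sketch. The function name cycle_demo and the sample input are assumptions made for illustration only. The second command operates on the result of the first, and a line that matches neither pattern is passed through untouched.

```shell
#!/bin/sh
# Hypothetical helper: two substitution commands applied in order in one cycle.
# The second command sees the current pattern space (the output of the first),
# not the original input line.
cycle_demo() {
    sed 's/cat/dog/
s/dog/bird/'
}

# First line: cat -> dog -> bird. Second line: no command matches, copied as is.
printf 'one cat\nno match here\n' | cycle_demo
```

Here the first input line ends up containing bird rather than dog, because the intermediate content of the pattern space is overwritten within the same cycle.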
Substitution Programs
The simplest and most commonly used sed programs are short substitution
programs. The following example shows a program that replaces every instance of
the patterns thing and NEWLINE (which match the strings thing and NEWLINE) in a
file [8] by NOUN and a newline character, respectively:
#!/bin/sh
sed 's/thing/NOUN/g
s/NEWLINE/\
/g' "$1"
Explanation: The setup of the entire sed program and the two substitution
commands of this program are very similar to the example in section 12.2.1. The
first sed command s/thing/NOUN/g consists of four parts: (1) s is the sed operator
used and stands for “substitute.” (2) thing is the pattern that is to be substituted. A
detailed listing of legal patterns in sed substitution commands is given in Appendix
A.1. (3) NOUN is the replacement for the pattern. (4) The g means “globally.” Without
the g at the end, only the first occurrence of the pattern in each line would be replaced.
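The effect of the g flag can be checked with a small sketch. The sample line and the variable names first_only and all are assumptions chosen for illustration.

```shell
#!/bin/sh
# Sample input line containing the pattern twice (illustrative only).
line='thing one, thing two'

# Without g: only the first occurrence in the line is replaced.
first_only=$(printf '%s\n' "$line" | sed 's/thing/NOUN/')

# With g: every occurrence in the line is replaced.
all=$(printf '%s\n' "$line" | sed 's/thing/NOUN/g')

printf '%s\n%s\n' "$first_only" "$all"
```

The first printed line still contains one thing, while the second contains none.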
The second substitution command shows the important technique of how to
place newline characters at specific places in text. This can be used to break pieces
of text into fragments on separate lines for further separate processing. The
trailing backslash \ is part of the sed program, and nothing follows it on its line. See the
[8] As before, the symbol/string $1 stands for the filename.
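The newline technique described above can be tried on a single line of input. This is only a sketch: the function name split_at_marker and the sample text are assumptions, and the marker NEWLINE is taken from the example program. Note that the line after the backslash must begin in the first column, since any leading blanks would become part of the replacement.

```shell
#!/bin/sh
# Hypothetical helper: break a line into fragments at every NEWLINE marker.
# The backslash is the last character on its line; the replacement is the
# newline that immediately follows it.
split_at_marker() {
    sed 's/NEWLINE/\
/g'
}

# One input line with two markers yields three output lines.
printf 'first clauseNEWLINEsecond clauseNEWLINEthird clause\n' | split_at_marker
```

Each fragment now sits on its own line and can be handled separately by the next program in a pipe.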
 