12.3 Linguistic Processing with sed
The stream editor sed is the ideal tool to make replacements in texts. It can be used to mark, isolate, rearrange and replace strings and string patterns in texts. In this section, we shall exploit sed's capabilities to present a number of small useful processing devices for linguistic computing which range from preprocessing devices to grammatical analyzers. All of these applications are essentially based upon simple substitution rules.
In our philosophy of text processing, the text itself becomes, e.g., through the introduction of certain markers, a program that directs the actions of the UNIX programs that act on it in a pipe.
12.3.1 Overview of sed Programming
A sed program operates on a file line by line. Roughly speaking, every line of the input file is stored in a buffer called the pattern space and is worked on therein by every command line of the entire sed program from top to bottom. This is called a cycle. Each sed operator that is applied to the content of the pattern space may alter it. In that case, the previous version of the content of the pattern space is lost. Subsequent sed operators are always applied to the current content of the pattern space and not the original input line. After the cycle is over, the resulting pattern space is printed/delivered to output, i.e., the output file or the next process in the UNIX pipe mechanism. Lines that were never worked on are consequently copied to output by sed unchanged.
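The cycle described above can be observed with a short pipeline. The following is a minimal sketch; the sample input lines and the variable name result are assumptions chosen for illustration and do not come from the original text.

```shell
#!/bin/sh
# Two commands are applied in order to every input line. The second command
# works on the pattern space as left by the first, not on the original line.
result=$(printf 'a thing\nplain line\n' |
  sed 's/thing/NOUN/
s/NOUN/VERB/')

# The first line passes through both substitutions; the second line matches
# neither pattern and is copied to output as it was read.
printf '%s\n' "$result"
```

Because NOUN only exists in the pattern space after the first command has run, the second substitution can succeed only if it sees the altered content.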
Substitution Programs
The simplest and most commonly used sed programs are short substitution programs. The following example shows a program that replaces all instances in a file[8] of the patterns thing and NEWLINE, which match the strings thing and NEWLINE, by NOUN and a newline character, respectively:
#!/bin/sh
# Replace every "thing" by NOUN and every NEWLINE marker by a real line break.
sed 's/thing/NOUN/g
s/NEWLINE/\
/g' "$1"
Explanation: The setup of the entire sed program and the two substitution commands of this program are very similar to the example in section 12.2.1. The first sed command s/thing/NOUN/g consists of four parts: (1) s is the sed operator used and stands for “substitute.” (2) thing is the pattern that is to be substituted. A detailed listing of legal patterns in sed substitution commands is given in Appendix A.1. (3) NOUN is the replacement for the pattern. (4) The g means “globally.” Without the g at the end, only the first occurrence of the pattern would be replaced in a line.
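The effect of the g flag can be checked directly. This is a small sketch under assumed input; the sample line and the variable names first_only and all are hypothetical and serve only to compare the two forms of the command.

```shell
#!/bin/sh
# A line that contains the pattern twice.
line='thing after thing'

# Without g: only the first occurrence in the line is replaced.
first_only=$(printf '%s\n' "$line" | sed 's/thing/NOUN/')

# With g: every occurrence in the line is replaced.
all=$(printf '%s\n' "$line" | sed 's/thing/NOUN/g')

printf '%s\n%s\n' "$first_only" "$all"
```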
The second substitution command shows the important technique of how to place newline characters at specific places in text. This can be used to break pieces of text into fragments on separate lines for further separate processing. Nothing follows the trailing backslash \, which is part of the sed program. See the
[8] As before, the symbol/string $1 stands for the filename.
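The newline technique from the second substitution command can be tried on a single line of input. The sketch below assumes a made-up sample string and the hypothetical variable names fragments and line_count; it only illustrates how a marker is turned into a line break.

```shell
#!/bin/sh
# The replacement part is a backslash immediately followed by a line break.
# Nothing, not even a space, may come between the backslash and the line end.
fragments=$(printf 'first clauseNEWLINEsecond clause\n' | sed 's/NEWLINE/\
/g')

printf '%s\n' "$fragments"

# Count the resulting lines; tr strips the padding some wc versions add.
line_count=$(printf '%s\n' "$fragments" | wc -l | tr -d ' ')
printf 'lines: %s\n' "$line_count"
```

Each fragment now sits on its own line, so a later program in the pipe can process the fragments one at a time.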