Linguistic Computing with UNIX Tools - Natural Language Processing and Text Mining

12.3 Linguistic Processing with sed

The stream editor sed is the ideal tool for making replacements in texts. It can be
used to mark, isolate, rearrange and replace strings and string patterns in texts. In
this section, we shall exploit sed's capabilities to present a number of small, useful
processing devices for linguistic computing, which range from preprocessing devices
to grammatical analyzers. All of these applications are essentially based upon simple
substitution rules.

In our philosophy of text processing, the text itself becomes a program, e.g., through
the introduction of certain markers, that directs the actions of the UNIX programs
that act on it in a pipe.

12.3.1 Overview of sed Programming

A sed program operates on a file line by line. Roughly speaking, every line of the
input file is stored in a buffer called the pattern space and is worked on there by
every command line of the entire sed program, from top to bottom. This is called
a cycle. Each sed operator that is applied to the content of the pattern space may
alter it, in which case the previous content of the pattern space is lost.
Subsequent sed operators are always applied to the current content of the pattern
space, not to the original input line. After the cycle is over, the resulting pattern
space is delivered to output, i.e., the output file or the next process in the
UNIX pipe mechanism. Consequently, lines that were never altered are copied
unchanged to output by sed.
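The cycle described above can be observed with a small sketch. The function name relabel and the sample words are hypothetical and chosen only for illustration. The second substitution acts on the result of the first, not on the original input line, and the line that matches neither pattern passes through unchanged.

```shell
#!/bin/sh
# relabel is a hypothetical helper used only for this illustration.
# Both commands run on every input line, in order, within one cycle.
relabel() {
    sed -e 's/cat/dog/' -e 's/dog/bird/'
}

# Line 1: "cat" becomes "dog", and the second command then sees "dog".
# Line 2: neither pattern matches, so the line is copied as it is.
printf 'a cat\nno match here\n' | relabel
```

Under these assumptions the first line should come out as "a bird" rather than "a dog", because the pattern space already contains dog when the second command is applied.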

Substitution Programs

The simplest and most commonly used sed programs are short substitution
programs. The following example shows a program that replaces every instance of
the patterns thing and NEWLINE (which match the strings thing and NEWLINE) in a
file (see footnote 8) by NOUN and a newline character, respectively:

#!/bin/sh
sed 's/thing/NOUN/g
s/NEWLINE/\
/g' $1

Explanation: The setup of the entire sed program and the two substitution
commands of this program are very similar to the example in Section 12.2.1. The
first sed command, s/thing/NOUN/g, consists of four parts: (1) s is the sed operator
used and stands for “substitute.” (2) thing is the pattern that is to be substituted. A
detailed listing of legal patterns in sed substitution commands is given in Appendix
A.1. (3) NOUN is the replacement for the pattern. (4) The g means “globally.” Without
the g at the end, only the first occurrence of the pattern in each line would be replaced.
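The effect of the trailing g can be checked with a brief sketch. The function names first_only and every_one, and the sample sentence, are hypothetical and serve only to contrast the two forms of the command.

```shell
#!/bin/sh
# Hypothetical helpers: the same substitution without and with the g flag.
first_only() { sed 's/thing/NOUN/'; }
every_one()  { sed 's/thing/NOUN/g'; }

# Without g, only the first match in the line is replaced.
echo 'a thing is a thing' | first_only

# With g, every match in the line is replaced.
echo 'a thing is a thing' | every_one
```

Note that the pattern thing also matches inside longer words, so a line containing something would be affected as well.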

The second substitution command shows the important technique for
placing newline characters at specific places in text. This can be used to break pieces
of text into fragments on separate lines for further separate processing. Nothing
follows the trailing backslash \, which is part of the sed program; the line must end
immediately after it. See the

8 As before, the symbol/string $1 stands for the filename.
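As a minimal sketch of the newline technique, the following assumes that a marker NEWLINE, surrounded by single spaces, has already been placed in the text. The function name split_at_marker and the sample input are hypothetical.

```shell
#!/bin/sh
# split_at_marker is a hypothetical helper used only for this illustration.
# The backslash must be the last character on its line; the newline
# that follows it becomes the replacement text.
split_at_marker() {
    sed 's/ NEWLINE /\
/g'
}

# Each fragment between markers ends up on a line of its own.
echo 'first clause NEWLINE second clause NEWLINE third clause' | split_at_marker
```

Each resulting line can then be handled separately by the next program in the pipe.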
