2: # leaveOnlyWords
3: sed 's/[^A-Za-z.'\''/-][^A-Za-z.'\''/-]*/ /g
4: s/\([A-Za-z][A-Za-z]*\)\.\([A-Za-z][A-Za-z]*\)\./\1_\2_/g
5: s/\([A-Za-z][A-Za-z]*_[A-Za-z][A-Za-z]*\)\./\1_/g
6: s/Am\./Am_/g; s/Ave\./Ave_/g; s/Bart\./Bart_/g;
7: # The list of substitution commands continues ...
8: s/vols\./vols_/g; s/vs\./vs_/g; s/wt\./wt_/g;
9: s/\./ /g; s/_/./g
10: s/\([A-Za-z]\)\-\([A-Za-z]\)/\1_\2/g; s/\-/ /g; s/_/-/g
11: s/\([A-Za-z]\)\/\([A-Za-z]\)/\1_\2/g; s/\// /g; s/_/\//g
12: s/\([A-Za-z]\)'\''\([A-Za-z]\)/\1_\2/g; s/'\''/ /g; s/_/'\''/g
13: ' $1
Explanation: First, every maximal string of characters that are not letters,
periods, apostrophes, slashes or hyphens is replaced by a single blank (line 3).
At this point, the pattern space contains no underscore character, so the
underscore can subsequently serve as a marker. The marker (_) is first used to
stand for period characters that are part of words (abbreviations) and need to be
retained. In line 4, strings of the type letters.letters. are replaced by
letters_letters_, so that v.i.p. becomes v_i_p. (with its final period still in
place). In line 5, strings of the type letters_letters. are replaced by
letters_letters_, so that v_i_p. becomes v_i_p_. Next (lines 6-8) comes a
collection of substitution commands that replace the period in standard
abbreviations with an underscore character. Then (line 9), all remaining period
characters are replaced by blanks (deleted), and subsequently all underscore
characters are replaced by periods (restored). Next (line 10), every hyphen
embedded between two letters is replaced by an underscore character. All other
hyphens are then replaced by blanks (deleted), and subsequently all underscore
characters are replaced by hyphens (restored). Finally (lines 11-12), the slash
(encoded as \/) and the apostrophe (encoded as '\'', cf. Section 12.3.1) are
treated in the same way as the hyphen.
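The protect/delete/restore pattern described above can be tried in isolation. The following is a minimal sketch, not part of the original program: the function names keepAbbreviationPeriods and keepEmbeddedHyphens are hypothetical, the list of standard abbreviations is omitted, and the input is assumed to contain no underscore characters (which the first substitution of leaveOnlyWords guarantees in the full program).

```shell
#!/bin/sh
# Sketch of the protect/delete/restore marker technique.
# Both function names are hypothetical and chosen for illustration only.
# Assumption: the input contains no "_" characters.

# Periods: mark the periods of dotted abbreviations with "_",
# blank out every other period, then turn the markers back into periods.
keepAbbreviationPeriods() {
    sed 's/\([A-Za-z][A-Za-z]*\)\.\([A-Za-z][A-Za-z]*\)\./\1_\2_/g
         s/\([A-Za-z][A-Za-z]*_[A-Za-z][A-Za-z]*\)\./\1_/g
         s/\./ /g
         s/_/./g'
}

# Hyphens: mark hyphens that sit between two letters with "_",
# blank out every other hyphen, then turn the markers back into hyphens.
keepEmbeddedHyphens() {
    sed 's/\([A-Za-z]\)-\([A-Za-z]\)/\1_\2/g
         s/-/ /g
         s/_/-/g'
}

echo 'The v.i.p. left.' | keepAbbreviationPeriods
echo 'a well-known - trick' | keepEmbeddedHyphens
```

In the first call the periods of v.i.p. survive while the sentence-final period becomes a blank; in the second, the hyphen of well-known survives while the free-standing hyphen becomes a blank.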
Example: The following program finds all four-letter words in a text. The
program shows the usefulness of, in particular, addBlanks in simplifying pattern
matching. We shall refer to it as findFourLetterWords.
#!/bin/sh
# findFourLetterWords (sed version)
leaveOnlyWords $1 | addBlanks - |
sed 's/ \([A-Za-z][a-z][a-z][a-z]\) /_\1/g;
s/ [^_][^_]* //g;
/^$/d;
s/_/ /g;
=' - |
sed 'N;
s/\n/ /' -
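The last two stages of this pipeline (dropping empty lines, printing line numbers with =, and joining each number to its line with N) can be tried on their own. The following is a minimal sketch under that assumption; the function name numberNonEmptyLines is hypothetical and does not appear in the original program.

```shell
#!/bin/sh
# Sketch: number only the non-empty lines of the input.
# numberNonEmptyLines is a hypothetical name used for illustration.
numberNonEmptyLines() {
    # First sed: delete empty lines (d ends the cycle, so = is never
    # reached for them); for every other line print its line number
    # on a line of its own, followed by the line itself.
    sed '/^$/d
         =' |
    # Second sed: N appends the next input line to the pattern space,
    # and the substitution turns the embedded newline into a blank,
    # so each number ends up in front of its line.
    sed 'N
         s/\n/ /'
}

printf 'alpha\n\nbeta\n' | numberNonEmptyLines
```

For this three-line input, the empty second line produces no output at all, and the remaining lines keep their original line numbers 1 and 3.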
Explanation: The first sed program acts as follows: 1) All four-letter words are
marked with a leading underscore character. 2) All unmarked words are deleted. 3)
Resulting empty pattern spaces (lines) are deleted, which also means that the
cycle is interrupted and neither the line nor the corresponding line number is
subsequently printed. 4) Underscore characters in the pattern space are replaced
by blanks. 5) Using the sed operator =, the line number is printed before the
pattern space. This occurs only if a four-letter word was found on the line. The
output is piped into the second sed program, which merges corresponding numbers
and lines: 1) Using the sed operator N (new line appended), every second line in
the pipe (i.e.,