encoded as itself. The representations of the slash / and backslash \ characters in sed
programs are \/ and \\, respectively (cf. Appendix A.1).
Application (conditional tagging): A sed program similar to the program above
can be used for conditional tagging. For example, if a file contains one entire sentence
per line, then an Address can be used to conditionally tag (or otherwise process)
certain items/words/phrases in a sentence depending on whether or not that sentence
contains a certain (other) key-item that is identified by the Address in the sed
command.
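As a minimal sketch of conditional tagging, the following example tags the word bank only in those lines (sentences) that also contain the key-item river. The function name tagIfRiver, the key-item, and the tag <loc> are hypothetical choices made for this illustration; they do not come from the text above.

```shell
#!/bin/sh
# Hypothetical example: tag "bank" only in sentences that mention "river".
# The address /river/ selects the lines; the substitution command after it
# is applied to the selected lines only. The slash in </loc> is written \/.
tagIfRiver() {
    sed '/river/s/bank/<loc>bank<\/loc>/g'
}

printf '%s\n' 'The bank of the river was muddy.' \
              'The bank raised its rates.' | tagIfRiver
```

Only the first sample sentence is tagged; the second one passes through unchanged because it does not match the Address. The match is crude in that bank would also be tagged inside a longer word such as banker.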
12.3.2 Preprocessing and Formatting Tools
The next simple examples show how text can be preprocessed with small, customized
sed programs such that the output can be used with much more ease for further
processing in a pipe. Alternatively, the code given below may be included in larger
sed programs when needed. However, dividing processes into small entities as given
in the examples below is a very useful technique to isolate reusable components and
to avoid programming mistakes resulting from over-complexity of single programs.
Application (adding blanks for easier pattern matching): The following sh program
adjusts blanks and tabs in the input file (symbolized by $1) in such a way
that it is better suited for certain searches. This program will often be used in what
follows since it makes matching items considerably easier. We shall
refer to this program as addBlanks. All ranges [] in the sed program contain a
blank and a tab.
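#!/bin/sh
# addBlanks
# Each range [] below holds one blank and one tab;
# the replacement in the first substitution is two blanks.
sed 's/[ 	][ 	]*/  /g; s/^ */ /;
s/ *$/ /; s/^ *$//' $1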
Explanation: First, all strings consisting only of blanks or tabs are normalized to
two blanks. Then, a single blank is placed at the beginning and the end of the pattern
space. Finally, any resulting white pattern space is cleared in the last substitution
command.
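The effect of the four substitutions can be checked on a small sample. The following is a sketch only: it wraps the same commands in a shell function that reads standard input, and it builds the tab with printf so that no literal tab has to be typed. The variable TAB and the sample line are assumptions made for this illustration.

```shell
#!/bin/sh
# Sketch of addBlanks as a filter on standard input.
TAB=$(printf '\t')   # a literal tab character, generated instead of typed

addBlanks() {
    # 1. every run of blanks/tabs becomes two blanks
    # 2. exactly one blank at the beginning of the line
    # 3. exactly one blank at the end of the line
    # 4. a line that is now only blanks is emptied
    sed -e "s/[ ${TAB}][ ${TAB}]*/  /g" \
        -e 's/^ */ /' \
        -e 's/ *$/ /' \
        -e 's/^ *$//'
}

# Sample input: a tab after "Liberal" and a double blank after "views".
printf 'Liberal\tviews  are liberal\n' | addBlanks
```

In the output, the words are separated by two blanks each, and the line starts and ends with a single blank. A line that is empty or contains only blanks and tabs comes out empty.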
Justification: Suppose one wants to search in a file for occurrences of the word
“liberal.” In order to accurately identify the strings Liberal and liberal in raw
text, one needs the following four patterns (compare Appendix A.1):
/[^A-Za-z][Ll]iberal[^A-Za-z]/ /^[Ll]iberal[^A-Za-z]/
/[^A-Za-z][Ll]iberal$/ /^[Ll]iberal$/
If one preprocesses the source file with addBlanks, only the first pattern is needed,
since every word, including the first and the last one on a line, is then surrounded by blanks.
Thus, a sed-based search program for Liberal and liberal is shorter and faster.
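The claim can be tried out with a short pipe. The sketch below assumes a stdin-reading variant of addBlanks (with the tab generated by printf) and a hypothetical helper findLiberal that prints every normalized line matching the first pattern; the sample lines are made up for this illustration.

```shell
#!/bin/sh
# Sketch: after normalizing with addBlanks, a single pattern finds the word
# at the start, in the middle, or at the end of a line.
TAB=$(printf '\t')

addBlanks() {
    sed -e "s/[ ${TAB}][ ${TAB}]*/  /g" -e 's/^ */ /' \
        -e 's/ *$/ /' -e 's/^ *$//'
}

findLiberal() {
    # -n suppresses default output; p prints only the matching lines.
    addBlanks | sed -n '/[^A-Za-z][Ll]iberal[^A-Za-z]/p'
}

printf '%s\n' 'Liberal' 'a liberal view' 'illiberal' \
              'neoliberalism' 'truly liberal' | findLiberal
```

Of the five sample lines, three are printed (in normalized form): the one consisting of Liberal alone, the one with liberal in the middle, and the one ending in liberal. The lines illiberal and neoliberalism are rejected because a letter is adjacent to the match.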
Application (finding words in a text in a crude fashion): The following program
is a variation of addBlanks. It can be used to isolate words in text in a somewhat
crude fashion. In particular, abbreviations and words that contain a hyphen, a slash (e.g.,
A/C), or an apostrophe are not properly identified.
#!/bin/sh
# leaveOnlyWords (crude implementation)
# The replacement in the first substitution is two blanks.
sed 's/[^A-Za-z][^A-Za-z]*/  /g; s/^ */ /;
s/ *$/ /; s/^ *$//' $1
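The crude behavior is easy to observe on a sentence that contains a slash, an apostrophe, and a number. The following sketch uses a stdin-reading variant of leaveOnlyWords; the sample sentence is an assumption made for this illustration.

```shell
#!/bin/sh
# Sketch of leaveOnlyWords as a filter on standard input.
leaveOnlyWords() {
    # Every run of non-letters becomes two blanks; the remaining commands
    # are the same as in addBlanks.
    sed -e 's/[^A-Za-z][^A-Za-z]*/  /g' \
        -e 's/^ */ /' \
        -e 's/ *$/ /' \
        -e 's/^ *$//'
}

printf '%s\n' "Turn on the A/C, it's 90 degrees." | leaveOnlyWords
```

In the output, A/C is split into the two "words" A and C, it's is split into it and s, and the number 90 disappears together with the punctuation.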