Application (Putting non-white strings on separate lines): The following program
is another useful variation of addBlanks. It isolates non-white strings of characters
in a text and puts every such string on a separate line. This is a very good input
format for counting and statistical operations on words. Every range [] in the
following program contains a blank and a tab. We shall call this program
oneItemPerLine.
#!/bin/sh
# oneItemPerLine
sed '/^[ 	]*$/d; s/^[ 	]*//; s/[ 	]*$//; s/[ 	][ 	]*/\
/g' $1
Explanation: First, all white lines are removed by deleting the pattern space
(sed operator d). Deleting the pattern space also terminates the cycle (see
footnote 9), i.e., the remainder of the sed program is not applied to the current
pattern space, the current pattern space is not printed to output, and processing
continues with the next line of input. For non-white lines, white characters at the
beginning and the end of the line are removed. Finally, all remaining strings of
white characters are replaced by newline characters.
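The effect of the four commands can be checked on a small input. The following is only a sketch under stated assumptions: the function name one_item_per_line and the variable ws are hypothetical, the function reads standard input instead of a file $1, and the blank and tab are placed in ws by printf so that they remain visible in the listing.

```shell
#!/bin/sh
# Sketch of oneItemPerLine as a filter on standard input.
# ws (hypothetical helper) holds one blank followed by one tab.
ws=$(printf ' \t')

one_item_per_line() {
    # Inside double quotes, \$ yields a literal $ and \\ yields a
    # single backslash, so sed sees the same script as in the text.
    sed "/^[$ws]*\$/d
s/^[$ws]*//
s/[$ws]*\$//
s/[$ws][$ws]*/\\
/g"
}

# Line 1 has leading, trailing and inner white space,
# line 2 is a white line, line 3 is a single word.
printf '  the\tquick   brown \n \t \nfox\n' | one_item_per_line
# prints:
# the
# quick
# brown
# fox
```

The white second line produces no output at all, because d ends the cycle before the substitutions are reached.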
Remark: Let us note at this point that sed also has an operator to terminate
the program. This is the operator q (quit). For example, sed '5q' fName prints the
first 5 lines of the file fName: since no other action is specified, every line is
simply copied to the output, and sed quits after line 5 has been printed.
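A quick way to observe the behavior of q is to feed sed a numbered input. This is a minimal sketch: the function name first_five is hypothetical, and the input comes from printf rather than from a file fName.

```shell
#!/bin/sh
# Sketch: sed '5q' copies lines 1 to 5 to the output and then quits.
first_five() {
    sed '5q'
}

# Seven input lines; only the first five are printed.
printf '1\n2\n3\n4\n5\n6\n7\n' | first_five
# prints:
# 1
# 2
# 3
# 4
# 5
```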
Application (Normalizing phrases/items on separate lines): The following sh
program, which removes superfluous blanks and tabs in a file $1, is somewhat the
inverse of addBlanks. In what follows, we shall refer to this program as
adjustBlankTabs. Every range [] contains a blank and a tab.
#!/bin/sh
# adjustBlankTabs
sed 's/^[ 	]*//; s/[ 	]*$//; s/[ 	][ 	]*/ /g' $1
Explanation: All leading and trailing white space (blanks and tabs) is removed
first. Finally, every remaining string of white characters is replaced by a single
blank in the last substitution command.
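The three substitutions can be tried on one irregularly typed line. This is a sketch under assumptions: the function name adjust_blank_tabs and the variable ws are hypothetical, the function reads standard input instead of a file $1, and ws is filled by printf so that the blank and the tab are visible in the listing.

```shell
#!/bin/sh
# Sketch of adjustBlankTabs as a filter on standard input.
# ws (hypothetical helper) holds one blank followed by one tab.
ws=$(printf ' \t')

adjust_blank_tabs() {
    # \$ inside double quotes passes a literal $ (end of line) to sed.
    sed "s/^[$ws]*//; s/[$ws]*\$//; s/[$ws][$ws]*/ /g"
}

# Mixed blanks and tabs before, inside and after the phrase.
printf '\t Dear   Sir \t or\tMadam  \n' | adjust_blank_tabs
# prints: Dear Sir or Madam
```

Unlike oneItemPerLine, this program has no d command, so a white line is not dropped; it is reduced to an empty line.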
Justification: adjustBlankTabs standardizes and minimizes phrases (as strings)
which may automatically be obtained from e-mail messages with inconsistent typing
style or text files that have been justified left and right. This is useful if one wants
to analyze sentences or derive statistics for phrases which should be processed as
unique strings of characters.
Technique: The following program replaces @ by @@, # by #@, and _ by ## in an
input file, i.e., each of the single characters @, #, and _ is replaced by the
corresponding pair (consisting of the characters @ and # only). The substitutions
are applied in the order of the commands, from left to right. In what follows, we
shall refer to this program as hideUnderscore.
#!/bin/sh
# hideUnderscore
sed 's/@/@@/g; s/#/#@/g; s/_/##/g' $1
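The following sketch traces the three substitutions on a single line. The function name hide_underscore is hypothetical, and the function reads standard input instead of a file $1; the sed script itself is the one given above.

```shell
#!/bin/sh
# Sketch of hideUnderscore as a filter on standard input.
hide_underscore() {
    sed 's/@/@@/g; s/#/#@/g; s/_/##/g'
}

# a_b@c#d -> a_b@@c#d -> a_b@@c#@d -> a##b@@c#@d
printf 'a_b@c#d\n' | hide_underscore
# prints: a##b@@c#@d
```

The order of the commands matters. If _ were replaced by ## first, the second command would then rewrite each of the new # characters as #@. After the program has run, the output contains no underscore.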
9 See the definition of “cycle” at the beginning of Section 12.3.1.