The following program is the inverse of hideUnderscore. In what follows, we shall
refer to this inverse program as restoreUnderscore. To verify that the program is
correct, observe that sed scans the pattern space from left to right.
#!/bin/sh
# restoreUnderscore
sed 's/##/_/g;
s/#@/#/g;
s/@@/@/g' $1
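A quick round-trip check, as a minimal sketch. The hideUnderscore program is not reproduced in this section, so the function hide_underscore below is only an assumed form of it (the three substitutions of restoreUnderscore reversed). The function names and the sample string are illustrative, not part of the original programs.

```shell
#!/bin/sh
# Assumed form of hideUnderscore: escape @ first, then #, then hide _ as ##.
hide_underscore() {
    sed 's/@/@@/g; s/#/#@/g; s/_/##/g'
}

# Same three substitutions as restoreUnderscore, reading standard input.
restore_underscore() {
    sed 's/##/_/g; s/#@/#/g; s/@@/@/g'
}

sample='a_b #1 x@y.z'
hidden=$(printf '%s\n' "$sample" | hide_underscore)
restored=$(printf '%s\n' "$hidden" | restore_underscore)

printf 'hidden:   %s\n' "$hidden"     # no underscore left in this string
printf 'restored: %s\n' "$restored"   # identical to the sample
```

Under this assumption the hidden text contains no underscore at all, so the underscore is free to serve as a marker in the middle of the pipe.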
Application (using a hidden character as a marker in text): Being able to let
a character (here the underscore) “disappear” in text at the beginning of a pipe
is extremely useful. That character can be used to “break” complicated, general
patterns to mark exceptions. See the use of this technique in the implementations
of leaveOnlyWords and markDeterminers in Section 12.3.3. Entities that have been
recognized in text can be marked by keywords of the sort _NOUN_. Framed by underscore
characters, these keywords are easily distinguishable from regular words in the
text. At the end of the pipe, all keywords are usually gone or properly formatted,
and the “missing” character is restored.
Another application is to recognize the ends of sentences in the case of the
period character. The period also appears in numbers and in abbreviations. First
replacing the period in the two latter cases by an underscore character and
then interpreting each remaining period as a marker for the end of a sentence is,
with minor additions, one way to generate a file which contains one entire sentence
per line.
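The idea can be sketched as follows. This is only an illustration under stated assumptions: the input contains no underscore (for example because hideUnderscore was applied first), the abbreviation list is reduced to the two hypothetical entries Dr. and Mr., and a sentence end is taken to be a period followed by at least one blank. The function name one_sentence_per_line is made up for this sketch.

```shell
#!/bin/sh
# Sketch: put one sentence per line.
# Assumption: the input contains no underscore characters.
one_sentence_per_line() {
    # 1. Protect periods between digits (3.50 becomes 3_50).
    # 2. Protect periods in two sample abbreviations.
    # 3. Break the line after every remaining period followed by blanks.
    # 4. Turn the protected periods back into periods.
    sed 's/\([0-9]\)\.\([0-9]\)/\1_\2/g
s/Dr\./Dr_/g
s/Mr\./Mr_/g
s/\.  */.\
/g
s/_/./g'
}

printf '%s\n' 'Dr. Lee paid 3.50 dollars. Mr. Kim left.' | one_sentence_per_line
```

The backslash followed by a newline inside the fourth substitution is the portable way to insert a line break in a sed replacement.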
12.3.3 Tagging Linguistic Items
The tagged regular expression mechanism is the most powerful programming device
in sed. This mechanism is not available in such simplicity in awk. It can be used
to extend, divide and rearrange patterns and their parts. Up to nine chunks of the
pattern in a substitution command can be framed (tagged) using the strings \( and \).
Example: Consider the pattern /[0-9][0-9]*\.[0-9]*/ which matches decimal
numbers such as 10. or 3.1415. Tagging the integer part [0-9][0-9]* (i.e., what is
positioned left of the period character) in the above pattern yields
/\([0-9][0-9]*\)\.[0-9]*/.
The tagged and matched (recognized) strings can be reused in the pattern and
in the replacement of the substitution command as \1, \2, \3, ... counting from left
to right. The order of \1 ... \9, which stand for the tagged regular sub-expressions,
need not be retained. Thus, tagged expressions can be rearranged in the replacement
of a substitution command.
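A minimal sketch of both uses, based on the decimal-number pattern from the example above. The function names integer_part and swap_parts and the sample input are illustrative assumptions.

```shell
#!/bin/sh
# Keep only the tagged integer part of each decimal number.
integer_part() {
    sed 's/\([0-9][0-9]*\)\.[0-9]*/\1/g'
}

# Tag both parts and emit them in reverse order, separated by a comma.
swap_parts() {
    sed 's/\([0-9][0-9]*\)\.\([0-9]*\)/\2,\1/g'
}

printf '%s\n' 'pi is 3.1415' | integer_part   # prints: pi is 3
printf '%s\n' 'pi is 3.1415' | swap_parts     # prints: pi is 1415,3
```

In swap_parts the replacement uses \2 before \1, which shows that the tags may appear in any order.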
Example: The substitution command s/\(.\)\1/DOUBLE\1/g matches double
characters such as oo, 11 or && with the pattern /\(.\)\1/ and replaces them with
DOUBLEo, DOUBLE1 or DOUBLE& respectively. More detail about the usage of tagged
regular expressions is given in the following three examples.
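The substitution command from this example can be tried directly. The wrapper function mark_doubles and the sample input are assumptions made for illustration.

```shell
#!/bin/sh
# Replace each pair of identical adjacent characters by DOUBLE plus one copy.
mark_doubles() {
    sed 's/\(.\)\1/DOUBLE\1/g'
}

printf '%s\n' 'book 11 &&' | mark_doubles   # prints: bDOUBLEok DOUBLE1 DOUBLE&
```

Here \1 is used twice: inside the pattern it requires the second character to repeat the first, and inside the replacement it reproduces that character once.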
Application (identifying words in text): The following program shows how one
can properly identify words in text. We shall refer to it as leaveOnlyWords. (This
is the longest program listing in this chapter.)
1: #!/bin/sh