Linguistic Computing with UNIX Tools - Natural Language Processing and Text Mining


The following program is the inverse of hideUnderscore. In what follows, we shall
refer to this inverse program as restoreUnderscore. To verify that the program
is correct, observe that sed scans the pattern space from left to right.

#!/bin/sh
# restoreUnderscore
sed 's/##/_/g;   s/#@/#/g;   s/@@/@/g' $1
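A quick way to check that the two programs are inverses is to pipe text through both. The sketch below assumes that hideUnderscore performs the mirror-image substitutions in the opposite order; that body is an assumption, since its listing is not part of this section. The function names hide and restore are illustrative only.

```sh
#!/bin/sh
# Round-trip check (sketch). The body of hide is an ASSUMPTION about
# hideUnderscore: the mirror image of restoreUnderscore, in reverse order.
hide()    { sed 's/@/@@/g; s/#/#@/g; s/_/##/g'; }
restore() { sed 's/##/_/g; s/#@/#/g; s/@@/@/g'; }

# The sample contains _, # and @, including the tricky sequence "_#".
printf '%s\n' 'a_b #1 x@y _#' | hide
printf '%s\n' 'a_b #1 x@y _#' | hide | restore
```

The first command should print the text with every underscore hidden, and the second should reproduce the input unchanged. Under the assumed hide step, the sequence _# becomes ###@. Because sed scans from left to right, the leading ## is turned back into the underscore and the remaining #@ into #.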

Application (using a hidden character as a marker in text): Being able to let

a character (here the underscore) “disappear” in text at the beginning of a pipe

is extremely useful. That character can be used to “break” complicated, general

patterns to mark exceptions. See the use of this technique in the implementations

of leaveOnlyWords and markDeterminers in Section 12.3.3. Entities that have been

recognized in text can be marked by keywords of the sort _NOUN_. Framed by underscore

characters, these keywords are easily distinguishable from regular words in the

text. At the end of the pipe, all keywords are usually gone or properly formatted,

and the “missing” character is restored.
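The marker technique described above can be sketched as a four-stage pipe. Everything specific here is an assumption made for illustration: the hide step (an assumed form of hideUnderscore), the keyword _DET_, the bracket output format, and the function name markThe, which is hypothetical and is not the markDeterminers program of Section 12.3.3.

```sh
#!/bin/sh
# Sketch of the marker technique. The tag _DET_, the bracket format and the
# hide step (assumed form of hideUnderscore) are illustrative choices.
markThe() {
    sed 's/@/@@/g; s/#/#@/g; s/_/##/g' |    # hide literal underscores
    sed 's/ the / _DET_the /g' |            # mark a recognized entity with a keyword
    sed 's/_\([A-Z][A-Z]*\)_/[\1]/g' |      # format every _KEYWORD_ as [KEYWORD]
    sed 's/##/_/g; s/#@/#/g; s/@@/@/g'      # restore literal underscores
}

printf '%s\n' 'print the _ID_ of the user' | markThe
# prints: print [DET]the _ID_ of [DET]the user
```

The literal _ID_ in the input survives untouched because it is hidden as ##ID## while the keyword formatting step runs.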

Another application is to recognize the ends of sentences in the case of the
period character. The period also appears in numbers and in abbreviations.
First replacing the period in the latter two cases by an underscore character and
then interpreting the remaining periods as markers for the ends of sentences is,
with minor additions, one way to generate a file which contains one entire
sentence per line.
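A minimal sketch of this sentence-splitting idea follows. It is not the authors' program: the function name splitSentences is hypothetical, the abbreviation list contains only "Dr." as a sample, and the hide step is an assumed form of hideUnderscore. A realistic version would need a longer abbreviation list and rules for other sentence-ending punctuation.

```sh
#!/bin/sh
# Sketch only: one sentence per line. The abbreviation list (just "Dr.") and
# the hide step (assumed form of hideUnderscore) are illustrative assumptions.
splitSentences() {
    sed 's/@/@@/g; s/#/#@/g; s/_/##/g' |          # hide literal underscores
    sed -e 's/\([0-9]\)\.\([0-9]\)/\1_\2/g' \
        -e 's/Dr\./Dr_/g' |                       # protect periods in numbers and abbreviations
    sed 's/\.[ ][ ]*/.\
/g' |                                             # break the line after each remaining period
    sed 's/_/./g' |                               # turn protected periods back into periods
    sed 's/##/_/g; s/#@/#/g; s/@@/@/g'            # restore literal underscores
}

printf '%s\n' 'Dr. Lee paid 3.50 dollars. He left. See file_a.' | splitSentences
```

For this input the sketch should print three lines: "Dr. Lee paid 3.50 dollars.", "He left." and "See file_a.".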

12.3.3 Tagging Linguistic Items

The tagged regular expression mechanism is the most powerful programming device
in sed. This mechanism is not available in such simplicity in awk. It can be used
to extend, divide and rearrange patterns and their parts. Up to nine chunks of the
pattern in a substitution command can be framed (tagged) using the strings \( and \).

Example: Consider the pattern /[0-9][0-9]*\.[0-9]*/ which matches decimal
numbers such as 10. or 3.1415. Tagging the integer part [0-9][0-9]* (i.e., what is
positioned left of the period character) in the above pattern yields
/\([0-9][0-9]*\)\.[0-9]*/.

The tagged and matched (recognized) strings can be reused in the pattern and
the replacement in the substitution command as \1, \2, \3 ... counting from left to
right. Note that the order of \1 ... \9, which stand for the tagged regular
sub-expressions, need not be retained. Thus, rearrangement of tagged expressions is
possible in the replacement in a substitution command.
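Two short illustrations of reusing tagged parts in the replacement follow. The sample inputs are made up for this sketch; the first command tags the integer part of a decimal number, and the second shows two tags reused in reverse order.

```sh
#!/bin/sh
# Keep only the tagged integer part of a decimal number.
printf '%s\n' 'pi is 3.1415' | sed 's/\([0-9][0-9]*\)\.[0-9]*/\1/'
# prints: pi is 3

# Two tags, reused in reverse order: "Last, First" becomes "First Last".
printf '%s\n' 'Chomsky, Noam' | sed 's/\([A-Z][a-z]*\), \([A-Z][a-z]*\)/\2 \1/'
# prints: Noam Chomsky
```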

Example: The substitution command s/\(.\)\1/DOUBLE\1/g matches double
characters such as oo, 11 or && in the pattern /\(.\)\1/ and replaces them with
DOUBLEo, DOUBLE1 or DOUBLE& respectively. More detail about the usage of tagged
regular expressions is given in the following three examples.
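The double-character substitution can be tried directly on the command line. The input string here is only a made-up sample.

```sh
#!/bin/sh
# \(.\) tags any single character; \1 in the pattern requires the same
# character to follow immediately, and \1 in the replacement reuses it.
printf '%s\n' 'book 11 &&' | sed 's/\(.\)\1/DOUBLE\1/g'
# prints: bDOUBLEok DOUBLE1 DOUBLE&
```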

Application (identifying words in text): The following program shows how one

can properly identify words in text. We shall refer to it as leaveOnlyWords. (This

is the longest program listing in this chapter.)

1: #!/bin/sh
