Linguistic Computing with UNIX Tools - Natural Language Processing and Text Mining

Application (Putting non-white strings on separate lines): The following program is another useful variation of addBlanks. It isolates the non-white strings of characters in a text and puts each such string on a separate line. This is a convenient input format for counting and statistical operations on words. Every range [] in the following program contains a blank and a tab. We shall call this program oneItemPerLine.

#!/bin/sh
# oneItemPerLine
sed '/^[ 	]*$/d; s/^[ 	]*//; s/[ 	]*$//; s/[ 	][ 	]*/\
/g' $1

Explanation: First, all white lines are removed by deleting the pattern space (sed operator d). Deleting the pattern space also terminates the cycle (see footnote 9), i.e., the remainder of the sed program is not applied to the current pattern space, the current pattern space is not printed to output, and processing continues with the next line of input. For non-white lines, white characters at the beginning and the end of the line are removed. Finally, all remaining strings of white characters are replaced by newline characters.
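
The counting use mentioned above can be sketched as follows. This is a minimal illustration, not the author's program: the function names one_item_per_line and count_words, the TAB variable, and the sample text are assumptions made for this example. The sed commands are those of oneItemPerLine, rewritten to read standard input and to build the blank-and-tab range from a variable so that the tab stays visible.

```shell
#!/bin/sh
# Sketch only: one_item_per_line, count_words and TAB are illustrative names.
TAB=$(printf '\t')

# Same sed commands as oneItemPerLine, reading standard input.
# Inside double quotes, \$ yields a literal $ and \\ yields a literal \.
one_item_per_line() {
    sed "/^[ $TAB]*\$/d; s/^[ $TAB]*//; s/[ $TAB]*\$//; s/[ $TAB][ $TAB]*/\\
/g"
}

# Count each item, then list the most frequent items first.
# awk normalizes the padding that uniq -c puts in front of the counts.
count_words() {
    one_item_per_line | sort | uniq -c | awk '{print $1, $2}' | sort -k1,1nr -k2,2
}

printf '  the cat\tsaw  the dog \n\n \t \nthe end\n' | count_words
# prints:
# 3 the
# 1 cat
# 1 dog
# 1 end
# 1 saw
```

The empty line and the line consisting only of blanks and a tab contribute nothing to the counts, because the d command removes them before any substitution is applied.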

Remark: Note that sed also has an operator that terminates the program: q (quit). For example, sed '5q' fName prints the first 5 lines of the file fName. Each line is copied to the output as usual (no other action is specified), and sed quits after printing line 5.
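
To make the behaviour of q concrete, here is a small sketch that compares sed '5q' with the equivalent range print sed -n '1,5p'. The helper names sample_lines, first_five_q, and first_five_p are assumptions for this example only.

```shell
#!/bin/sh
# Sketch only: helper names are illustrative, not from the text.

# Produce eight numbered sample lines.
sample_lines() {
    i=1
    while [ "$i" -le 8 ]; do
        echo "line $i"
        i=$((i + 1))
    done
}

# Default printing stays on; sed quits after printing line 5.
first_five_q() { sed '5q'; }

# Default printing is off (-n); only lines 1 to 5 are printed explicitly.
first_five_p() { sed -n '1,5p'; }

sample_lines | first_five_q
```

Both functions print the same five lines. The difference is that the q variant stops reading input at line 5, whereas the -n variant reads the remaining input to the end without printing it.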

Application (Normalizing phrases/items on separate lines): The following sh program, which removes superfluous blanks and tabs in a file $1, is in a sense the inverse of addBlanks. In what follows, we shall refer to this program as adjustBlankTabs. Every range [] contains a blank and a tab.

#!/bin/sh
# adjustBlankTabs
sed 's/^[ 	]*//; s/[ 	]*$//; s/[ 	][ 	]*/ /g' $1

Explanation: All leading and trailing white space (blanks and tabs) is removed first. The last substitution command then replaces every remaining white string by a single blank.
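
The effect of the three substitutions can be checked on a single line of sample input. The sketch below assumes a function name adjust_blank_tabs, a TAB variable, and a sample sentence, none of which appear in the text. It reads standard input instead of a file $1.

```shell
#!/bin/sh
# Sketch only: adjust_blank_tabs and TAB are illustrative names.
TAB=$(printf '\t')

# Same sed commands as adjustBlankTabs, reading standard input.
adjust_blank_tabs() {
    sed "s/^[ $TAB]*//; s/[ $TAB]*\$//; s/[ $TAB][ $TAB]*/ /g"
}

printf '  to   be\tor \t not  to be \t\n' | adjust_blank_tabs
# prints: to be or not to be
```

Unlike oneItemPerLine, this program has no d command, so a line consisting only of blanks and tabs is not deleted but reduced to an empty line.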

Justification: adjustBlankTabs standardizes and minimizes phrases (as strings) that may have been obtained automatically from e-mail messages with inconsistent typing style, or from text files that have been justified left and right. This is useful if one wants to analyze sentences or derive statistics for phrases, each of which should be processed as a unique string of characters.
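
As a rough illustration of why this matters for phrase statistics, the following sketch counts distinct lines before and after normalization. The function names normalize and count_distinct and the three sample lines are assumptions for this example. The sed commands are those of adjustBlankTabs.

```shell
#!/bin/sh
# Sketch only: normalize, count_distinct and the sample are illustrative.
TAB=$(printf '\t')

# The adjustBlankTabs substitutions, reading standard input.
normalize() {
    sed "s/^[ $TAB]*//; s/[ $TAB]*\$//; s/[ $TAB][ $TAB]*/ /g"
}

# Number of distinct lines on standard input.
count_distinct() {
    sort -u | wc -l | tr -d ' '
}

# Three typings of the same phrase.
sample() {
    printf 'natural language processing\n'
    printf '  natural   language\tprocessing\n'
    printf 'natural language  processing  \n'
}

echo "distinct before: $(sample | count_distinct)"
echo "distinct after:  $(sample | normalize | count_distinct)"
# prints:
# distinct before: 3
# distinct after:  1
```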

Technique: The following program replaces @ by @@, # by #@, and _ by ## in an input file, i.e., each of the single characters @, #, and _ is replaced by the corresponding pair (consisting of the characters @ and # only), with the substitution commands applied in order from left to right. As a result, the output contains no underscore. In what follows, we shall refer to this program as hideUnderscore.

#!/bin/sh
# hideUnderscore
sed 's/@/@@/g; s/#/#@/g; s/_/##/g' $1
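
The encoding can be traced on a sample string. In the sketch below, hide_underscore wraps the sed command above and reads standard input. The function restore_underscore is a hypothetical inverse that is not part of the text shown here. It is included only to suggest that the pairs can be decoded again by undoing the substitutions in reverse order. The sample string is likewise an assumption.

```shell
#!/bin/sh
# Sketch only: function names and sample string are illustrative.

# The hideUnderscore substitutions, reading standard input.
hide_underscore() {
    sed 's/@/@@/g; s/#/#@/g; s/_/##/g'
}

# Hypothetical inverse (an assumption, not from the text):
# decode ## first, then #@, then @@.
restore_underscore() {
    sed 's/##/_/g; s/#@/#/g; s/@@/@/g'
}

sample='user_name@host #tag_1'
encoded=$(printf '%s\n' "$sample" | hide_underscore)
decoded=$(printf '%s\n' "$encoded" | restore_underscore)

echo "$encoded"
# prints: user##name@@host #@tag##1
echo "$decoded"
# prints: user_name@host #tag_1
```

The order of the three substitutions in hideUnderscore matters. If _ were replaced by ## first, the following s/#/#@/g would also rewrite the newly created # characters.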

9 See the definition of “cycle” at the beginning of Section 12.3.1.
