Linguistic Computing with UNIX Tools - Natural Language Processing and Text Mining

The implementations of hideUnderscore and hideAbbreviations have been discussed
above. Compare also the listing of leaveOnlyWords given above. hideNumbers
replaces, e.g., the string $1.000.000 by $1_000_000, thus “hiding” the decimal
points in numbers. The next sed program, listed below, defines the ends of sentences.
It is the most important component of the pipe, which we show for reference.

1: sed 's/\([^]})'\''".!?][]}).!?]*\)\([!?]\)
   \([]})]*\)\([^]})'\''".!?]\)/\1\2\3__\2__\
2: \4/g
3: s/\([^]})'\''".!?][]}).!?]*\)\([!?]\)\([]})]*\)$/\1\2\3__\2__/
4: s/\([^]})'\''".!?][]})'\''".!?]*\.
   []})'\''"]*\)\([^]})'\''".!?]\)/\1__.__\
5: \2/g
6: s/[^]})'\''".!?][]})'\''".!?]*\.[]})'\''"]*$/&__.__/' |

Explanation: Line 1 of this listing is broken after \([!?]\), which represents the
end of the sentence. The first two sed commands (lines 1-3) define sentence ends
in “?” and “!”. The two characters are treated alike by using a range [!?], which
is the second tagged entity in the patterns in lines 1 and 3. Thus, the character
ending the sentence is represented by \2. The range-sequence [^]})'\''".!?]
followed by []}).!?]* defines admissible strings before the end of a sentence. It is
the first tagged entity \1 in the patterns in lines 1 and 3. The range-sequence
represents at least one non-closing character, followed by a possible sequence of
allowed closing characters. A sentence may be multiply bracketed in various ways.
This is represented by the range []})]*, which is the third tagged entity \3 in the
patterns in lines 1 and 3. After the possible bracketing is finished, there should
not follow another closing (bracket, quote) or terminating character “.”, “?” or “!”.
(This handles exactly the case of the previous sentence.) This exclusion is encoded
as [^]})'\''".!?] in line 1, and the character it matches is the fourth tagged item
\4. In the substitution part of the sed command in lines 1-2, the originally tagged
sequence (\1\2\3\4) is replaced by \1\2\3__\2__ newline \4. Thus, after the proper
ending of the sentence in \3, a marker __\2__ is introduced for sorting/identification
purposes. Then a newline character is introduced so that the next sentence, starting
in \4, begins on a new line. Line 3 handles the case in which a sentence end in “?”
or “!” coincides with the end of the line.
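As a hedged illustration of the rules for “?” and “!” described above, the following shell sketch wraps only the first two sed commands in a function and applies them to sample lines. The function name mark_question_exclamation and the sample sentences are assumptions made for this example and are not part of the original pipe; tagged entities are written as \( ... \).

```shell
# Sketch only: the "?"/"!" rules (lines 1-3 of the listing) in isolation.
# mark_question_exclamation is a hypothetical name chosen for this example.
mark_question_exclamation() {
  # First command: sentence end followed by more text on the same line.
  # Second command: sentence end that coincides with the end of the line.
  sed 's/\([^]})'\''".!?][]}).!?]*\)\([!?]\)\([]})]*\)\([^]})'\''".!?]\)/\1\2\3__\2__\
\4/g
s/\([^]})'\''".!?][]}).!?]*\)\([!?]\)\([]})]*\)$/\1\2\3__\2__/'
}

printf '%s\n' 'Is it so? Yes! Done' | mark_question_exclamation
# prints:
# Is it so?__?__
#  Yes!__!__
#  Done

printf '%s\n' 'Really?' | mark_question_exclamation
# prints:
# Really?__?__
```

In the first sample, every line after the first begins with the blank that was matched as \4, since that character is moved to the new line unchanged. The second sample is handled by the end-of-line rule alone.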

Line 4 of this listing is broken after \., which represents the period (and not
an arbitrary character) ending the sentence. The last two substitution rules, in
lines 4-6, mark sentences that end in a period. They differ from those for “?” and
“!”, but the principles are similar. In line 4, the range-sequence [^]})'\''".!?]
followed by []})'\''".!?]* defines admissible strings before the end of a sentence.
The range-sequence represents at least one non-closing character, followed by a
possible sequence of allowed closing characters. Then the closing period is explicitly
encoded as \. (an escaped period). The range-sequence []})'\''"]* (closing brackets
and quotes) followed by [^]})'\''".!?] (non-closing character) defines admissible
strings after the end of a sentence. Line 6 handles the case in which the sentence
end coincides with the end of the line.
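The period rules can be tried in isolation in the same way. The sketch below is an illustration under stated assumptions: the function name mark_period and the sample sentence are made up for this example, and the input is assumed to contain no periods from abbreviations or numbers, since the earlier stages of the pipe (hideAbbreviations, hideNumbers) are described as hiding those.

```shell
# Sketch only: the period rules (lines 4-6 of the listing) in isolation.
# mark_period is a hypothetical name chosen for this example.
mark_period() {
  # First command: period-terminated sentence followed by more text.
  # Second command: period-terminated sentence at the end of the line.
  sed 's/\([^]})'\''".!?][]})'\''".!?]*\.[]})'\''"]*\)\([^]})'\''".!?]\)/\1__.__\
\2/g
s/[^]})'\''".!?][]})'\''".!?]*\.[]})'\''"]*$/&__.__/'
}

printf '%s\n' 'It rains. We stay in.' | mark_period
# prints:
# It rains.__.__
#  We stay in.__.__
```

Here the marker is the fixed string __.__ rather than __\2__, because the terminating character is always a period and need not be tagged separately.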

Next follows an awk program in the pipe which is shown below:

awk 'BEGIN { ORS=" " }
{ print }
