The implementations of hideUnderscore and hideAbbreviations have been discussed
above. Compare also the listing of leaveOnlyWords given above. hideNumbers
replaces, e.g., the string $1.000.000 by $1_000_000, thus “hiding” the periods
that occur inside numbers. The next sed program, listed below, marks the ends of
sentences. It is the most important component of the pipe, and we show it for reference.
1: sed 's/\([^]})'\''".!?][]}).!?]*\)\([!?]\)
\([]})]*\)\([^]})'\''".!?]\)/\1\2\3__\2__\
2: \4/g
3: s/\([^]})'\''".!?][]}).!?]*\)\([!?]\)\([]})]*\)$/\1\2\3__\2__/
4: s/\([^]})'\''".!?][]})'\''".!?]*\.
[]})'\''"]*\)\([^]})'\''".!?]\)/\1__.__\
5: \2/g
6: s/[^]})'\''".!?][]})'\''".!?]*\.[]})'\''"]*$/&__.__/' |
Explanation: Line 1 of this listing is broken after \([!?]\), which represents the
end of the sentence. The first two sed commands (lines 1-3) mark the ends of
sentences that finish with “?” or “!”. Both characters are treated alike by means
of the range [!?], which is the second tagged entity in the patterns in lines
1 and 3. Thus, the character ending the sentence is represented by \2. The
range-sequence [^]})'\''".!?] followed by []}).!?]* defines admissible strings before
the end of a sentence. It is the first tagged entity \1 in the patterns in lines 1
and 3. The range-sequence represents at least one non-closing character, followed
by a possibly empty sequence of allowed closing characters. A sentence may be multiply
bracketed in various ways. This is represented by the range []})]*, which is the
third tagged entity \3 in the patterns in lines 1 and 3. After the possible bracketing
is finished, there should not follow another closing character (bracket, quote) or
terminating character “.”, “?” or “!”. (This handles exactly the case of the previous
sentence.) This exclusion is encoded as [^]})'\''".!?] in line 1 and is
the fourth tagged item \4. In the substitution part of the sed command in lines 1-2,
the originally tagged sequence (\1\2\3\4) is replaced by \1\2\3__\2__ newline \4.
Thus, after the proper ending of the sentence in \3, a marker __\2__ is introduced
for sorting/identification purposes. Then, a newline character is inserted so
that the next sentence, which begins with \4, starts on a new line. Line 3 handles
the case in which a sentence end in “?” or “!” coincides with the end of the line.
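As a rough check of the description above, the two commands for “?” and “!”
(lines 1-3 of the listing) can be run on their own. The sketch below assumes a
POSIX sh and sed. The function name mark_question_exclamation and the sample
input are made up for illustration, and the listing's line numbers are omitted.

```shell
#!/bin/sh
# Lines 1-3 of the listing, without the listing's line numbers.
# The name mark_question_exclamation is hypothetical (illustration only).
# The continuation lines of the sed script must start in column 0.
mark_question_exclamation() {
    sed 's/\([^]})'\''".!?][]}).!?]*\)\([!?]\)\([]})]*\)\([^]})'\''".!?]\)/\1\2\3__\2__\
\4/g
s/\([^]})'\''".!?][]}).!?]*\)\([!?]\)\([]})]*\)$/\1\2\3__\2__/'
}

# Made-up sample: a bracketed question, then an exclamation at the end of the line.
result=$(printf '%s\n' '(Is it so?) Yes!' | mark_question_exclamation)
printf '%s\n' "$result"
```

On this input the first command places __?__ after the closing bracket and
starts a new line, and the second command appends __!__ at the end of the line.
The blank captured as \4 remains at the start of the second output line.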
Line 4 of this listing is broken after \., which represents the period (and not
an arbitrary character) ending the sentence. The last two substitution rules in
lines 4-6, which mark sentences that end in a period, differ from those
for “?” and “!”, but the principles are similar. In line 4, the range-sequence
[^]})'\''".!?] followed by []})'\''".!?]* defines admissible
strings before the end of a sentence. The range-sequence represents at least one
non-closing character, followed by a possibly empty sequence of allowed closing
characters. Then the closing period is explicitly encoded as \. (an escaped period).
The range-sequence []})'\''"]* (closing brackets and quotes) followed by
[^]})'\''".!?] (non-closing character) defines admissible strings after the end of
a sentence. Line 6 handles the case in which the sentence end
coincides with the end of the line.
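To see all four substitution rules working together, they can be applied to a
short sample. The following is a minimal sketch, assuming a POSIX sh and sed
and the widely available mktemp utility. The rules are stored in a script file
so that the single quote can be written plainly instead of as '\''. The file
variable, the function name mark_sentence_ends, and the sample input are
assumptions made for illustration.

```shell
#!/bin/sh
# The four substitution rules of the listing, without line numbers.
# In a sed script file the single quote needs no shell escaping.
# script_file and mark_sentence_ends are hypothetical names.
script_file=$(mktemp)
cat > "$script_file" <<'EOF'
s/\([^]})'".!?][]}).!?]*\)\([!?]\)\([]})]*\)\([^]})'".!?]\)/\1\2\3__\2__\
\4/g
s/\([^]})'".!?][]}).!?]*\)\([!?]\)\([]})]*\)$/\1\2\3__\2__/
s/\([^]})'".!?][]})'".!?]*\.[]})'"]*\)\([^]})'".!?]\)/\1__.__\
\2/g
s/[^]})'".!?][]})'".!?]*\.[]})'"]*$/&__.__/
EOF

mark_sentence_ends() {
    sed -f "$script_file"
}

# Made-up sample: a quoted period, a question, an exclamation, a final period.
result=$(printf '%s\n' 'He said "Stop." Really? Yes! Done.' | mark_sentence_ends)
printf '%s\n' "$result"
rm -f "$script_file"
```

On this input the sketch prints four lines, each ending in a marker: __.__
after the closing quote of "Stop.", then __?__, __!__, and __.__ at the end of
the line. Every line but the first begins with the blank that was matched as
the non-closing character.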
The next component of the pipe is an awk program, which is shown below:
awk 'BEGIN { ORS=" " }
{ print }
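The ORS assignment in the BEGIN block can be tried in isolation. The following
minimal sketch uses only the BEGIN block and the print rule on a made-up
three-line input. It is not meant as the complete awk program of the pipe.

```shell
#!/bin/sh
# ORS is awk's output record separator; print appends it after every record.
# With ORS=" " the input lines are therefore joined by blanks.
# The three-line input is made up for illustration.
result=$(printf '%s\n' 'first line' 'second line' 'third line' |
    awk 'BEGIN { ORS=" " } { print }')

# Brackets make the trailing blank visible.
printf '[%s]\n' "$result"
```

The three input lines come out as a single line, and since ORS is also written
after the last record, the output ends in a blank rather than a newline.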