The implementations of hideUnderscore and hideAbbreviations have been discussed above. Compare also the listing of leaveOnlyWords given above. hideNumbers replaces, e.g., the string $1.000.000 by $1_000_000, thus “hiding” the decimal points in numbers. The next sed program, listed below, defines the ends of sentences. It is the most important component of the pipe, and we show it for reference.
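The implementation of hideNumbers is not reproduced in this section. As a minimal sketch, assuming it only has to rewrite periods that sit between two digits, the replacement described above could be written as follows; the function name hide_numbers and the loop label a are illustrative, not taken from the original pipe.

```shell
# Hypothetical stand-in for hideNumbers (not the original implementation):
# replace every period that sits between two digits by an underscore.
# The loop (label a, branch ta) repeats the substitution until no
# digit-period-digit sequence is left on the line.
hide_numbers() {
  sed -e ':a' -e 's/\([0-9]\)\.\([0-9]\)/\1_\2/' -e 'ta'
}

printf '%s\n' 'It costs $1.000.000.' | hide_numbers
```

This prints It costs $1_000_000. with the final period untouched, since no digit follows it. The sentence-ending listing from the pipe follows.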
1: sed 's/\([^]})'\''".!?][]}).!?]*\)\([!?]\)
\([]})]*\)\([^]})'\''".!?]\)/\1\2\3__\2__\
2: \4/g
3: s/\([^]})'\''".!?][]}).!?]*\)\([!?]\)\([]})]*\)$/\1\2\3__\2__/
4: s/\([^]})'\''".!?][]})'\''".!?]*\.
[]})'\''"]*\)\([^]})'\''".!?]\)/\1__.__\
5: \2/g
6: s/[^]})'\''".!?][]})'\''".!?]*\.[]})'\''"]*$/&__.__/' |
Explanation:
Line 1 of this listing is broken after \([!?]\), which represents the end of the sentence. In the first two sed commands (lines 1-3), the sentence ends for “?” and “!” are defined. The similar treatment of “?” and “!” is implemented by using a range [!?], which is the second tagged entity in the patterns in lines 1 and 3. Thus, the character ending the sentence is represented by \2. The range-sequence [^]})'\''".!?] followed by []}).!?]* defines admissible strings before the end of a sentence. It is the first tagged entity \1 in the patterns in lines 1 and 3. The range-sequence represents at least one non-closing character, followed by a possible sequence of allowed closing characters. A sentence may be multiply bracketed in various ways. This is represented by the range []})]*, which is the third tagged entity \3 in the patterns in lines 1 and 3. After the possible bracketing is finished, there should not follow another closing character (bracket, quote) or terminating character “.”, “?” or “!”. (This handles exactly the case of the previous sentence.) The following character, which must be neither closing nor terminating, is encoded as [^]})'\''".!?] in line 1, and is the fourth tagged item \4.
In the substitution part of the sed command in lines 1-2, the originally tagged sequence (\1\2\3\4) is replaced by \1\2\3__\2__ newline \4. Thus, after the proper ending of the sentence in \3, a marker __\2__ is introduced for sorting/identification purposes. Then, a newline character is introduced such that the next sentence starting in \4 starts on a new line. Line 3 handles the case when the sentence end in “?” or “!” coincides with the end of the line.
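The two rules for “?” and “!” described above can be tried in isolation. The following is a simplified sketch, not the listing itself: the quote characters are left out of the ranges so that the script needs no '\'' escapes, and the function name mark_qe is hypothetical.

```shell
# Simplified sketch of the "?" / "!" rules (quotes omitted from the ranges).
# Rule 1: sentence end followed by more text on the same line;
#         the marker __\2__ is appended and \4 is moved to a new line.
# Rule 2: sentence end that coincides with the end of the line.
mark_qe() {
  sed -e 's/\([^])}.!?][])}.!?]*\)\([!?]\)\([])}]*\)\([^])}.!?]\)/\1\2\3__\2__\
\4/g' \
      -e 's/\([^])}.!?][])}.!?]*\)\([!?]\)\([])}]*\)$/\1\2\3__\2__/'
}

printf '%s\n' '(Is it so?) Yes!' | mark_qe
```

For this input the output consists of the two lines (Is it so?)__?__ and  Yes!__!__, the second one beginning with the blank that was matched as \4. The closing parenthesis stays attached to the first sentence because it is matched by the third tagged entity before the marker is appended.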
Line 4 of this listing is broken after \. representing the period (and not an arbitrary character) ending the sentence. The last two substitution rules in lines 4-6 for marking sentences that end in a period are different from those for “?” and “!”, but the principles are similar. In line 4, the range-sequence [^]})'\''".!?] followed by []})'\''".!?]* defines admissible strings before the end of a sentence. The range-sequence represents at least one non-closing character, followed by a possible sequence of allowed closing characters. The closing period is then explicitly encoded as \. in the pattern. The range-sequence []})'\''"]* (closing brackets) followed by [^]})'\''".!?] (non-closing character) defines admissible strings after the end of a sentence. Line 6 handles the case when the sentence end coincides with the end of the line.
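The period rules can be sketched in the same way. This is again a simplified version under the assumption that quote characters may be left out of the ranges; the function name mark_period is hypothetical and not part of the original pipe.

```shell
# Simplified sketch of the period rules (quotes omitted from the ranges).
# Rule 1: period plus optional closing brackets, followed by more text;
#         the fixed marker __.__ is appended and \2 is moved to a new line.
# Rule 2: period plus optional closing brackets at the end of the line;
#         & stands for the whole match, to which the marker is appended.
mark_period() {
  sed -e 's/\([^])}.!?][])}.!?]*\.[])}]*\)\([^])}.!?]\)/\1__.__\
\2/g' \
      -e 's/[^])}.!?][])}.!?]*\.[])}]*$/&__.__/'
}

printf '%s\n' 'It rains (a lot). We stay.' | mark_period
```

For this input the output consists of the two lines It rains (a lot).__.__ and  We stay.__.__. Since every period followed by an ordinary character is treated as a sentence end, periods inside numbers or abbreviations would be split as well, which is presumably why they are hidden earlier in the pipe.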
Next follows an awk program in the pipe, which is shown below:
awk 'BEGIN { ORS=" " }
{ print }