every line coming from the original source file $1 which contains at least one four-
letter word) is appended via N to the preceding line in the pipe containing solely the
corresponding line number. 2) The embedded newline character (encoded as \n) is
removed and the two united lines are printed as one.
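The two steps just described can be sketched as follows. This is only an illustration under stated assumptions: the program being explained is not reproduced in this section, so the helper name number_matching_lines and its parameters are hypothetical, the pattern is passed in as an argument instead of guessing the original four-letter-word pattern, and the embedded newline is replaced by a single blank so that the number and the text stay separated.

```shell
#!/bin/sh
# number_matching_lines PATTERN FILE   (hypothetical helper, not from the text)
# PATTERN is a sed basic regular expression and must not contain "/".
#
# Stage 1: for every line matching PATTERN, print its line number (=)
#          on a line of its own, followed by the line itself (p).
# Stage 2: N appends the text line to the line holding only the number;
#          the embedded newline \n is then replaced by one blank
#          (an assumption; the text only says the newline is removed).
number_matching_lines() {
    sed -n "/$1/{
=
p
}" "$2" | sed 'N
s/\n/ /'
}
```

Called as number_matching_lines 'boat' someFile, the sketch would print every line of someFile that contains "boat", preceded by its line number.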
Application (tagging grammatical entities): The following program shows the
first serious linguistic application of the techniques introduced so far. It marks
all determiners in an input text file symbolized by $1. We shall refer to it as
markDeterminers.
1: #!/bin/sh
2: # markDeterminers
3: addBlanks $1 | sed 's/\.\.\./_TRIPLE_PERIOD_/g
4: s/\([[{(< '"_]\)\([Tt]h[eo]se\)\([]})> '\''",?!_.]\)/
\1_DETERMINER_\2_\3/g
5: s/\([[{(< '"_]\)\([Tt]his\)\([]})> '\''",?!_.]\)/
\1_DETERMINER_\2_\3/g
6: s/\([[{(< '"_]\)\([Tt]hat\)\([]})> '\''",?!_.]\)/
\1_DETERMINER_\2_\3/g
7: s/\([[{(< '"_]\)\([Tt]he\)\([]})> '\''",?!_.]\)/
\1_DETERMINER_\2_\3/g
8: s/\([[{(< '"_]\)\([Aa]n\)\([]})> '\''",?!_.]\)/
\1_DETERMINER_\2_\3/g
9: s/\([[{(< '"_]\)\([Aa]\)\([]})> '\''",?!_]\)/
\1_DETERMINER_\2_\3/g
10: s/\([[{(< '"_]\)\([Aa]\)\(\.[^A-Za-z]\)/\1_DETERMINER_\2_\3/g
11: s/_TRIPLE_PERIOD_/.../g' - | adjustBlankTabs -
In the above listing, lines 4-9 are broken at the / that separates the pattern from
the replacement in each sed substitution command. The break is for typesetting only
and does not represent correct code: each of these substitution commands must be
written on a single line, as shown, in principle, by line 10.
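As an illustration, the rule for “the” (line 7) can be written unbroken on one line as sketched below. The function name mark_the is hypothetical. As an assumption made for this sketch, the left-hand range is simplified to [[{(< "_], that is, without the single quote character shown in the listing, and the input is padded with blanks by hand because addBlanks is not defined in this section.

```shell
#!/bin/sh
# mark_the: hypothetical one-rule excerpt of markDeterminers.
# Pattern and replacement stay together on ONE line, separated by "/".
# Assumption: left-hand range simplified to [[{(< "_].
mark_the() {
    sed 's/\([[{(< "_]\)\([Tt]he\)\([]})> '\''",?!_.]\)/\1_DETERMINER_\2_\3/g'
}

# Input padded with a leading and a trailing blank by hand.
printf ' She saw the boat. \n' | mark_the
# prints: " She saw _DETERMINER_the_ boat. " (without the double quotes)
```

Note that “She” is left alone: the pattern requires one of the characters of the left-hand range immediately in front of [Tt]he.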
Explanation of the central sed program: The first substitution command (line
3) replaces the triple period, as in “Bill bought...a boat and a car.”, by the marker
_TRIPLE_PERIOD_. This distinguishes the period in front of “a” in “...a boat” from the
period inside an abbreviation such as “a.s.a.p.” The character preceding a determiner [10]
is encoded to the left of the determiner in every pattern (lines 4-10) as the range
[[{(< '"_], tagged, and reused on the right [11] as \1 in the replacement of the
substitution command. The determiner, which is specified in the middle of every
pattern, is reused as \2. In the output of the above program, it is preceded by the
marker _DETERMINER_ and followed by an underscore character. The non-letter following
a determiner is encoded to the right of the determiner in the first five patterns
(lines 4-8, “those” to “An”) as the range []})> '\''",?!_.], tagged, and reused as \3.
The string '\'' represents a single ' (cf. section 12.3.1). For the determiner “a”, the
period is excluded from the characters that are allowed to follow it: line 9 uses the
range []})> '\''",?!_]. If a period follows the character a, then a non-letter must
follow the period in order for a to represent the determiner “a”. This is encoded as
\.[^A-Za-z] in line 10 of the program. The string
[10] That is, the characters that the authors consider legal to precede a word in text.
[11] That is, in the continuation of the line below, for lines 4-9.
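The interplay of the triple-period marker with the two rules for “a” can be tried out with the reduced sketch below. It is not the complete markDeterminers program: the function name mark_a is hypothetical, only lines 3, 9, 10 and 11 of the listing are kept, the left-hand range is simplified to [[{(< "_] as an assumption, and addBlanks and adjustBlankTabs are left out, so the input is padded with blanks by hand.

```shell
#!/bin/sh
# mark_a: hypothetical reduced version of markDeterminers that keeps
# only the rules for the determiner "a".
# Assumption: left-hand range simplified to [[{(< "_].
mark_a() {
    # 1) protect "..." so that its last period cannot hide a following "a"
    # 2) "a" followed by a non-letter other than a period (cf. line 9)
    # 3) "a" followed by a period and then a non-letter (cf. line 10)
    # 4) restore the triple period
    sed 's/\.\.\./_TRIPLE_PERIOD_/g
s/\([[{(< "_]\)\([Aa]\)\([]})> '\''",?!_]\)/\1_DETERMINER_\2_\3/g
s/\([[{(< "_]\)\([Aa]\)\(\.[^A-Za-z]\)/\1_DETERMINER_\2_\3/g
s/_TRIPLE_PERIOD_/.../g'
}

printf ' Bill bought...a boat, a.s.a.p. \n' | mark_a
# prints: " Bill bought..._DETERMINER_a_ boat, a.s.a.p. " (without the double quotes)
```

In this sketch the “a” after the triple period is tagged because the marker ends in an underscore, which belongs to the left-hand range. The letters “a” in “a.s.a.p.” stay untagged: the first is followed by a period and a letter, and the second is preceded by a period, which is not in the left-hand range.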