Linguistic Computing with UNIX Tools - Natural Language Processing and Text Mining

every line coming from the original source file $1 which contains at least one four-letter word) is appended via N to the preceding line in the pipe, which contains solely the corresponding line number. 2) The embedded newline character (encoded as \n) is removed and the two joined lines are printed as one.
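The joining step described above can be sketched as follows. This is a minimal illustration, not the program from the text: the function name numberFourLetterLines and the first sed stage, which prints the line number and then the line for every match, are assumptions, as is the input format in which every word is delimited by blanks.

```shell
#!/bin/sh
# Hypothetical sketch: prefix each line that contains a four-letter word
# with its line number. Assumes every word is delimited by blanks.
numberFourLetterLines() {
  # Stage 1 (assumed): for each matching line, print its number (=)
  # on a line of its own, then the line itself (p).
  sed -n '/ [A-Za-z][A-Za-z][A-Za-z][A-Za-z] /{
=
p
}' |
  # Stage 2 (as described in the text): N appends the next line to the
  # line holding the number; the embedded newline is then removed.
  sed 'N
s/\n//'
}

printf '%s\n' ' we saw a boat ' ' no hit ' ' they left ' | numberFourLetterLines
# prints "1 we saw a boat " and "3 they left "
```

The second line of the sample input is skipped because neither "no" nor "hit" has four letters.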

Application (tagging grammatical entities): The following program shows the first serious linguistic application of the techniques introduced so far. It marks all determiners in an input text file symbolized by $1. We shall refer to it as markDeterminers.

 1: #!/bin/sh
 2: # markDeterminers
 3: addBlanks $1 | sed 's/\.\.\./_TRIPLE_PERIOD_/g
 4: s/\([[{(< '\''"_]\)\([Tt]h[eo]se\)\([]})> '\''",?!_.]\)/
        \1_DETERMINER_\2_\3/g
 5: s/\([[{(< '\''"_]\)\([Tt]his\)\([]})> '\''",?!_.]\)/
        \1_DETERMINER_\2_\3/g
 6: s/\([[{(< '\''"_]\)\([Tt]hat\)\([]})> '\''",?!_.]\)/
        \1_DETERMINER_\2_\3/g
 7: s/\([[{(< '\''"_]\)\([Tt]he\)\([]})> '\''",?!_.]\)/
        \1_DETERMINER_\2_\3/g
 8: s/\([[{(< '\''"_]\)\([Aa]n\)\([]})> '\''",?!_.]\)/
        \1_DETERMINER_\2_\3/g
 9: s/\([[{(< '\''"_]\)\([Aa]\)\([]})> '\''",?!_]\)/
        \1_DETERMINER_\2_\3/g
10: s/\([[{(< '\''"_]\)\([Aa]\)\(\.[^A-Za-z]\)/\1_DETERMINER_\2_\3/g
11: s/_TRIPLE_PERIOD_/.../g' - | adjustBlankTabs -

In the above listing, lines 4-9 are broken at the / that separates the pattern from the replacement in each of the sed substitution commands listed. Written this way, the commands do not represent correct code: each one must stand on a single line. Line 10 shows, in principle, the “correct” form of the listing for any of these sed substitution commands.
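As a minimal sketch of what this note implies, the program can be written with every substitution command on one line. The function name markDeterminersSketch is hypothetical. Because addBlanks and adjustBlankTabs are not shown in this section, they are replaced here by simple stand-ins that add and remove one blank at each end of a line; this is an assumption about their effect, not their actual definition. The single quote in the left range is written as '\'' so that the shell quoting of the sed script stays balanced.

```shell
#!/bin/sh
# Hypothetical single-line-per-command variant of markDeterminers.
# Reads text on standard input and writes the marked text to standard output.
markDeterminersSketch() {
  # Stand-in for addBlanks (assumption): pad each line with one blank
  # on both sides so that words at line boundaries have neighbours.
  sed 's/^/ /
s/$/ /' |
  # Central sed program: protect "...", mark determiners, restore "...".
  sed 's/\.\.\./_TRIPLE_PERIOD_/g
s/\([[{(< '\''"_]\)\([Tt]h[eo]se\)\([]})> '\''",?!_.]\)/\1_DETERMINER_\2_\3/g
s/\([[{(< '\''"_]\)\([Tt]his\)\([]})> '\''",?!_.]\)/\1_DETERMINER_\2_\3/g
s/\([[{(< '\''"_]\)\([Tt]hat\)\([]})> '\''",?!_.]\)/\1_DETERMINER_\2_\3/g
s/\([[{(< '\''"_]\)\([Tt]he\)\([]})> '\''",?!_.]\)/\1_DETERMINER_\2_\3/g
s/\([[{(< '\''"_]\)\([Aa]n\)\([]})> '\''",?!_.]\)/\1_DETERMINER_\2_\3/g
s/\([[{(< '\''"_]\)\([Aa]\)\([]})> '\''",?!_]\)/\1_DETERMINER_\2_\3/g
s/\([[{(< '\''"_]\)\([Aa]\)\(\.[^A-Za-z]\)/\1_DETERMINER_\2_\3/g
s/_TRIPLE_PERIOD_/.../g' |
  # Stand-in for adjustBlankTabs (assumption): drop the padding again.
  sed 's/^ //
s/ $//'
}

printf '%s\n' 'Bill bought...a boat and a car.' | markDeterminersSketch
# prints: Bill bought..._DETERMINER_a_ boat and _DETERMINER_a_ car.
printf '%s\n' 'Send it a.s.a.p. please' | markDeterminersSketch
# prints: Send it a.s.a.p. please
```

The second call leaves the line unchanged: the "a" that opens "a.s.a.p." is followed by a period and then a letter, and the later "a" is preceded by a period, so no pattern applies.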

Explanation of the central sed program: The first substitution command (line 3) replaces a triple period, as in “Bill bought...a boat and a car.”, by the marker _TRIPLE_PERIOD_. This distinguishes the period in front of “a” in “...a boat” from an abbreviation such as “a.s.a.p.” The character preceding a determiner[10] is encoded to the left of the determiner in every pattern (lines 4-10) as the range [[{(< '"_], tagged, and reused on the right[11] as \1 in the replacement of the substitution command. The determiner, which is specified in the middle of every pattern, is reused as \2. In the output of the above program it is preceded by the marker _DETERMINER_ and followed by an underscore character. The non-letter following a determiner is encoded to the right of the determiner in the first five patterns (lines 4-8, “those” to “An”) as the range []})> '\''",?!_.], tagged, and reused as \3. The string '\'' represents a single ' (cf. section 12.3.1). For the determiner “a”, the period is excluded from the characters that are allowed to follow it, which gives the range []})> '\''",?!_] in line 9. If a period follows the character a, then a non-letter must come next for a to represent the determiner “a”. This is encoded as \.[^A-Za-z] in line 10 of the program. The string

[10] This means characters that the authors consider legal to precede a word in text.

[11] That is, in the continuation of the line below, for lines 4-9.
