encoded as \.[^A-Za-z] is tagged and reused as \3. After the tagging is completed,
the triple period is restored in line 11. For example, the string "A liberal?" is
replaced by the program with "_DETERMINER_A_ liberal?".
Application (grammatical analysis): A collection of tagging programs such as
markDeterminers can be used for elementary grammatical analysis and search for
grammatical patterns in text. If a file contains only one entire sentence per line, then
a pattern /_DETERMINER_.*_DETERMINER_/ would find all sentences that contain at
least two determiners.
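As a minimal sketch of such a search, the pattern can be used as a sed address with printing restricted to matching lines. The function name two_determiners and the sample sentences are assumptions made for illustration; the tag format follows the "_DETERMINER_A_" example above.

```shell
#!/bin/sh
# two_determiners (hypothetical name): reads tagged text on stdin, one
# sentence per line, and prints only the lines with at least two tags.
two_determiners() {
    sed -n '/_DETERMINER_.*_DETERMINER_/p'
}

# Sample input (assumed): only the second sentence has two determiners.
printf '%s\n' \
    '_DETERMINER_A_ liberal?' \
    '_DETERMINER_The_ cat saw _DETERMINER_a_ dog.' |
    two_determiners
```

The option -n suppresses the default output of sed, so only lines selected by the address are printed.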
Note that the substitution s/_[A-Za-z_]*_//g eliminates everything that has
been tagged thus far.
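The effect of this substitution can be checked on a single tagged sentence. The sketch below assumes the tag format "_DETERMINER_word_" shown earlier; the function name strip_tagged and the sample sentence are hypothetical.

```shell
#!/bin/sh
# strip_tagged (hypothetical name): removes every tagged item, that is,
# the tag together with the word enclosed in it.
strip_tagged() {
    sed 's/_[A-Za-z_]*_//g'
}

printf '%s\n' '_DETERMINER_The_ cat saw _DETERMINER_a_ dog.' | strip_tagged
```

Because the bracket expression [A-Za-z_] does not include a blank, each match stops at the end of one tagged item. The blanks around the removed items stay in the line.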
12.3.4 Turning a Text File into a Program
One can use a sed program to create a program from a file containing data in a
convenient format (e.g., a list of words). Such an action can precede the use of the
generated program, i.e., one invokes the sed program and the generated program
separately. Alternatively, the generation of a program and its subsequent use can be
part of a single UNIX command. The latter possibility is outlined next.
Application (removing a list of unimportant words): Suppose that one has a
file that contains a list of words that are “unimportant” for some reason. Suppose,
in addition, that one wants to eliminate these unimportant words from a second
text file. For example, function words such as the, a, an, if, then, and, or, ... are
usually the most frequent words but carry less semantic load than content words.
See [8, pp. 219-220] for a list of frequent words. The following program generates
a sed program $1.sed out of a file $1 that contains a list of words deemed to
be “unimportant.” The generated script $1.sed eliminates the unimportant words
from a second file $2. We shall refer to the following program as eliminateList.
For example, eliminateList unimportantWords largeTextFile removes words in
$1 = unimportantWords from $2 = largeTextFile .
1: #!/bin/sh
2: # eliminateList
3: # First argument $1 is file of removable material.
4: # Second argument $2 is the file from which material is removed.
5: leaveOnlyWords $1 | oneItemPerLine - |
6: sed 's/[./-]/\\&/g
7: s/.*/s\/\\([^A-Za-z]\\)&\\([^A-Za-z]\\)\/\\1\\2\/g/
8: ' >$1.sed
9: addBlanks $2 | sed -f $1.sed - | adjustBlankTabs -
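The generation step in lines 6-8 and the application step in line 9 can be tried in isolation. The following is a minimal sketch, not the program above: it assumes that the word list already holds one word per line, and the functions make_script, pad_blanks, and squeeze_blanks are hypothetical stand-ins (the last two only approximate what addBlanks and adjustBlankTabs are expected to do). The first substitution uses | as its delimiter, since some sed implementations may end the pattern at a / even inside a bracket expression.

```shell
#!/bin/sh
# make_script (hypothetical name): reads words on stdin and writes one sed
# substitution command per word, following lines 6-7 of the listing.
make_script() {
    sed 's|[./-]|\\&|g
s/.*/s\/\\([^A-Za-z]\\)&\\([^A-Za-z]\\)\/\\1\\2\/g/'
}

# pad_blanks (assumed stand-in for addBlanks): doubles every blank and adds
# one at each end of the line, so every word has a blank on both sides.
pad_blanks() {
    sed 's/ /  /g
s/^/ /
s/$/ /'
}

# squeeze_blanks (assumed stand-in for adjustBlankTabs): collapses runs of
# blanks into one and trims both ends of the line.
squeeze_blanks() {
    sed 's/  */ /g
s/^ //
s/ $//'
}

words=$(mktemp)
script=$(mktemp)
printf '%s\n' the a built-in > "$words"

make_script < "$words" > "$script"
cat "$script"    # one generated substitution command per word

printf '%s\n' 'the cat found a built-in function' |
    pad_blanks | sed -f "$script" | squeeze_blanks

rm -f "$words" "$script"
```

For the word built-in, the generated command is s/\([^A-Za-z]\)built\-in\([^A-Za-z]\)/\1\2/g. It deletes the word only when a non-letter stands on each side of it, and it keeps those two neighboring characters.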
Explanation: Line 5 in the program isolates words in the file $1 and feeds them
(one word per line) into the first sed program starting in line 6. In the first sed
program in lines 6-7 the following is done: 1) Periods, slashes (“A/C”), and hyphens
are each preceded by a backslash character. Here, the sed special character & is used, which
reproduces in the replacement of a sed substitution command what was matched
in the pattern of that command (here: the range [./-]). For example, the string
built-in is replaced by built\-in. This is done since periods and hyphens are