encoded as \.[^A-Za-z] is tagged and reused as \3. After the tagging is completed,
the triple period is restored in line 11. For example, the string "A liberal?" is
replaced by the program with "_DETERMINER_A_ liberal?".
Application (grammatical analysis): A collection of tagging programs such as
markDeterminers can be used for elementary grammatical analysis and to search for
grammatical patterns in text. If a file contains exactly one complete sentence per
line, then the pattern /_DETERMINER_.*_DETERMINER_/ finds all sentences that contain
at least two determiners. Note that the substitution s/_[A-Za-z_]*_//g eliminates
everything that has been tagged thus far.
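The two commands just mentioned can be tried on a small sample. The following is a minimal sketch, not part of the original programs: the sample sentences, the temporary directory, and the variable names two_determiners and untagged are assumptions made for illustration, and the input is assumed to be tagged already in the _DETERMINER_word_ style shown above.

```sh
#!/bin/sh
# Hypothetical sample: one sentence per line, already tagged in the
# _DETERMINER_word_ style (the sentences themselves are made up).
workdir=$(mktemp -d)
cat > "$workdir/tagged.txt" <<'EOF'
_DETERMINER_The_ cat saw _DETERMINER_a_ dog.
Cats sleep.
_DETERMINER_A_ liberal?
EOF

# Print only the sentences that contain at least two determiner tags.
two_determiners=$(sed -n '/_DETERMINER_.*_DETERMINER_/p' "$workdir/tagged.txt")
printf '%s\n' "$two_determiners"

# Delete every tagged item: the tag and the word enclosed in it.
untagged=$(sed 's/_[A-Za-z_]*_//g' "$workdir/tagged.txt")
printf '%s\n' "$untagged"

rm -r "$workdir"
```

Only the first sample sentence is printed by the search. Because the tagged word sits between the underscores, the substitution removes the word together with its tag, so "_DETERMINER_A_ liberal?" becomes " liberal?".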
12.3.4 Turning a Text File into a Program
One can use a sed program to create a program from a file containing data in a
convenient format (e.g., a list of words). Such an action can precede the use of the
generated program, i.e., one invokes the sed program and the generated program
separately. Alternatively, the generation of a program and its subsequent use can be
part of a single UNIX command. The latter possibility is outlined next.
Application (removing a list of unimportant words): Suppose that one has a
file that contains a list of words that are “unimportant” for some reason. Suppose,
in addition, that one wants to eliminate these unimportant words from a second
text file. For example, function words such as the, a, an, if, then, and, or, ... are
usually the most frequent words but carry less semantic load than content words.
See [8, pp. 219-220] for a list of frequent words. The following program generates
a sed program $1.sed out of a file $1 that contains a list of words deemed to
be “unimportant.” The generated script $1.sed eliminates the unimportant words
from a second file $2. We shall refer to the following program as eliminateList.
For example, eliminateList unimportantWords largeTextFile removes the words in
$1=unimportantWords from $2=largeTextFile.
1: #!/bin/sh
2: # eliminateList
3: # First argument $1 is file of removable material.
4: # Second argument $2 is the file from which material is removed.
5: leaveOnlyWords $1 | oneItemPerLine - |
6: sed 's/[./-]/\\&/g
7: s/.*/s\/\\([^A-Za-z]\\)&\\([^A-Za-z]\\)\/\\1\\2\/g/
8: ' >$1.sed
9: addBlanks $2 | sed -f $1.sed - | adjustBlankTabs -
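To make the generated script concrete, here is a rough sketch that runs the two sed commands from lines 6-7 on a tiny word list and then applies the result. It is an illustration under stated assumptions, not the program above: words.txt and text.txt are hypothetical sample files, the word list is assumed to hold one word per line already (so leaveOnlyWords and oneItemPerLine are skipped), and addBlanks and adjustBlankTabs are replaced by simple stand-ins whose behavior (padding each line with blanks, then squeezing and trimming blanks) is only a guess at what those helpers do.

```sh
#!/bin/sh
# Sketch only: words.txt and text.txt are hypothetical sample files, and the
# helper programs of eliminateList are replaced by simple stand-ins.
workdir=$(mktemp -d)
printf '%s\n' the a built-in > "$workdir/words.txt"
printf '%s\n' 'the cat saw a built-in trap' > "$workdir/text.txt"

# Same two sed commands as lines 6-7 of eliminateList: escape ., / and -,
# then wrap each word in a substitution command of its own.
sed 's/[./-]/\\&/g
s/.*/s\/\\([^A-Za-z]\\)&\\([^A-Za-z]\\)\/\\1\\2\/g/
' "$workdir/words.txt" > "$workdir/words.txt.sed"
generated=$(cat "$workdir/words.txt.sed")
printf '%s\n' "$generated"

# Assumed stand-in for addBlanks: pad each line with a blank at both ends.
# Assumed stand-in for adjustBlankTabs: squeeze blanks and trim the ends.
cleaned=$(sed 's/^/ /; s/$/ /' "$workdir/text.txt" |
  sed -f "$workdir/words.txt.sed" |
  sed 's/  */ /g; s/^ //; s/ $//')
printf '%s\n' "$cleaned"

rm -r "$workdir"
```

With GNU sed, the generated line for built-in should read s/\([^A-Za-z]\)built\-in\([^A-Za-z]\)/\1\2/g, and the sample text should be reduced to "cat saw trap". The padding step matters because each generated command requires a non-letter on both sides of the word, which a word at the very start or end of a line would otherwise lack.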
Explanation: Line 5 in the program isolates words in the file $1 and feeds them
(one word per line) into the first sed program starting in line 6. In the first sed
program in lines 6-7 the following is done: 1) Periods, slashes (“A/C”), or hyphens
are preceded by a backslash character. Here, the sed-special character & is used,
which reproduces in the replacement of a sed substitution command what was matched
in the pattern of that command (here: the range [./-]). For example, the string
built-in is replaced by built\-in. This is done since periods and hyphens are