The first awk command, { number[$0]++ }, increments the counter variable number[string] by 1 whenever string is the content of the current line ($0). For every string that occurs, the counter number[string] is automatically initialized to 0 on first use. The character sequence ++ means “increase by one.” If every line contains a single word, then at the end of the file the counter variable number[word] contains the number of occurrences of that particular word. The awk command in the last line prints the strings that were encountered, together with their numbers of occurrences, at the END of processing. As in Section 12.2.1, the trailing $1 stands for the input file.
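The counting logic described above can be tried on a few lines of input. The following is a minimal sketch, not the original countFrequencies listing: the function name count_lines and the sample input are assumptions made for illustration, and the final sort is added only because awk does not guarantee any traversal order for the array.

```shell
# count_lines is a hypothetical helper that mirrors the described logic:
# one counter per distinct input line, printed at END.
count_lines() {
    awk '{ number[$0]++ }
         END { for (string in number) print string, number[string] }' "$@"
}

# Three input lines, two of them identical.
printf 'the\ncat\nthe\n' | count_lines | sort
# prints:
# cat 1
# the 2
```

The counter number["the"] is never declared; awk creates it with value 0 the first time the line "the" is seen, and ++ then raises it to 1.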
12.2.3 Using the Pipe Mechanism
Combining lowerCaseStrings and countFrequencies, we create a UNIX command wordFrequencyCount as follows:
#!/bin/sh
# wordFrequencyCount
lowerCaseStrings $1 >intermediateFile
countFrequencies intermediateFile
The command wordFrequencyCount is used as
wordFrequencyCount tFile
where tFile is any plain-text file.
Explanation: lowerCaseStrings $1 applies lowerCaseStrings to the first argument (string, filename) after wordFrequencyCount (cf. Section 12.2.1). The resulting output is then written (redirected) via > to the file intermediateFile, which is created if it does not exist and overwritten if it does (see footnote 7). intermediateFile remains in existence after wordFrequencyCount terminates and can be used further. Finally, countFrequencies intermediateFile applies the word count to the intermediate result.
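The difference between overwriting and appending can be checked directly. The sketch below is illustrative only: the function name demo_redirect is hypothetical, and it assumes the mktemp utility is available to create a scratch file so that no existing file is touched.

```shell
# demo_redirect is a hypothetical helper showing > versus >>.
demo_redirect() {
    tmp=$(mktemp)            # scratch file (assumes mktemp is installed)
    echo first  >  "$tmp"    # creates the file content
    echo second >  "$tmp"    # > overwrites: "first" is gone
    echo third  >> "$tmp"    # >> appends after "second"
    cat "$tmp"
    rm -f "$tmp"
}

demo_redirect
# prints:
# second
# third
```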
Instead of using intermediateFile, one can let the UNIX system handle the transfer (piping) of intermediate results from one program to another. The following sh program is equivalent to the first listing of wordFrequencyCount, except that the intermediate result is not stored anywhere:
#!/bin/sh
# wordFrequencyCount (2nd implementation)
lowerCaseStrings $1 | countFrequencies -
Explanation: The pipe symbol | causes the transfer (piping) of intermediate results from lowerCaseStrings to countFrequencies. The pipe symbol | or the string |\ can terminate a line, in which case the pipe is continued into the next line. The trailing hyphen symbolizes the virtual file (in UNIX jargon called “standard input”) that is the input file for countFrequencies.
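Both points of the explanation, a pipe that ends a line and the hyphen as a name for standard input, can be seen in a self-contained sketch. Here tr is only a stand-in for lowerCaseStrings, whose actual listing is not reproduced in this section, and the function name lower_and_count is an assumption.

```shell
# lower_and_count is a hypothetical stand-in for the two-stage pipeline.
lower_and_count() {
    # The line ends in |, so the pipeline continues on the next line.
    tr '[:upper:]' '[:lower:]' |
    # The trailing - tells awk to read standard input as its input file.
    awk '{ number[$0]++ }
         END { for (s in number) print s, number[s] }' -
}

printf 'The\nthe\nCat\n' | lower_and_count | sort
# prints:
# cat 1
# the 2
```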
We observe that the output of countFrequencies is not sorted. The reader may want to replace the last line in the program by
lowerCaseStrings $1 | countFrequencies - | sort -
employing the UNIX command sort as the final step in the processing.
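Plain sort orders the lines alphabetically by string. If a ranking by frequency is wanted instead, one possible variation is to sort numerically on the second field. This is a sketch beyond what the text prescribes: the function name rank_by_count is hypothetical, and it assumes each line holds exactly one word followed by its count.

```shell
# rank_by_count is a hypothetical helper: highest count first.
# -k2,2 selects the second field, n compares numerically, r reverses.
rank_by_count() {
    sort -k2,2nr "$@"
}

printf 'b 1\na 3\nc 2\n' | rank_by_count
# prints:
# a 3
# c 2
# b 1
```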
Additional information about programming sh and the UNIX commands mentioned above can be obtained using the man sh command as well as by consulting [30].
Footnote 7: Using >> instead of > appends to an existing file.