The first awk command { number[$0]++ } increments the counter variable
number[ string ] by 1, where string is the content of the current line (= $0 ).
The first time a string occurs, its counter number[ string ] is automatically
initialized to 0. The character sequence ++ means “increase by one.” If every
line contains a single word, then at the end of the file, the counter variable
number[ word ] contains the number of occurrences of that particular word. The
awk command in the last line prints, at the END of processing, every string
that was encountered together with its number of occurrences. As in
Section 12.2.1, the trailing $1 stands for the input file.
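The counting step described above can be tried in isolation. The following is a minimal sketch, not the original listing: the awk program text is reconstructed from the explanation, and the function name count_lines and the three-line sample input are assumptions made for illustration.

```shell
#!/bin/sh
# Sketch of the counting step. The awk program is reconstructed from the
# description in the text; count_lines is a hypothetical name.
count_lines() {
    # number[$0]++ creates number[line] on first use (starting at 0) and
    # adds 1 to it. The END block runs once, after the last input line.
    awk '{ number[$0]++ }
         END { for (string in number) print string, number[string] }' "$@"
}

# Three input lines, two of them identical. The order in which awk walks
# through the array is unspecified, so sort is used for a stable display.
result=$(printf 'the\ncat\nthe\n' | count_lines | sort)
printf '%s\n' "$result"
```

With this input, the sketch should report cat once and the twice.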
12.2.3 Using the Pipe Mechanism
Combining lowerCaseStrings and countFrequencies , we create a UNIX command
wordFrequencyCount as follows:
#!/bin/sh
# wordFrequencyCount
lowerCaseStrings $1 >intermediateFile
countFrequencies intermediateFile
The command wordFrequencyCount is used as wordFrequencyCount tFile where
tFile is any plain-text file.
Explanation: lowerCaseStrings $1 applies lowerCaseStrings to the first argument
(string, filename) after wordFrequencyCount (cf. Section 12.2.1). The resulting
output is then written (redirected) via > to the file intermediateFile, which is
created if it does not exist and overwritten if it does (see footnote 7).
intermediateFile remains in existence after wordFrequencyCount terminates and
can be used further. Finally, countFrequencies intermediateFile applies the
word count to the intermediate result.
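The behavior of the redirection operators can be checked with a short experiment. This is a sketch under assumptions: the scratch file is created with mktemp, which is widely available but not required by every UNIX system, and the variable names are chosen for illustration only.

```shell
#!/bin/sh
# Demonstrates > (create or overwrite) versus >> (append).
tmpfile=$(mktemp)            # scratch file; its name is chosen by the system

echo "first"  > "$tmpfile"   # writes the file, truncating it if it exists
echo "second" > "$tmpfile"   # overwrites: "first" is gone
overwritten=$(cat "$tmpfile")

echo "third" >> "$tmpfile"   # appends below "second"
appended=$(cat "$tmpfile")

printf '%s\n' "$appended"
rm -f "$tmpfile"             # unlike intermediateFile, clean up afterwards
```

After the second echo the file holds only second; after the third it holds second and third on separate lines.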
Instead of using intermediateFile, one can let the UNIX system handle the
transfer (piping) of intermediate results from one program to another. The
following sh program produces the same output as the first listing of
wordFrequencyCount, except that the intermediate result is not stored anywhere:
#!/bin/sh
# wordFrequencyCount (2nd implementation)
lowerCaseStrings $1 | countFrequencies -
Explanation: The pipe symbol | causes the transfer (piping) of intermediate
results from lowerCaseStrings to countFrequencies. The pipe symbol | or the
string |\ can terminate a line, in which case the pipe is continued on the next
line. The trailing hyphen denotes the virtual file (in UNIX jargon called
“standard input”) that is the input file for countFrequencies.
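The piped version can be run on its own with stand-ins for the two commands. In the sketch below, the shell functions lowerCaseStrings and countFrequencies are hypothetical replacements whose bodies are assumptions (lower-casing plus one word per line, and the awk counter described earlier); the originals are defined in the preceding sections. The sample sentence and the use of mktemp are also assumptions.

```shell
#!/bin/sh
# Hypothetical stand-ins for the two commands of the text, written as shell
# functions so that this sketch runs without the original scripts.
lowerCaseStrings() {
    # Assumed behavior: lower-case the file and put one word on each line.
    tr 'A-Z' 'a-z' < "$1" | tr -cs 'a-z' '\n'
}
countFrequencies() {
    # "$1" is "-" below, which awk treats as standard input.
    awk '{ number[$0]++ }
         END { for (string in number) print string, number[string] }' "$1"
}

tFile=$(mktemp)
printf 'The cat saw the dog.\n' > "$tFile"

# A line that ends in | continues the pipeline on the next line.
result=$(lowerCaseStrings "$tFile" |
         countFrequencies -)
printf '%s\n' "$result"
rm -f "$tFile"
```

No intermediate file is written: the words flow directly from the first function into awk. The four output lines may appear in any order.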
We observe that the output of countFrequencies is not sorted. The reader may
want to replace the last line in the program by
lowerCaseStrings $1 | countFrequencies - | sort -
employing the UNIX command sort as the final step in the processing.
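The effect of the final sort step can be illustrated as follows. This sketch assumes that countFrequencies prints one "word count" pair per line, which is not confirmed by the text shown here; the sample counts are invented for illustration. It also shows, as an optional variation, how sort can order the lines by count instead of by word.

```shell
#!/bin/sh
# Assumed output format of countFrequencies: "word count", one pair per line.
counts=$(printf 'the 2\ncat 1\nsaw 3\n')

# Plain sort orders the lines alphabetically, that is, by word.
by_word=$(printf '%s\n' "$counts" | sort -)

# -k2,2 selects the second field as the key; n compares it numerically and
# r reverses the order, so the most frequent word comes first.
by_count=$(printf '%s\n' "$counts" | sort -k2,2nr -)

printf '%s\n' "$by_word"
printf '%s\n' "$by_count"
```

Sorting by word yields cat, saw, the; sorting by count yields saw, the, cat.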
Additional information about programming in sh and about the UNIX commands
mentioned above can be obtained with the man sh command and by consulting [30].
7. Using >> instead of > appends to an existing file.