program does. Here, yourTextFile should be a smaller plain-text file in your home
directory, and the output of the command will appear in your shell window. It can
be redirected to a file (cf. Section 12.2.3).
Explanation: The first line #!/bin/sh of lowerCaseStrings tells whatever shell
you are using that the command lines are written for sh, which then executes the file.
The next three lines are comments. By definition, a comment for sh, sed and awk
starts with a # as the first character of a line. Note that comments are not allowed
within a multi-line sed command. The last three lines of lowerCaseStrings are a single
sh command which calls sed and passes two arguments (consecutive strings of characters)
to sed. The first entity following sed is a string of characters delimited by
single-quote characters ', which constitutes the sed program. Within that program,
the sed command y/ABC...Z/abc...z/ maps each character in the string ABC...Z to the
corresponding character in the string abc...z. In the remainder of this paragraph, the
placeholder newline stands for the “invisible” character that causes a line break
when displayed. The sed command s/[^a-z][^a-z]*/\newline/g substitutes (s)
every (g) string of non-letters by a single newline character (encoded as \newline;
see footnote 5). A string of non-letters (to be precise: characters other than
lower-case letters) is thereby encoded as one non(^)-letter [^a-z] followed by an
arbitrary number (*) of non-letters [^a-z]*.
Consequently, s/[^a-z][^a-z]*/\newline/g puts all strings of letters on separate
lines. The trailing $1 is the second argument to sed and stands for the input-file
name (see footnote 6). In the above example, one has $1 = yourTextFile.
Remark: We shall refer to a program similar to lowerCaseStrings that only
contains the first line of the sed program invoking the y operator as mapToLowerCase .
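The two sed commands explained above can be tried on a single line of text. The
following is a minimal sketch, not the original listing: the function name
lower_case_strings and the sample sentence are assumptions made for illustration,
and the alphabet abbreviated as ABC...Z in the text is written out in full.

```shell
#!/bin/sh
# Sketch only: lower_case_strings is a hypothetical name, not the book's script.
# The sed program has two commands:
#   y/.../.../  maps every upper-case letter to its lower-case counterpart
#   s/.../\
#   /g          replaces every run of non-lower-case characters by one newline
# The replacement continues on the next line, so "/g" must start in column 1.
lower_case_strings() {
    sed 'y/ABCDEFGHIJKLMNOPQRSTUVWXYZ/abcdefghijklmnopqrstuvwxyz/
s/[^a-z][^a-z]*/\
/g' "$@"
}

# Sample input (an assumption); with no file argument sed reads standard input.
printf 'The cat, the DOG.\n' | lower_case_strings
```

For the sample line, the output consists of the lines the, cat, the and dog,
followed by one empty line, because the final period is also replaced by a
newline character.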
12.2.2 Implementing a Frequency Count Using awk
The above sed program combined with awk makes it easy to implement a simple
word frequency count. For that purpose, we need a counting program which we shall
name countFrequencies . The listing of countFrequencies shows the typical use
of an array (here: number) in awk (cf. Section 12.4.1.3).
#!/bin/sh
# countFrequencies (Counting strings of characters on lines.)
awk '{ number[$0]++ }
END { for (string in number) { print string, number[string] } }
' $1
Explanation: awk operates on the input (file) line by line. The string/symbol $0
stands for the content of the line that is currently under consideration/manipulation.
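A small test input makes the counting behaviour visible. The sketch below wraps
the awk program from the listing in a shell function; the name count_frequencies,
the sample lines and the final sort are assumptions added for illustration. The
sort is included because awk does not guarantee the order in which
for (string in number) visits the elements of the array.

```shell
#!/bin/sh
# Sketch only: count_frequencies is a hypothetical wrapper around the awk
# program of countFrequencies.
count_frequencies() {
    # number[$0]++ uses the whole input line as the array index, so each
    # distinct line gets its own counter. END runs after the last input line.
    awk '{ number[$0]++ }
    END { for (string in number) { print string, number[string] } }' "$@"
}

# Sample input (an assumption): four lines, one of which occurs twice.
printf 'the\ncat\nthe\ndog\n' | count_frequencies | sort
```

With these four input lines, the output is cat 1, dog 1 and the 2, one pair per
line.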
tion to being readable and writable. Consult the manual page entries (i.e., type
man cd and man chmod in your shell window) for further details about cd and
chmod.
5 \newline is seen as a newline character by sed. A newline alone would be
interpreted as the start of a new command line for sed.
6 More precisely: the trailing $1 is the symbol the Bourne shell uses to
communicate the first string (i.e., argument) after lowerCaseStrings to sed (e.g.,
yourTextFile in lowerCaseStrings yourTextFile becomes sed 'program'
yourTextFile). Arguments to a UNIX command are strings separated by white
space. The Bourne shell provides nine such symbols, $1 ... $9, for referring to
the arguments of a command.