program does. Here, yourTextFile should be a smaller plain-text file in your home directory, and the output of the command will appear in your shell window. It can be redirected to a file (cf. Section 12.2.3).
Explanation: The first line, #!/bin/sh, of lowerCaseStrings tells whatever shell you are using that the command lines are designed for sh, which executes the file. The next three lines are comments. By definition, a comment for sh, sed and awk starts with a # as the first character of a line. Note that comments are not allowed within a multi-line sed command. The last three lines of lowerCaseStrings are one sh command which calls sed and passes it two arguments (successive strings of characters). The first entity following sed is a string of characters delimited by single-quote characters ', which constitutes the sed program. Within that program, the sed command y/ABC...Z/abc...z/ maps the characters in the string ABC...Z to the corresponding characters in the string abc...z. In the remainder of this paragraph, the placeholder newline stands for the “invisible” character that causes a line-break when displayed. The sed command s/[^a-z][^a-z]*/\newline/g substitutes (s) every (g) string of non-letters by a single newline character (encoded as \newline [5]). A string of non-letters (to be precise: of characters other than lower-case letters) is encoded as one non-letter, [^a-z], where ^ expresses the negation, followed by an arbitrary number (*) of non-letters, [^a-z]*. Consequently, s/[^a-z][^a-z]*/\newline/g puts all strings of letters on separate lines. The trailing $1 is the second argument to sed and stands for the input-file name [6]. In the above example, one has $1 = yourTextFile.
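The two sed commands described above can be tried out with a short sketch. It is a reconstruction from the description, not the original listing: the function name lower_case_strings is illustrative, the alphabets abbreviated as ABC...Z and abc...z are written out in full, and "$@" is used so that the function also reads standard input when no file name is given.

```shell
# Sketch only: lower_case_strings is an illustrative name, and the alphabets
# that the text abbreviates as ABC...Z and abc...z are written out in full.
lower_case_strings() {
    # y/.../.../ maps each upper-case letter to its lower-case counterpart.
    # s/[^a-z][^a-z]*/\<newline>/g turns every run of non-letters into one newline.
    # The line "/g" must start in column 1: leading blanks would become part
    # of the replacement text.
    sed 'y/ABCDEFGHIJKLMNOPQRSTUVWXYZ/abcdefghijklmnopqrstuvwxyz/
s/[^a-z][^a-z]*/\
/g' "$@"
}

# Usage: with no file argument, sed reads standard input.
printf 'The cat, the DOG!\n' | lower_case_strings
```

For this input, the words the, cat, the and dog appear on separate lines, followed by one empty line, because the trailing ! is itself a string of non-letters and is replaced by a newline.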
Remark: We shall refer to a program similar to lowerCaseStrings that contains only the first line of the sed program, the one invoking the y operator, as mapToLowerCase.
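A minimal sketch of what such a program might look like, assuming it keeps nothing but the y command. The function name map_to_lower_case is illustrative, and the alphabets are written out in full.

```shell
# Sketch only: map_to_lower_case is an illustrative name.
map_to_lower_case() {
    # Only the y command is applied, so punctuation, blanks and the
    # line structure of the input are left unchanged.
    sed 'y/ABCDEFGHIJKLMNOPQRSTUVWXYZ/abcdefghijklmnopqrstuvwxyz/' "$@"
}

# Usage: with no file argument, sed reads standard input.
printf 'The cat, the DOG!\n' | map_to_lower_case
```

Here the input line is printed as a single line in lower case, with the comma, the blanks and the exclamation mark still in place.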
12.2.2 Implementing a Frequency Count Using awk
The above sed program combined with awk makes it easy to implement a simple word frequency count. For that purpose, we need a counting program, which we shall name countFrequencies. The listing of countFrequencies shows the typical use of an array (here: number) in awk (cf. Section 12.4.1.3).
#!/bin/sh
# countFrequencies (Counting strings of characters on lines.)
awk '{ number[$0]++ }
END { for (string in number) { print string , number[string] }}
' $1
Explanation: awk operates on the input (file) line by line. The string/symbol $0 stands for the content of the line that is currently under consideration/manipulation.
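The line-by-line counting can be tried on a small input with the following sketch. The function name count_frequencies and the use of "$@" are assumptions made for illustration. Since awk does not specify the order in which a for-in loop visits the elements of an array, the output is piped through sort to make it reproducible.

```shell
# Sketch only: count_frequencies is an illustrative name.
count_frequencies() {
    # For every input line, increment the counter stored under that line's
    # content ($0). After the last line (END), print each distinct line
    # together with its count.
    awk '{ number[$0]++ }
END { for (string in number) { print string, number[string] } }' "$@"
}

# Usage: three input lines, two of them identical.
printf 'the\ncat\nthe\n' | count_frequencies | sort
```

With this input, the sorted result consists of the two lines cat 1 and the 2.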
tion to being readable and writable. Consult the manual pages entries (
i.e., type man cd and man chmod in your shell window) for further details about cd and chmod.
[5] \newline is seen as a newline character by sed. A newline alone would be interpreted as the start of a new command line for sed.
[6] More precisely: the trailing $1 is the symbol the Bourne shell uses to communicate the first string (i.e., argument) after lowerCaseStrings to sed (e.g., yourTextFile in lowerCaseStrings yourTextFile becomes sed 'program' yourTextFile). Arguments to a UNIX command are strings separated by white space. Nine arguments, $1 ... $9, can be referenced directly in a Bourne-shell script.