12.4.2 Vectors and Sets
We conclude the section on awk by introducing a standard file format called “vectors.” For files of this format, we show how to define a large variety of operations such as vector addition/subtraction and statistical operations. In addition, we define set operations. Such operations are very useful in numerical/statistical evaluations and for comparing data obtained by the methods presented up to this point in our exposition.
Definition: Vector (Lists of Type/Token Ratios)
Suppose that one represents frequencies of occurrence of particular words or phrases in a file in the following way: every line of the file consists of two parts, where the first part is a word or phrase which may contain digits and the second part (the final field) is a single number which represents, and will be called, the frequency. A file in this format will be called a vector (list of type/token ratios). An example of an entry of a vector is given by
limit 55
Mathematically speaking, such a file of word/phrase frequencies is a vector over the free base of character strings [20, p. 13]. The program countFrequencies listed in Section 12.2.2 generates vectors.
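The definition above can be checked mechanically. The following is a minimal sketch, not part of the original programs: it assumes that "a single number" means an optionally signed integer or decimal, and it uses made-up sample data held in the shell variable sample. It lists every line that is not a valid vector entry.

```shell
#!/bin/sh
# Hypothetical sample data: three valid vector entries and one invalid line.
sample='limit 55
upper limit 12
the 2nd derivative 7
no-frequency-here'

# A valid entry has at least two fields, and its final field is a number.
# Assumption: a "number" is an optionally signed integer or decimal.
# Every line that violates this is printed with its line number.
bad=$(printf '%s\n' "$sample" |
  awk 'NF < 2 || $NF !~ /^[-+]?[0-9]+(\.[0-9]+)?$/ { print NR ": " $0 }')

printf 'invalid entries:\n%s\n' "$bad"
```

Note that digits inside the word/phrase part (as in "the 2nd derivative") are harmless, since only the final field is tested.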
Vector Operations
In this section, we show how to implement vector operations using awk.
Application (vector addition): The next program vectorAddition implements vector addition. If aFile and bFile are vectors, then vectorAddition is used as cat aFile bFile | vectorAddition -. The UNIX command cat aFile bFile concatenates the files aFile and bFile, with the content of aFile first. vectorAddition can be used, e.g., to measure the cumulative progress of students with regard to vocabulary use.
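#!/bin/sh
# vectorAddition: sum the frequencies of identical words/phrases
adjustBlankTabs $1 |
# Save the last field in n, blank it out, and accumulate n under the rest of the line.
# END and its opening brace must be on the same line.
awk 'NF>1 { n=$(NF); $(NF)=""; sum[$0]+=n }
END { for (string in sum) { print string sum[string] } }
' - |
sort -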
Explanation: In the first line of the awk program, the last field in the pattern space $0 is first saved in the variable n before the last field is set to the empty string. Because assigning to a field causes awk to rebuild $0 with its fields joined by single blanks, the altered $0 retains a trailing blank (*). An array sum is generated which uses the altered string $0 in the pattern space as index. Its components sum[$0] are used to sum up (+=) all frequencies n corresponding to the altered string $0. Recall that sum[$0] is initialized to 0 automatically. After the input has been processed in this way (at the END), the for-loop passes through the associative array sum with looping index string. string is printed together with the value of the summation (sum[string]). Note that there is no comma in the print statement in view of (*). Finally, the overall output is sorted into standard lexicographical order using the UNIX command sort.
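To make the explanation concrete, here is a small worked sketch of the same awk logic on two tiny vectors. The sample data and the variable names aVector, bVector, and result are made up for illustration, and the adjustBlankTabs step is left out on the assumption that the input already uses single blanks as separators.

```shell
#!/bin/sh
# Two small vectors (hypothetical sample data).
aVector='limit 55
upper limit 3'
bVector='limit 5
lower limit 2'

# Same awk logic as in vectorAddition, applied to the concatenated input.
# After $(NF)="" the rebuilt $0 ends in a blank, so "limit 55" and
# "limit 5" both yield the index "limit " and their frequencies add up.
result=$(printf '%s\n%s\n' "$aVector" "$bVector" |
  awk 'NF>1 { n=$(NF); $(NF)=""; sum[$0]+=n }
       END  { for (string in sum) { print string sum[string] } }' |
  sort)

printf '%s\n' "$result"
```

The entry for limit appears in both inputs, so the output contains limit 60, followed by lower limit 2 and upper limit 3 unchanged. The trailing blank kept in each index supplies the separator between phrase and sum, which is why print needs no comma.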