12.4.2 Vectors and Sets
We conclude the section on awk by introducing a standard file format called
"vectors." For files in this format, we show how to define a wide variety of
operations, such as vector addition and subtraction and statistical operations.
In addition, we define set operations. Such operations are very useful in
numerical and statistical evaluations and for comparing data obtained by the
methods presented so far.
Definition: Vector (Lists of Type/Token Ratios)
Suppose that one represents the frequencies of occurrence of particular words or
phrases in a file in the following way: every line of the file consists of two
parts, where the first part is a word or phrase, which may contain digits, and the
second part (the final field) is a single number, which represents and will be
called the frequency. A file in this format will be called a vector (list of
type/token ratios). An example of an entry of a vector is
limit 55
Mathematically speaking, such a file of word/phrase frequencies is a vector over the
free base of character strings [20, p. 13]. The program countFrequencies listed in
Section 12.2.2 generates vectors.
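The definition above can be turned into a small format check. The following is
only a sketch: the name check_vector is hypothetical, and it assumes that a
frequency is written as a non-negative integer or decimal number, which the
definition itself does not spell out.

```shell
#!/bin/sh
# check_vector (hypothetical helper): report every line that is not a
# vector entry, i.e. that lacks a word/phrase followed by a final number.
# Assumption: frequencies are non-negative integers or decimals.
check_vector() {
    awk 'NF < 2 || $NF !~ /^[0-9]+(\.[0-9]+)?$/ {
             printf "line %d is not a vector entry: %s\n", NR, $0
             bad = 1
         }
         END { exit bad + 0 }' "$@"
}

# The third line has no frequency, so it is reported.
printf 'limit 55\nupper bound 3\nstray\n' | check_vector ||
    echo "input is not a vector"
```

The function exits with status 0 only if every line passes, so it can guard a
pipeline before any of the vector operations are applied.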
Vector Operations
In this section, we show how to implement vector operations using awk.
Application (vector addition): The next program, vectorAddition, implements
vector addition. If aFile and bFile are vectors, then vectorAddition is used
as cat aFile bFile | vectorAddition - . The UNIX command cat aFile bFile
concatenates the files aFile and bFile, with the content of aFile first.
vectorAddition can be used, e.g., to measure the cumulative progress of students
in vocabulary use.
#!/bin/sh
# vectorAddition: sum the frequencies of identical words/phrases.
# Usage: cat aFile bFile | vectorAddition -
adjustBlankTabs $1 |
awk 'NF>1 { n=$(NF); $(NF)=""; sum[$0]+=n }
END { for (string in sum) { print string sum[string] } }
' - |
sort -
Explanation: In the first line of the awk program, the last field of the pattern
space $0 is saved in the variable n, and the last field is then set to the empty
string, which leaves a trailing blank in $0 (*). The array sum uses the altered
string $0 as its index. Its components sum[$0] accumulate (+=) all frequencies n
that belong to the same altered string $0. Recall that sum[$0] is initialized to 0
automatically. After all input has been processed this way (at the END), the
for-loop passes through the associative array sum with the looping index string.
Each string is printed together with the value of its summation (sum[string]).
Note that there is no comma in the print statement in view of (*): the trailing
blank already separates the two items. Finally, the overall output is sorted into
standard lexicographical order by the UNIX command sort.
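To see the summation at work, the following sketch runs the same awk program on
two tiny vectors. The function name vector_add and the sample entries are
invented for illustration. The adjustBlankTabs step is omitted on the assumption
that the input is already cleanly blank-separated, and LC_ALL=C is set only to
make the sort order reproducible.

```shell
#!/bin/sh
# vector_add (hypothetical stand-in for vectorAddition): the same awk
# program, without the adjustBlankTabs preprocessing step.
vector_add() {
    awk 'NF > 1 { n = $NF; $NF = ""; sum[$0] += n }
         END { for (string in sum) print string sum[string] }' "$@" |
    LC_ALL=C sort
}

# Two small vectors, made up for this example.
a_vector='limit 55
upper bound 3'
b_vector='limit 5
the 120'

# Concatenate both vectors and add them.
result=$(printf '%s\n%s\n' "$a_vector" "$b_vector" | vector_add)
printf '%s\n' "$result"
# prints:
#   limit 60
#   the 120
#   upper bound 3
```

The entry limit occurs in both inputs, so its frequencies 55 and 5 are added,
while the other entries pass through unchanged.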