Finally, there has to be a definition of unique one-to-one relations of lexica for
the languages under consideration. Of course, this has to be done separately for
every pair of languages.
12.5.5 Corpus Exploration and Concordance
The following sh program shows how to generate the surrounding context for words
from a text file $1, i.e., the file name is the first argument $1 to the program. The second
argument to the program, i.e., $2, is supposed to be a strictly positive integer. In
this example, two words are related if there are no more than ($2)-2 other words
in between them.
1: #!/bin/sh
2: # surroundingContext
3: leaveOnlyWords $1 | oneItemPerLine - |
4: mapToLowerCase - | context - $2 |
5: awk '{ for (f=2;f<=NF;f++) { print $1,$(f) } }' |
6: countFrequencies -
Explanation: If a file contains the strings (words) aa, ab, ac, ..., zz and $2=6,
then the first line of output of the code in lines 3-4 (into the pipe continued at line
5) would be aa ab ac ad ae af. That is what the awk program in line 5 would see
as its first line of input. The awk program would then print aa ab, aa ac, ..., aa af on
separate lines in response to that first line of input. The occurrences of such pairs are
then counted by countFrequencies. This defines a matrix M_d of directed context
(an asymmetric relation) between the words in a text. M_d is indexed by pairs of words
(word_1, word_2). If the frequency of the entry in M_d pertaining to (word_1, word_2) is
low, then the two words word_1 and word_2 are distant or unrelated.
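The helper programs used in the listing (leaveOnlyWords, oneItemPerLine, mapToLowerCase, context, countFrequencies) are not reproduced in this section. As a rough illustration only, the same idea can be sketched with standard tools. The function name surrounding_context is hypothetical, and several details are assumptions rather than the behavior of the original helpers: a word is taken to be a run of ASCII letters, windows near the end of the text are simply shorter, and the count is printed in front of each pair.

```shell
#!/bin/sh
# Hypothetical stand-in for the surroundingContext pipeline, not the
# original helper programs. Assumptions: words are runs of ASCII letters,
# and each word is paired with the (WINDOW - 1) words that follow it.
# Steps: (1) one lowercase word per line, (2) emit ordered word pairs,
# (3) count identical pairs, most frequent first.
# Usage: surrounding_context FILE WINDOW
surrounding_context() {
    tr -cs 'A-Za-z' '\n' < "$1" | tr 'A-Z' 'a-z' | awk 'NF' |
    awk -v n="$2" '
        { w[NR] = $1 }                  # remember every word in order
        END {
            for (i = 1; i <= NR; i++)
                for (j = i + 1; j < i + n && j <= NR; j++)
                    print w[i], w[j]    # directed pair (word_1, word_2)
        }' |
    sort | uniq -c | sort -rn
}
```

Calling surrounding_context text.txt 6 on a hypothetical file text.txt would list each ordered pair preceded by its frequency, which corresponds to the entries of M_d.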
Applying the procedure listed above to the source file of an older version of this
document and filtering out low frequencies using filterHighFrequencies - 20, the
following pairs of words were found in close proximity among a long list containing
otherwise mostly “noise”: (address pattern), (awk print), (awk program), (awk sed),
(echo echo), (example program), (hold space), (input line), (liberal liberal), (line
number), (newline character), (pattern space), (print print), (program line), (range
sed), (regular expressions), (sed program), (sh awk), (sh bin), (sh program), (sh
sed), (string string), and (substitution command).
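The filtering step can be pictured with a one-line awk filter. This is only a sketch under stated assumptions: the function name filter_high_frequencies is hypothetical, each input line is assumed to begin with a numeric count (as produced by uniq -c), and the threshold is assumed to be inclusive. The actual output format and cutoff rule of filterHighFrequencies are not given in this section.

```shell
#!/bin/sh
# Hypothetical stand-in for filterHighFrequencies, not the original program.
# Assumption: every input line starts with a numeric count.
# Keeps only the lines whose count is at least MIN.
# Usage: ... | filter_high_frequencies MIN
filter_high_frequencies() {
    awk -v min="$1" '$1 + 0 >= min + 0'
}
```

For instance, piping counted pairs through filter_high_frequencies 20 would drop every pair seen fewer than 20 times.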
Using the simple program listed above or some suitable modification, any language
researcher or teacher can conduct basic concordancing and text analysis without
having to purchase sometimes expensive and often inflexible concordancing or
corpus-exploration software packages. See the example given in the introduction.
In [38], a corpus search for the strings characterized by the following awk patterns
(a|(an)|(for)|(had)|(many)) [A-Za-z'-]+ of
((be)|(too)) [A-Za-z'-]+ to
was conducted. Modifying the program listed in the introduction as follows would
allow the user to search for these strings and print the strings themselves and ten
words to both the right and left of the patterns in separate files.