Linguistic Computing with UNIX Tools - Natural Language Processing and Text Mining


Finally, unique one-to-one relations between the lexica of the languages under

consideration have to be defined. Of course, this has to be done separately for

every pair of languages.

12.5.5 Corpus Exploration and Concordance

The following sh program shows how to generate the surrounding context for words

in a text file. The file name is the first argument, $1, to the program. The second

argument, $2, is supposed to be a strictly positive integer. In this example, two

words are related if there are no more than ($2) − 2 other words between them.

1: #!/bin/sh

2: # surroundingContext

3: leaveOnlyWords $1 | oneItemPerLine - |

4: mapToLowerCase - | context - $2 |

5: awk '{ for (f=2;f<=NF;f++) { print $1,$(f) } }' |

6: countFrequencies -

Explanation: If a file contains the strings (words) aa, ab, ac, ..., zz and $2 = 6,

then the first line of output of the code in lines 3-4 (into the pipe continued at line

5) would be aa ab ac ad ae af. That is what the awk program in line 5 would see

as its first line of input. The awk program would then print aa ab, aa ac, ..., aa af on

separate lines in response to that first line of input. The occurrences of such pairs are

then counted by countFrequencies. This defines a matrix M_d of directed context

(an asymmetric relation) between the words in a text. M_d is indexed by pairs of words

(word_1, word_2). If the frequency of the entry in M_d pertaining to (word_1, word_2) is

low, then the two words word_1 and word_2 are distant or unrelated.
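
The helper commands in the listing above (leaveOnlyWords, oneItemPerLine, mapToLowerCase, context, countFrequencies) are defined elsewhere and are not shown in this section. As a rough illustration only, the same directed pair counts can be approximated with standard tools. In the sketch below, the function name directed_pairs, the tokenization (letters only), the handling of the last few words of the text, and the output format (count first, then the pair) are all assumptions, not the behavior of the original helpers.

```sh
#!/bin/sh
# Sketch only: directed_pairs is a hypothetical stand-in for the whole
# pipeline above. It reads text on standard input; $1 is the window size
# (the role played by $2 in the listing).
directed_pairs() {
  # One lowercase word per line; every run of non-letters becomes a newline.
  tr -cs 'A-Za-z' '\n' | tr 'A-Z' 'a-z' |
  awk -v n="$1" '
    NF == 0 { next }                 # tr may emit one empty leading line
    { w[++m] = $1 }                  # remember the words in order
    END {
      # Pair each word with the n-1 words that follow it, so that at
      # most n-2 other words lie between the two members of a pair.
      for (i = 1; i <= m; i++)
        for (j = i + 1; j < i + n && j <= m; j++)
          print w[i], w[j]
    }' |
  # Count identical pairs; the final awk only normalizes the spacing
  # of the "uniq -c" output to "count word1 word2".
  sort | uniq -c | awk '{ print $1, $2, $3 }'
}

# Demo: window size 3, so each word is paired with the next two words.
printf 'Aa ab, ac aa ab\n' | directed_pairs 3
```

In the demo, the pair (aa ab) occurs twice and every other pair once. Because the pairs are ordered, (ab aa) is counted separately from (aa ab), which is the asymmetry of M_d described above.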

When the procedure listed above was applied to the source file of an older version of

this document and low frequencies were filtered out using filterHighFrequencies - 20,

the following pairs of words were found in close proximity, among a long list containing

otherwise mostly “noise”: (address pattern), (awk print), (awk program), (awk sed),

(echo echo), (example program), (hold space), (input line), (liberal liberal), (line

number), (newline character), (pattern space), (print print), (program line), (range

sed), (regular expressions), (sed program), (sh awk), (sh bin), (sh program), (sh

sed), (string string), and (substitution command).
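
The filtering step can be pictured as a one-line awk filter. The sketch below is not the definition of filterHighFrequencies. It assumes that each input line starts with the count, followed by the pair, and that the threshold is inclusive. The function name keep_frequent is hypothetical.

```sh
#!/bin/sh
# Sketch only: keep_frequent is a hypothetical stand-in for the filtering
# step. It assumes lines of the form "count word1 word2" on standard input
# and keeps those whose count is at least $1 (inclusive threshold assumed).
keep_frequent() {
  awk -v min="$1" '($1 + 0) >= (min + 0)'
}

# Demo with made-up counts: only the lines with a count of 20 or more remain.
printf '25 sed program\n3 liberal range\n20 hold space\n' | keep_frequent 20
```

The counts in the demo are invented for illustration and are not results from the document analyzed above.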

Using the simple program listed above or some suitable modification, any language

researcher or teacher can conduct basic concordancing and text analysis without

having to purchase sometimes expensive and often inflexible concordancing or

corpus-exploration software packages. See the example given in the introduction.

In [38], a corpus search for the strings characterized by the following awk patterns

(a|(an)|(for)|(had)|(many)) [A-Za-z'-]+ of

((be)|(too)) [A-Za-z'-]+ to

was conducted. Modifying the program listed in the introduction as follows would

allow the user to search for these strings and print the strings themselves and ten

words to both the right and left of the patterns in separate files.
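
One way such a search could be sketched with standard tools, independently of the program from the introduction, is given here. The function name find_frames, the tokenization, and the output format are assumptions. In particular, the sketch prints the left context, the match, and the right context as three tab-separated fields on one line instead of writing them to separate files.

```sh
#!/bin/sh
# Sketch only: find_frames is a hypothetical name. It reads text on standard
# input and prints, for every match of the two patterns above,
#   left context <TAB> matched three words <TAB> right context
# with up to $1 words of context on each side.
find_frames() {
  # One token per line; a token is a run of letters, apostrophes and hyphens,
  # so the middle word of a match always satisfies [A-Za-z'-]+.
  tr -cs "A-Za-z'-" '\n' |
  awk -v k="$1" '
    # Join words number from..to (clipped to the text) with single spaces.
    function span(from, to,   s, i) {
      s = ""
      for (i = from; i <= to; i++)
        if (i >= 1 && i <= m) s = s (s == "" ? "" : " ") w[i]
      return s
    }
    NF > 0 { w[++m] = $1 }
    END {
      for (i = 1; i + 2 <= m; i++) {
        of_frame = (w[i] ~ /^(a|an|for|had|many)$/ && w[i+2] == "of")
        to_frame = (w[i] ~ /^(be|too)$/ && w[i+2] == "to")
        if (of_frame || to_frame)
          printf "%s\t%s\t%s\n", span(i - k, i - 1), span(i, i + 2), span(i + 3, i + 2 + k)
      }
    }'
}

# Demo with two words of context; a real search would use: find_frames 10
printf 'It was a matter of fact that he had plenty of time, too tired to go on.\n' |
  find_frames 2
```

The demo sentence is invented. It yields three matches: "a matter of", "had plenty of", and "too tired to". Writing the three fields to three files instead would only require redirecting each part inside awk, for example with print ... > "file".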
