#!/bin/sh
leaveOnlyWords $1 | oneItemPerLine - | context - 200 |
quadrupleWords - | countFrequencies -
Explanation: leaveOnlyWords $1 | oneItemPerLine - | context - 200 generates all possible strings of 200 consecutive words in the file $1. quadrupleWords picks the first word of each line if it is repeated at least three more times within that line. An implementation of quadrupleWords is left as an exercise; or consult [40]. countFrequencies determines the frequencies of the words so found.
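The helper commands above are not standard Unix tools. Hedged sketches in terms of tr, sort, uniq, and awk might look as follows; the names and exact behaviors are assumptions reconstructed from the description above, not the implementations the text refers to ([40] is said to contain one for quadrupleWords):

```shell
# Hypothetical sketches of the helpers; reconstructed from the prose,
# not taken from [40].
leaveOnlyWords() {            # keep only letters; words become space-separated
  tr -cs 'A-Za-z' ' ' < "$1"
}
oneItemPerLine() {            # one word per line (reads standard input;
  tr -s ' \t' '\n'            # the "-" argument is simply ignored)
}
countFrequencies() {          # word frequencies, most frequent first
  sort | uniq -c | sort -rn
}
context() {                   # print every run of W consecutive words on one line
  awk -v W="$2" '
  { buf[NR % W] = $1                      # circular buffer of the last W words
    if (NR >= W) {
      line = buf[(NR - W + 1) % W]
      for (i = NR - W + 2; i <= NR; i++) line = line " " buf[i % W]
      print line
    }
  }' "$1"
}
```

With these definitions, leaveOnlyWords file | oneItemPerLine - | context - 3 prints all runs of three consecutive words of the file, one run per line.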
Note again that context - 200 creates an intermediate file which is essentially 200 times the size of the input. If one wants to apply the above to large files, the subsequent search in quadrupleWords should be combined with context - 200.
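The two stages can indeed be fused into a single awk pass over a sliding window, so that no intermediate file 200 times the input size is ever written. The following is a hedged sketch under the reading of quadrupleWords given above; the function name and the window handling are our own, not the book's:

```shell
# Fusion of "context - W" and quadrupleWords: emit the first word of each
# W-word window when it occurs at least four times in the window (the word
# plus at least three repeats). Reads one word per line, as produced by
# oneItemPerLine; only the last W words are kept in memory.
slidingQuadruple() {
  awk -v W="$1" '
  { buf[NR % W] = $1                      # circular buffer of the last W words
    if (NR >= W) {
      first = buf[(NR - W + 1) % W]       # first word of the current window
      n = 0
      for (i = 0; i < W; i++)
        if (buf[i] == first) n++
      if (n >= 4) print first
    }
  }'
}
```

For the text's setting one would use a window of 200 and pipe the result into countFrequencies.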
We have applied the above procedure to the source file of an older version of this document. Aside from function words such as "the" and a few names, the following were found with high frequency: UNIX, address, awk, character, command, field, format, liberal, line, pattern, program, sed, space, string, students, sum, and words.
12.5.4 Lexical-Etymological Analysis
In [19], the author determined the percentage of etymologically related words shared by Serbo-Croatian, Bulgarian, Ukrainian, Russian, Czech, and Polish. He looked at 1672 words from these languages to determine what percentage of its vocabulary each of the six languages shared with each of the other five. He did this analysis by hand using a single source. This kind of analysis can help in determining the validity of traditional language family groupings, e.g.:
• Is the west-Slavic grouping of Czech, Polish, and Slovak supported by their lexica?
• Do any of these languages have a significant number of non-related words in their lexica?
• Is there any other language not in the traditional grouping worthy of inclusion based on the number of words it shares with those in the group?
Information of this kind could also be applied to language teaching and learning, by making certain predictions about the "learnability" of languages with more or less similar lexica, and by developing language teaching materials targeted at learners from a given related language (e.g., Polish learners of Serbo-Croatian).
Setting aside possible copyright issues, it is easy today to scan a text written in an alphabetic writing system into a computer, obtain automatically a file format that can be evaluated by machine, and finally do such a lexical analysis of sorting, counting, and intersecting with the means we have described above. The source can be a text of any length. The search can be for any given (more or less narrowly defined) string, or for any number of them. In principle, one could scan in (or find on-line) a dictionary from each language in question to use as the source text. Then one could do the following:
1) Write rules using sed to "level" or standardize the orthography, making the text uniform.
2) Write rules using sed to account for historical sound and phonological changes. (Such rules are almost always systematic and predictable. For example, the German intervocalic "t" is changed in English to "th." Exceptional cases could be included in the programs explicitly. All of these rules already exist, thanks to the efforts of historical linguists over the last century; cf. [15].)
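As a hedged illustration of these two steps, the intervocalic correspondence just mentioned can be written as a single, deliberately oversimplified sed rule, and the shared vocabulary of two leveled word lists can then be extracted with comm. The function name, file names, and the rule itself are illustrative assumptions, not rules taken from [15]:

```shell
# Oversimplified sed rule for the German intervocalic "t" -> English "th"
# correspondence mentioned above; real rule sets (cf. [15]) are far larger
# and more carefully conditioned.
soundShift() {
  sed 's/\([aeiou]\)t\([aeiou]\)/\1th\2/g'
}

# Intersecting two leveled, sorted word lists then yields the shared
# vocabulary, e.g.:
#   comm -12 lexiconA.sorted lexiconB.sorted
```

For instance, echo vater | soundShift prints vather, one mechanical step of the way from German Vater toward English father.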