#!/bin/sh
leaveOnlyWords $1 | oneItemPerLine - | context - 200 |
quadrupleWords - | countFrequencies -
Explanation: leaveOnlyWords $1 | oneItemPerLine - | context - 200 generates all possible strings of 200 consecutive words in the file $1. quadrupleWords picks the first word of each such line, provided it is repeated at least three times within that line. An implementation of quadrupleWords is left as an exercise; or consult [40] (a sketch is also given below). countFrequencies then determines how often each of the selected words occurs.
Note again that context - 200 creates an intermediate file which essentially is
200 times the size of the input. If one wants to apply the above to large files, then
the subsequent search in quadrupleWords should be combined with context - 200 .
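Since an implementation of quadrupleWords is left as an exercise, the following is only a minimal sketch of one possibility, assuming the interface used above: the input consists of one window of consecutive words per line (a file argument of - denotes standard input), and the first word of a line is printed whenever it occurs at least four times on that line.
#!/bin/sh
# quadrupleWords -- sketch only; see the exercise and [40].
# Print the first word of every input line on which that word occurs
# at least four times, i.e. is repeated at least three more times.
awk '{
    n = 0
    for (i = 1; i <= NF; i++)
        if ($i == $1)
            n++
    if (n >= 4)
        print $1
}' "$@"
Any word that occurs at least four times in some window of 200 consecutive words also occurs at least four times in the window beginning at one of its own occurrences, so checking only the first word of each line suffices.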
We have applied the above procedure to the source file of an older version of this document. Aside from function-words such as the and a few names, the following were found with high frequency: UNIX, address, awk, character, command, field, format, liberal, line, pattern, program, sed, space, string, students, sum, and words.
12.5.4 Lexical-Etymological Analysis
In [19], the author determined the percentage of etymologically related words shared
by Serbo-Croatian, Bulgarian, Ukrainian, Russian, Czech, and Polish. The author
looked at 1672 words from the above languages to determine what percentage of
words each of the six languages shared with each of the other five. He
did this analysis by hand, using a single source. This kind of analysis can help in
determining the validity of traditional language family groupings, e.g.:
Is the West Slavic grouping of Czech, Polish, and Slovak supported by their lexica?
Does any of these languages have a significant number of unrelated words in its lexicon?
Is there any other language not in the traditional grouping worthy of inclusion, based on the number of words it shares with those in the group?
Information of this kind could also be applied to language teaching/learning by
making certain predictions about the “learnability” of languages with more or less
similar lexica and developing language teaching materials targeted at learners from
a given related language (e.g., Polish learners of Serbo-Croatian).
Disregarding a discussion about possible copyright violations, it is easy today
to scan a text written in an alphabetic writing system into a computer, obtain
automatically a file format that can be evaluated by machine, and finally do such a
lexical analysis of sorting, counting, and intersecting with the means we have described
above. The source can be a text of any length, and the search can be for any given
(more or less narrowly defined) string, or for any number of them. In principle, one could
scan in (or find on-line) a dictionary of each language in question to use as the
source text. Then one could do the following:
1) Write rules using sed to “level” or standardize the orthography to make the text
uniform.
2) Write rules using sed to account for historical sound and phonological changes.
Such rules are almost always systematic and predictable; for example, German
intervocalic “t” corresponds to English “th.” Exceptional cases could be included
in the programs explicitly. All of these rules already exist, thanks to the efforts of
historical linguists over the last century (cf. [15]). A small sketch of both steps, together with the final intersection of two word lists, is given below.
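As a rough illustration of steps 1) and 2) and of the final intersection of two lexica, the following sketch uses invented file names (german.words and english.words, one word per line) and a single invented rule: the orthography is leveled by lower-casing both lists, the correspondence between German intervocalic “t” and English “th” is applied to the German list, and sort and comm then print the words the two leveled lists share.
#!/bin/sh
# Sketch only: file names and rules are invented for illustration.
# Level the orthography (lower-case everything), apply one
# correspondence rule (German intervocalic "t" -> "th"), and sort.
tr '[:upper:]' '[:lower:]' < german.words |
sed 's/\([aeiou]\)t\([aeiou]\)/\1th\2/g' |
sort -u > german.leveled

tr '[:upper:]' '[:lower:]' < english.words |
sort -u > english.leveled

# comm -12 prints only the lines common to both sorted files, i.e.
# the shared part of the two leveled word lists.
comm -12 german.leveled english.leveled
A realistic rule set would of course be far larger and specific to the language pair, but each rule remains a one-line sed substitution, and the final comparison reduces to standard sorting and set operations on word lists.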