#!/bin/sh
leaveOnlyWords $1 | oneItemPerLine - | context - 200 |
quadrupleWords - | countFrequencies -
Explanation: leaveOnlyWords $1 | oneItemPerLine - | context - 200 generates all possible strings of 200 consecutive words in the file $1. quadrupleWords picks the first word of each line if it is repeated at least three more times within that line. An implementation of quadrupleWords is left as an exercise; or consult [40]. countFrequencies determines the frequencies of the words so found.
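The helper commands above are not standard Unix tools. Hedged sketches in terms of tr, sort, uniq, and awk might look as follows; the names and exact behaviors are assumptions reconstructed from the description above, not the implementations the text refers to ([40] is said to contain one for quadrupleWords):

```shell
# Hypothetical sketches of the helpers; reconstructed from the prose,
# not taken from [40].
leaveOnlyWords() {            # keep only letters; words become space-separated
  tr -cs 'A-Za-z' ' ' < "$1"
}
oneItemPerLine() {            # one word per line (reads standard input;
  tr -s ' \t' '\n'            # the "-" argument is simply ignored)
}
countFrequencies() {          # word frequencies, most frequent first
  sort | uniq -c | sort -rn
}
context() {                   # print every run of W consecutive words on one line
  awk -v W="$2" '
  { buf[NR % W] = $1                      # circular buffer of the last W words
    if (NR >= W) {
      line = buf[(NR - W + 1) % W]
      for (i = NR - W + 2; i <= NR; i++) line = line " " buf[i % W]
      print line
    }
  }' "$1"
}
```

With these definitions, leaveOnlyWords file | oneItemPerLine - | context - 3 prints all runs of three consecutive words of the file, one run per line.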
Note again that context - 200 creates an intermediate file which is essentially 200 times the size of the input. If one wants to apply the above to large files, the subsequent search in quadrupleWords should be combined with context - 200.
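The two stages can indeed be fused into a single awk pass over a sliding window, so that no intermediate file 200 times the input size is ever written. The following is a hedged sketch under the reading of quadrupleWords given above; the function name and the window handling are our own, not the book's:

```shell
# Fusion of "context - W" and quadrupleWords: emit the first word of each
# W-word window when it occurs at least four times in the window (the word
# plus at least three repeats). Reads one word per line, as produced by
# oneItemPerLine; only the last W words are kept in memory.
slidingQuadruple() {
  awk -v W="$1" '
  { buf[NR % W] = $1                      # circular buffer of the last W words
    if (NR >= W) {
      first = buf[(NR - W + 1) % W]       # first word of the current window
      n = 0
      for (i = 0; i < W; i++)
        if (buf[i] == first) n++
      if (n >= 4) print first
    }
  }'
}
```

For the text's setting one would use a window of 200 and pipe the result into countFrequencies.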
We have applied the above procedure to the source file of an older version of this document. Aside from function words such as "the" and a few names, the following were found with high frequency: UNIX, address, awk, character, command, field, format, liberal, line, pattern, program, sed, space, string, students, sum, and words.
12.5.4 Lexical-Etymological Analysis
In [19], the author determined the percentage of etymologically related words shared by Serbo-Croatian, Bulgarian, Ukrainian, Russian, Czech, and Polish. He looked at 1672 words from these languages to determine what percentage of its vocabulary each of the six languages shared with each of the other five. He did this analysis by hand using a single source. This kind of analysis can help in determining the validity of traditional language family groupings, e.g.:
• Is the west-Slavic grouping of Czech, Polish, and Slovak supported by their lexica?
• Do any of these languages have a significant number of non-related words in their lexica?
• Is there any other language not in the traditional grouping worthy of inclusion based on the number of words it shares with those in the group?
Information of this kind could also be applied to language teaching and learning, by making certain predictions about the "learnability" of languages with more or less similar lexica, and by developing language teaching materials targeted at learners from a given related language (e.g., Polish learners of Serbo-Croatian).
Setting aside possible copyright issues, it is easy today to scan a text written in an alphabetic writing system into a computer, obtain automatically a file format that can be evaluated by machine, and finally do such a lexical analysis of sorting, counting, and intersecting with the means we have described above. The source can be a text of any length. The search can be for any given (more or less narrowly defined) string, or for any number of them. In principle, one could scan in (or find on-line) a dictionary from each language in question to use as the source text. Then one could do the following:
1) Write rules using sed to "level" or standardize the orthography, making the text uniform.
2) Write rules using sed to account for historical sound and phonological changes. (Such rules are almost always systematic and predictable. For example, the German intervocalic "t" is changed in English to "th." Exceptional cases could be included in the programs explicitly. All of these rules already exist, thanks to the efforts of historical linguists over the last century; cf. [15].)
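As a hedged illustration of these two steps, the intervocalic correspondence just mentioned can be written as a single, deliberately oversimplified sed rule, and the shared vocabulary of two leveled word lists can then be extracted with comm. The function name, file names, and the rule itself are illustrative assumptions, not rules taken from [15]:

```shell
# Oversimplified sed rule for the German intervocalic "t" -> English "th"
# correspondence mentioned above; real rule sets (cf. [15]) are far larger
# and more carefully conditioned.
soundShift() {
  sed 's/\([aeiou]\)t\([aeiou]\)/\1th\2/g'
}

# Intersecting two leveled, sorted word lists then yields the shared
# vocabulary, e.g.:
#   comm -12 lexiconA.sorted lexiconB.sorted
```

For instance, echo vater | soundShift prints vather, one mechanical step of the way from German Vater toward English father.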