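#!/bin/sh
# Usage: script FILE
# Reduce FILE to lowercase words, one per line, then emit 23-word contexts.
# The awk filter writes each context whose fields 11 and 13 match one of the
# patterns below to FILE.<field 11>.of or FILE.<field 11>.to.
leaveOnlyWords $1| oneItemPerLine -| mapToLowerCase -| context - 23|
awk '($(11)~/^((an?)|(for)|(had)|(many))$/)&&($(13)=="of") {
File="'$1'." $(11) ".of"; print>File }
($(11)~/^((be)|(too))$/) &&($(13)=="to") {
File="'$1'." $(11) ".to"; print>File }' -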
It has been noted in several corpus studies of English collocation ([32, 41, 6]) that
searching for 5 words on either side of a given word will find 95% of collocational
co-occurrence in a text. After a search has been done for all occurrences of word_1
and the accompanying 5 words on either side in a large corpus, one can then search
the resulting list of surrounding words for multiple occurrences of word_2 to
determine with what probability word_1 co-occurs with word_2. The formula in
[12, p. 291] can then be used to determine whether an observed frequency of
co-occurrence in a given text is indeed significantly greater than the expected
frequency.
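The counting step described above can be sketched with standard tools. This is a minimal illustration under stated assumptions, not the authors' program: the function name cooccur and its parameters are hypothetical, the input is assumed to arrive on standard input as one lowercase word per line, and the window defaults to 5 words on either side. It prints only the raw counts; the significance formula of [12] is not reproduced here.

```sh
#!/bin/sh
# Sketch only: count how often word2 falls within n words of word1.
# Input on stdin: one lowercase word per line (an assumed format).
# cooccur and its arguments are illustrative names, not from the text.
cooccur() {
    awk -v w1="$1" -v w2="$2" -v n="${3:-5}" '
        { word[NR] = $1 }                  # remember every token by position
        END {
            for (i = 1; i <= NR; i++) {
                if (word[i] != w1) continue
                hits++                     # one more occurrence of word1
                lo = (i - n < 1)  ? 1  : i - n
                hi = (i + n > NR) ? NR : i + n
                for (j = lo; j <= hi; j++)
                    if (j != i && word[j] == w2) { pairs++; break }
            }
            # occurrences of word1, then windows that contain word2
            printf "%d %d\n", hits, pairs
        }'
}
```

A call might look like `tr -cs 'A-Za-z' '\n' < corpus.txt | tr 'A-Z' 'a-z' | cooccur strong tea 5`, where corpus.txt is a placeholder file name. The ratio of the second number to the first gives a rough estimate of how often word_2 appears near word_1.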
In [9], the English double genitive construction, e.g., “a friend of mine,” is
compared in terms of function and meaning to the preposed genitive construction
“my friend.” In this situation, a simple search for strings containing
of ((mine)|(yours)|...) (dative possessive pronouns) and of .*'s would locate all
of the double genitive constructions (and possibly the occasional contraction, which
could be discarded during the subsequent analysis). In addition, a search for
nominative possessive pronouns and of .*'s together with the ten words that follow
every occurrence of these two grammatical patterns would find all of the preposed
genitives (again, with some contractions). Furthermore, a citation for each located
string can be generated that includes document title, approximate page number and
line number.
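The double genitive search can be sketched with grep, which also supplies the line number needed for a citation. This is an illustration under assumptions rather than the procedure used in [9]: the function name find_double_genitives is hypothetical, and the pronoun list is one assumed expansion of the "..." in the pattern above. Note that "his" also matches ordinary possessive uses such as "of his friend", which would have to be discarded in the subsequent analysis along with the contractions.

```sh
#!/bin/sh
# Sketch only: print "line:text" for every line that contains a candidate
# double genitive (grep adds the file name when several files are given).
# find_double_genitives is an illustrative name; the pronoun list is an
# assumed expansion of the elided pattern, not taken from the source.
find_double_genitives() {
    grep -E -n -i \
        -e '(^|[^[:alpha:]])of (mine|yours|his|hers|ours|theirs)([^[:alpha:]]|$)' \
        -e "(^|[^[:alpha:]])of .*'s" \
        "$@"
}
```

For example, `find_double_genitives chapter1.txt chapter2.txt` (placeholder file names) lists each hit with its file and line number. The second pattern matches only the plain ASCII apostrophe, so text with typographic apostrophes would need to be normalized first.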
12.5.6 Reengineering Text Files across Different File Formats
In the course of the investigations outlined in [1, 2, 3], one of the authors developed
a family of programs that are able to transform the source file of [37], which was
typed with a what-you-see-is-what-you-get editor, into a prolog database. In fact,
any machine-readable format can now be generated by slightly altering the programs
already developed.
The source was available in two formats: 1) an RTF format file, and 2) a text
file free of control sequences that was generated from the first file. Both formats
have advantages and disadvantages. As outlined in Section 12.3.5, the RTF format
file distinguishes Japanese on and kun pronunciation from ordinary English text
using italic and small cap typesetting, respectively. On the other hand, the RTF
format file contains many control sequences that make the text “dirty” in regard
to machine evaluation. We have already outlined in Section 12.3.5 how unwanted
control sequences in the RTF format file were eliminated, but valuable information
in regard to the distinction of on pronunciation, kun pronunciation and English
was retained. The second, control-sequence-free file contains the standard format of
kanji, which is better suited for processing in the UNIX environment we used. In
addition, this format is somewhat more regular, which is useful in regard to pattern
matching that identifies the three different categories of entries in [37]: radical,
kanji and compound. However, very valuable information is lost in the second file
in regard to the distinction between on pronunciation, kun pronunciation and English.