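#!/bin/sh
# Sort 23-word context lines into files named after the input file:
# fields 11 and 13 select patterns such as "a ... of" and "too ... to".
leaveOnlyWords "$1" | oneItemPerLine - | mapToLowerCase - | context - 23 |
awk '($(11)~/^((an?)|(for)|(had)|(many))$/) && ($(13)=="of") {
  File="'"$1"'." $(11) ".of"; print > File }
($(11)~/^((be)|(too))$/) && ($(13)=="to") {
  File="'"$1"'." $(11) ".to"; print > File }' -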
Several corpus studies of English collocation [32, 41, 6] have noted that
searching five words on either side of a given word finds 95% of collocational
co-occurrence in a text. After a large corpus has been searched for all
occurrences of a word word_1 together with the five words on either side, one
can search the resulting list of surrounding words for occurrences of a second
word word_2 to determine with what probability word_1 co-occurs with word_2.
The formula in [12, p. 291] can then be used to determine whether an observed
frequency of co-occurrence in a given text is indeed significantly greater than
the expected frequency.
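The window search described above can be sketched as follows. This is a minimal illustration, not the authors' program: the function name cooccur, the tokenization with tr, and the output format are assumptions made here, and the helper programs used elsewhere in this chapter (leaveOnlyWords, context, etc.) are deliberately not used so that the sketch is self-contained. The significance formula of [12, p. 291] is not reproduced.

```sh
#!/bin/sh
# cooccur WORD1 WORD2 [K]  (hypothetical helper, reads text on stdin)
# Prints two numbers: the occurrences of WORD1, and how many of those
# occurrences have WORD2 within K words on either side (default K=5).
# WORD1 and WORD2 must be given in lowercase.
cooccur() {
  # One lowercase word per line; apostrophes are kept inside words.
  tr -cs "[:alpha:]'" '\n' | tr '[:upper:]' '[:lower:]' |
  awk -v w1="$1" -v w2="$2" -v k="${3:-5}" '
    length($0) > 0 { word[++n] = $0 }      # skip empty lines
    END {
      for (i = 1; i <= n; i++) {
        if (word[i] != w1) continue
        total++
        # Scan the window around position i, excluding i itself.
        for (j = i - k; j <= i + k; j++) {
          if (j == i || j < 1 || j > n) continue
          if (word[j] == w2) { hits++; break }
        }
      }
      printf "%d %d\n", total, hits
    }'
}
```

For example, cooccur strong tea < corpus.txt (corpus.txt being a placeholder file name) prints the two counts; the second divided by the first estimates how often word_1 has word_2 in its window.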
In [9], the English double genitive construction, e.g., “a friend of mine,” is
compared in terms of function and meaning to the preposed genitive construction
“my friend.” In this situation, a simple search for strings containing
of ((mine)|(yours)|...) (dative possessive pronouns) and of .*'s would locate
all of the double genitive constructions (and possibly the occasional
contraction, which could be discarded during the subsequent analysis). In
addition, a search for nominative possessive pronouns and of .*'s together with
the ten words that follow every occurrence of these two grammatical patterns
would find all of the preposed genitives (again, with some contractions).
Furthermore, a citation for each located string can be generated that includes
document title, approximate page number and line number.
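The double genitive search can be sketched with grep as follows. The function name double_genitive is hypothetical, and the pronoun list is an assumption supplied here only for illustration, since the pattern in the text elides it. The sketch reports line numbers only; document title and page number are not generated.

```sh
#!/bin/sh
# double_genitive  (hypothetical helper, reads text on stdin)
# Prints "LINENUMBER:line" for every line containing a candidate double
# genitive: "of" followed by a possessive pronoun, or "of" followed
# later on the same line by a word ending in an apostrophe and s.
# The pronoun list below is an illustrative assumption.
double_genitive() {
  pronouns="mine|yours|his|hers|ours|theirs"
  grep -n -i -E "(^|[^[:alpha:]])of ((${pronouns})([^[:alpha:]]|\$)|.*'s)"
}
```

As the text notes, the second alternative also matches contractions such as "it's", so the output is a list of candidates that still has to be checked during the subsequent analysis.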
12.5.6 Reengineering Text Files across Different File Formats
In the course of the investigations outlined in [1, 2, 3], one of the authors
developed a family of programs that transform the source file of [37], which was
typed with a what-you-see-is-what-you-get editor, into a Prolog database. In
fact, any machine-readable format can now be generated by slightly altering the
programs already developed.
The source was available in two formats: 1) an RTF format file, and 2) a text
file free of control sequences that was generated from the first file. Both
formats have advantages and disadvantages. As outlined in Section 12.3.5, the
RTF format file distinguishes Japanese on and kun pronunciation from ordinary
English text using italic and small cap typesetting, respectively. On the other
hand, the RTF format file contains many control sequences that make the text
“dirty” with regard to machine evaluation. We have already outlined in Section
12.3.5 how unwanted control sequences in the RTF format file were eliminated
while the valuable information that distinguishes on pronunciation, kun
pronunciation and English was retained. The second, control-sequence-free file
contains the standard format of kanji, which is better suited for processing in
the UNIX environment we used. In addition, this format is somewhat more regular,
which is useful for the pattern matching that identifies the three different
categories of entries in [37]: radical, kanji and compound. However, very
valuable information is lost in the second file with regard to the distinction
between on pronunciation, kun pronunciation and English.