Linguistic Computing with UNIX Tools - Natural Language Processing and Text Mining

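#!/bin/sh
# Usage: <script> file
# Saves each 23-word context whose fields 11 and 13 frame field 12 as
# "a/an/for/had/many _ of" or "be/too _ to" in file.<word>.of or file.<word>.to.
leaveOnlyWords "$1" | oneItemPerLine - | mapToLowerCase - | context - 23 |
awk -v base="$1" '
($11 ~ /^(an?|for|had|many)$/) && ($13 == "of") { File = base "." $11 ".of"; print > File }
($11 ~ /^(be|too)$/) && ($13 == "to") { File = base "." $11 ".to"; print > File }' -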

It has been noted in several corpus studies of English collocation ([32, 41, 6]) that searching for 5 words on either side of a given word will find 95% of collocational co-occurrence in a text. After a search has been done for all occurrences of a word word_1 and the accompanying 5 words on either side in a large corpus, one can then search the resulting list of surrounding words for multiple occurrences of a word word_2 to determine with what probability word_1 co-occurs with word_2. The formula in [12, p. 291] can then be used to determine whether an observed frequency of co-occurrence in a given text is indeed significantly greater than the expected frequency.
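The windowed search described above can be sketched with standard tools. This is a minimal illustration, not the authors' program: the function name cooccur, its output format, and the tokenization (lowercasing, keeping only letters and apostrophes) are assumptions made here. It only produces the raw counts; the significance formula of [12, p. 291] is not reproduced.

```shell
#!/bin/sh
# cooccur WORD1 WORD2 [FILE...]: hypothetical helper, not from the source.
# Prints two numbers: occurrences of WORD1, and how many of those
# have WORD2 within 5 words on either side.
cooccur() {
    w1=$1; w2=$2; shift 2
    # One lowercase word per line; everything except letters and
    # apostrophes is treated as a separator (an assumed tokenization).
    cat "$@" | tr 'A-Z' 'a-z' | tr -cs "a-z'" '\n' |
    awk -v w1="$w1" -v w2="$w2" -v span=5 '
        NF { word[++n] = $0 }      # store the token stream, skipping empty lines
        END {
            for (i = 1; i <= n; i++) {
                if (word[i] != w1) continue
                total++
                # Inspect up to span words on either side of position i.
                for (j = i - span; j <= i + span; j++)
                    if (j != i && j >= 1 && j <= n && word[j] == w2) { hits++; break }
            }
            printf "%d %d\n", total, hits
        }'
}
```

Called as cooccur strong tea corpus.txt (the file name is a placeholder), the ratio of the second number to the first estimates how often word_1 is accompanied by word_2 within the window.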

In [9], the English double genitive construction, e.g., “a friend of mine,” is compared in terms of function and meaning to the preposed genitive construction “my friend.” In this situation, a simple search for strings containing of ((mine)|(yours)|...) (dative possessive pronouns) and of .*'s would locate all of the double genitive constructions (and possibly the occasional contraction, which could be discarded during the subsequent analysis). In addition, a search for nominative possessive pronouns and of .*'s together with the ten words that follow every occurrence of these two grammatical patterns would find all of the preposed genitives (again, with some contractions). Furthermore, a citation for each located string can be generated that includes document title, approximate page number and line number.
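A search of this kind for the double genitive might look as follows. This is a sketch under stated assumptions: the function name find_double_genitives is hypothetical, and the pronoun list (mine, yours, his, hers, ours, theirs) is an illustrative guess at what the elided pattern above would contain. The citation is reduced to file name and line number; page numbers would require knowledge of the document's layout.

```shell
#!/bin/sh
# find_double_genitives [FILE...]: hypothetical helper, not from the source.
# Prints "file:line: text" for every line that contains a candidate
# double genitive.
find_double_genitives() {
    awk -v q="'" '
    {
        line = tolower($0)
        # Pattern 1: "of" followed by a possessive pronoun. The list is an
        # assumption; the source elides it.
        # Pattern 2: "of", then anything ending in apostrophe-s. This also
        # catches some contractions, which must be filtered out later.
        if (line ~ /(^|[^a-z])of (mine|yours|his|hers|ours|theirs)([^a-z]|$)/ ||
            line ~ ("(^|[^a-z])of .*" q "s"))
            print FILENAME ":" FNR ": " $0
    }' "$@"
}
```

Because the match is made per line, a construction split across a line break is missed; joining lines into sentences first would be one way to address that.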

12.5.6 Reengineering Text Files across Different File Formats

In the course of the investigations outlined in [1, 2, 3], one of the authors developed a family of programs that transform the source file of [37], which was typed with a what-you-see-is-what-you-get editor, into a Prolog database. In fact, any machine-readable format can now be generated by slightly altering the programs already developed.
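To give an idea of what such a conversion involves, the following sketch turns a simple text file into Prolog facts. It is not one of the programs mentioned above, and it does not use the actual layout of [37], which is not described here. The input format (three tab-separated fields per line: headword, reading, English gloss), the predicate name entry, and the function name to_prolog are all assumptions made for illustration, as is the restriction that no field contains a backslash.

```shell
#!/bin/sh
# to_prolog [FILE...]: hypothetical helper, not the authors' program.
# Assumed input: one entry per line with three tab-separated fields
# (headword, reading, English gloss) and no backslashes in any field.
to_prolog() {
    awk -F '\t' -v q="'" '
    function atom(s) {
        gsub(q, q q, s)        # double each single quote inside the text
        return q s q           # wrap the result in single quotes
    }
    # Lines that do not have exactly three fields are skipped.
    NF == 3 { printf "entry(%s, %s, %s).\n", atom($1), atom($2), atom($3) }
    ' "$@"
}
```

Changing only the printf line would produce a different target format, which is the sense in which one converter can be altered slightly to serve several formats.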

The source was available in two formats: 1) an RTF format file, and 2) a text file free of control sequences that was generated from the first file. Both formats have advantages and disadvantages. As outlined in Section 12.3.5, the RTF format file distinguishes Japanese on and kun pronunciation from ordinary English text using italic and small cap typesetting, respectively. On the other hand, the RTF format file contains many control sequences that make the text “dirty” with regard to machine evaluation. We have already outlined in Section 12.3.5 how unwanted control sequences in the RTF format file were eliminated while valuable information regarding the distinction between on pronunciation, kun pronunciation and English was retained. The second, control-sequence-free file contains the standard format of kanji, which is better suited for processing in the UNIX environment we used. In addition, this format is somewhat more regular, which is useful for the pattern matching that identifies the three different categories of entries in [37]: radical, kanji and compound. However, very valuable information is lost in the second file with regard to the distinction between on pronunciation, kun pronunciation and English.
