Linguistic Computing with UNIX Tools - Natural Language Processing and Text Mining

• length(string) returns the length of string, i.e., the number of characters in string.

• index(bigstring, substring). Comment: This produces the position where substring starts in bigstring. If substring is not contained in bigstring, then the value 0 is returned. Since it yields a position and not merely a match, index() allows analysis of fields beyond matching a substring.

• substr(string, n1, n2). Comment: This produces the substring of string that starts at the n1-th character and is at most n2 characters long. If fewer than n2 characters remain after position n1, or if n2 is omitted, then string is copied from the n1-th character to the end.

• split(string, arrayName, "c"). Comment: This splits string at every instance of the separator character c into the array arrayName and returns the number of fields encountered.

• string = sprintf(format, expr1, expr2 ...). Comment: This sets string to what is produced by printf format, expr1, expr2 ... For the printing function printf in awk or C, consult [40, 31, 4, 5] and the manual pages for awk and printf.
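The functions listed above can be tried out in a single BEGIN block. The following is a minimal sketch, not taken from the source: the sample string and the function name string_demo are assumptions chosen for illustration.

```shell
#!/bin/sh
# Sketch: exercise the five string functions on one hypothetical sample string.
string_demo() {
  awk 'BEGIN {
    s = "the quick brown fox"             # hypothetical sample string
    print length(s)                       # number of characters in s
    print index(s, "quick")               # position where "quick" starts
    print index(s, "slow")                # 0, since "slow" does not occur
    print substr(s, 5, 5)                 # 5 characters starting at position 5
    print substr(s, 11)                   # from position 11 to the end
    n = split(s, words, " ")              # split at blanks into words[1..n]
    print n, words[1], words[n]           # field count, first and last field
    t = sprintf("%-6s|%03d", words[2], n) # formatted text stored in t
    print t
  }'
}

string_demo
```

Run as written, this should print 19, 5, 0, quick, brown fox, 4 the fox and quick |004 on separate lines.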

Application (generating strings of context from a file): The next important example shows the use of the functions index() and substr() in awk. It generates all possible sequences of consecutive words of a certain length in a file. We shall refer to it as context. Suppose that a file $1 is organized in such a way that single words are on individual lines (e.g., the output of a pipe leaveOnlyWords | oneItemPerLine). context uses two arguments. The first argument $1 is supposed to be the name of the file that is organized as described above. The second argument $2 is supposed to be a positive integer. context then generates “context” of length $2 out of $1. In fact, all possible sequences of length $2 of consecutive words in $1 are generated and printed.
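The input format assumed here, one word per line, can be approximated with standard tools when the scripts leaveOnlyWords and oneItemPerLine are not at hand. The sketch below is only a rough stand-in under that assumption: one_word_per_line is a hypothetical name, and it treats only ASCII letters as word characters.

```shell
#!/bin/sh
# Hypothetical stand-in for the pipe leaveOnlyWords | oneItemPerLine:
# every run of non-letters is replaced by a single newline.
one_word_per_line() {
  tr -cs 'A-Za-z' '\n'
}

# Example: punctuation and blanks disappear, one word remains per line.
printf 'The cat, the dog.\n' | one_word_per_line
```

Text that begins with a non-letter would yield an empty first line, so real input may need an extra filtering step.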

1: #!/bin/sh
2: # context
3: # First argument $1 is input file name.
4: # Second argument $2 is context-length.
5: awk 'BEGIN { cLength='$2'+0 }
6: NR==1 { c=$0 }
7: NR>1 { c=c" "$0 }
8: NR>cLength { c=substr(c,index(c," ")+1) }
9: NR>=cLength { print c }' "$1"

Explanation: Suppose the above program is invoked as context sourceFile 11. Then $2 = 11. In line 5, the awk variable cLength is set to 11. The operation +0 forces whatever string is passed to context as the second argument $2, even the empty string, to be treated as a number in the remainder of the program. In the second command of the awk program (line 6), the context c is set to the first word (i.e., input line). In the third command (line 7), every subsequent word (input line) is appended to c, separated by a blank. The fourth statement (line 8) works as follows: once 12 words are collected in c, the first is cut away by using the position of the first blank, i.e., index(c," "), and reproducing c from index(c," ")+1 until the end. Thus, the word at the very left of c is lost. Finally (line 9), the context c is printed as soon as it contains cLength = 11 words.
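The sliding window is easier to follow on a small input. The following is a minimal sketch of the same logic under two assumptions that are not in the source: the words are read from standard input instead of a file, and the length is passed with awk -v. The name context_stdin is hypothetical.

```shell
#!/bin/sh
# Sketch of the same sliding window, reading one word per line from stdin.
# context_stdin is a hypothetical name; $1 is the context length.
context_stdin() {
  awk -v cLength="$1" '
    BEGIN         { cLength += 0 }                        # force a numeric value
    NR == 1       { c = $0 }                              # first word starts the context
    NR > 1        { c = c " " $0 }                        # append later words after a blank
    NR > cLength  { c = substr(c, index(c, " ") + 1) }    # drop the leftmost word
    NR >= cLength { print c }                             # print once the window is full
  '
}

# Five words and a window of three give three overlapping contexts.
printf '%s\n' we shall refer to it | context_stdin 3
```

With this input the sketch should print "we shall refer", "shall refer to" and "refer to it", one per line: each new word enters on the right while the oldest word is cut off on the left.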
