Textual Genre Analysis and Identification - Ambient Intelligence for Scientific Discovery - page 139

5.4 Heuristic 4: Prefer Long over Short Strings - But Not Too Long

This heuristic is already implicit in the first three. Lengthening existing strings is the main operation through which coded strings are tested to rule out overcommitted classifications, to ensure conservative codings, and to diversify strings across different contexts. By definition, longer strings absorb more context than shorter strings, which makes them useful lenses for understanding the commitments of shorter strings. This fourth heuristic, incidentally, is built into the string-matching algorithm, which, from the same starting point, leaps over shorter matches if longer ones are available. Notice in Table 2 how lengthening the motion word "jump" makes it possible to understand a range of functions that one could not easily have predicted from the single word.

Table 2. Variant Lengthening off the Word “Jump” Produces Different Functions.

Lengthened string                   Function
While jumping on the walls          continuous motion
She jumped at the opportunity       positive standard
They jumped to the conclusion       negative standard
He jumped around the house          motion
He'll get a jump on the problem     positive standard
They jumped all over him            negative affect
He is good at jumping rope          generic motion
I've jumped around the country      autobiographical
He jumped around the country        scene shift
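The longest-match behavior described above can be illustrated with a minimal sketch. It is not DocuScope's actual matcher: the `CATALOG` dictionary, the `tag_longest` function, the lower-cased whitespace tokenization, and the bare "jumped" entry are all assumptions made for illustration, with the longer entries borrowed from Table 2.

```python
from typing import Dict, List, Tuple

# Hypothetical mini-catalog. The multi-word entries and their labels come from
# Table 2; the single-word "jumped" entry is an assumption for illustration.
CATALOG: Dict[Tuple[str, ...], str] = {
    ("jumped",): "motion",
    ("jumped", "at", "the", "opportunity"): "positive standard",
    ("jumped", "to", "the", "conclusion"): "negative standard",
    ("jumped", "all", "over", "him"): "negative affect",
}

# Longest string in the catalog, measured in tokens.
MAX_LEN = max(len(key) for key in CATALOG)


def tag_longest(tokens: List[str]) -> List[Tuple[str, str]]:
    """Scan left to right; at each position keep the longest catalog match."""
    tagged: List[Tuple[str, str]] = []
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest candidate first so shorter matches are leapt over.
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            key = tuple(tokens[i:i + n])
            if key in CATALOG:
                tagged.append((" ".join(key), CATALOG[key]))
                i += n  # resume scanning after the matched string
                matched = True
                break
        if not matched:
            i += 1  # no catalog string starts here; move to the next token
    return tagged


print(tag_longest("she jumped at the opportunity".split()))
print(tag_longest("he jumped".split()))
```

In this sketch the first call is tagged with the four-word string rather than with "jumped" alone, while the second call falls back to the single word because no longer string is available from that starting point.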

Still, longer is not always better, especially when the frequency with which a long string recurs approaches zero. Before admitting a string, we asked ourselves whether it had a chance of being re-used across other texts and writers. If we could not answer this re-use question positively, we did not include the string.
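The re-use question above was answered by the coders' judgment, not by a program. As a rough mechanical analogue only, the sketch below counts how many texts in a small sample contain each candidate string and keeps those found in at least `min_texts` of them. The function names, the threshold, the sample sentences, and the crude case-insensitive substring test are all assumptions.

```python
from typing import Dict, List


def reuse_counts(candidates: List[str], texts: List[str]) -> Dict[str, int]:
    """For each candidate string, count the texts that contain it."""
    lowered = [text.lower() for text in texts]
    # Plain substring containment: a crude stand-in for real string matching.
    return {c: sum(c.lower() in text for text in lowered) for c in candidates}


def admit(candidates: List[str], texts: List[str], min_texts: int = 2) -> List[str]:
    """Keep candidates that recur in at least `min_texts` different texts."""
    counts = reuse_counts(candidates, texts)
    return [c for c in candidates if counts[c] >= min_texts]


# Invented sample texts, standing in for different writers.
texts = [
    "She jumped at the opportunity to travel.",
    "He jumped at the opportunity and never looked back.",
    "She jumped at the opportunity to repaint her grandmother's blue shed.",
]
candidates = [
    "jumped at the opportunity",
    "jumped at the opportunity to repaint her grandmother's blue shed",
]
print(admit(candidates, texts))
```

Here the shorter string recurs in all three texts and is kept, while the over-long string occurs in only one and is dropped, which mirrors the "not too long" half of the heuristic.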

6 Identifying Genres: Exploring Language and Culture in the Tech Review

Using these heuristics, DocuScope has been developed into a text visualization and analysis environment containing a catalog of over 300 million strings organized as shown in Table 3. At the highest level, these strings fall into three distinct clusters; at the mid-level, there are 18 dimensions; at the lowest level, the strings are divided into a little over 140 classes. This hierarchical structure is, in effect, a multivariate model of text with the potential power to articulate and even improve upon rhetorical reading in genre analysis and discovery.
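To make the idea of a hierarchical, multivariate model concrete, the following sketch rolls class-level matches up to dimensions and clusters and returns one rate per category. The category names are placeholders, not the names in Table 3, and the `HIERARCHY` mapping, the `profile` function, and the per-100-token normalization are assumptions for illustration.

```python
from collections import Counter
from typing import Dict, List, Tuple

# class -> (dimension, cluster). All names are placeholders, not Table 3's.
HIERARCHY: Dict[str, Tuple[str, str]] = {
    "class_a1": ("dimension_a", "cluster_1"),
    "class_a2": ("dimension_a", "cluster_1"),
    "class_b1": ("dimension_b", "cluster_2"),
}


def profile(class_hits: List[str], n_tokens: int) -> Dict[str, Dict[str, float]]:
    """Roll class-level matches up the hierarchy, as rates per 100 tokens."""
    levels = {"class": Counter(), "dimension": Counter(), "cluster": Counter()}
    for cls in class_hits:
        dimension, cluster = HIERARCHY[cls]
        levels["class"][cls] += 1
        levels["dimension"][dimension] += 1
        levels["cluster"][cluster] += 1
    scale = 100.0 / n_tokens  # assumed normalization so texts are comparable
    return {
        level: {name: count * scale for name, count in counts.items()}
        for level, counts in levels.items()
    }


# One text of 200 tokens in which four strings were matched.
result = profile(["class_a1", "class_a2", "class_b1", "class_a1"], n_tokens=200)
print(result["dimension"])
```

Each text thus becomes a vector of rates at whichever level of the hierarchy is of interest, and such vectors are the kind of input that standard multivariate methods operate on.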

6.1 Reading and Multivariate Models

It is our contention that the rhetorical reader approaches the task of genre selection as a serial task with underlying multivariate components similar to those
