we repeated this process of adding, testing, and differentiating strings of English
over thousands of texts. We stayed with this process until we had nearly 150
categories that seemed robust and stable and that could differentiate, in princi-
ple, millions of strings. As our coding of strings evolved, we were able to derive
formal decision criteria for classifying strings into one of 18 overall dimensions.
The string matcher can match any literal string of English of any length. For
efficiency of coding, the software allowed us to run the string matcher on up to
500 texts at a time and over any number of user-defined categories of different
strings. When the string matcher found a matching string in any of the target
texts, it tagged it by name and color. The visualizer made it relatively easy for
the research team to study the performance of the string-matcher and to improve
it rapidly based on errors in its performance. The visualizer made it possible to
build a very large and consistently classified inventory of priming strings in a
relatively short amount of time.
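The matching-and-tagging loop described above can be sketched roughly as follows. This is a minimal illustration, not the project's software; the category names and example strings are invented, and the real inventory held nearly 150 categories.

```python
# Minimal sketch of a literal string matcher over user-defined
# categories. Categories and strings here are illustrative only.
import re

categories = {
    "FirstPerson": ["I think", "in my view"],
    "Negativity": ["never", "cannot"],
}

def tag_strings(text, categories):
    """Return (start, end, category, matched_string) tuples for every
    literal category string found in the text, sorted by position."""
    tags = []
    for name, strings in categories.items():
        for s in strings:
            for m in re.finditer(re.escape(s), text):
                tags.append((m.start(), m.end(), name, s))
    return sorted(tags)

sample = "I think we cannot stop now; in my view we never should."
for start, end, name, s in tag_strings(sample, categories):
    print(f"{name}: {s!r} at offset {start}")
```

In the actual system each match was also rendered by name and color in the visualizer, so the research team could audit matches across up to 500 texts at a time.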
Where did we find the speech and texts to look for priming strings? We sam-
pled the Lincoln/Douglas debates [13], texts associated with description, nar-
rative, exposition, reporting, quotation, dialog, and conversational interaction.
We also relied on three “seed” text collections. The first was a 120-text digital
archive of short stories and fictional works. The second was a database of 45
electronic documents associated with a software engineering project, including
proposals to the client, software design specifications, meeting minutes within
the design team, meeting minutes between the design team and the client team,
software documentation, focus group reports, public relations announcements,
and feature interviews. We constructed a third archive from the Internet: the
Federalist papers, the Presidential Inaugurals, the journals of Lewis and Clark,
song lyrics from rapsters and rockers, the clips of various syndicated newspaper
columnists, the Web-page welcomes of 30 university presidents, Aesop's fables and
the Brothers Grimm, the writings of Malcolm X, the 100 great speeches of the
20th century, 10 years of newspaper reporting on the Exxon Valdez disaster, and
movie reviews. We sampled 200 texts from this collection and made sure that
we had multiple instances of each type of writing so that each type could be
divided into training and test runs as we cycled through test and improvement
cycles. On a weekly basis over a three-year period, we also coded strings from the
New Yorker magazine and from the editorials, features, and news pages of The
New York Times. To capture data from speech, we spent 2 to 4 hours every
week coding the strings we heard over radio stations focused on news, talk, or
sports. The visualization environment allowed us to visually inspect and test new
samples in our archive. As a further quality control, we built a collision detector
that warned us if we assigned the same string to multiple categories. This
helped us locate ambiguities in the string data and debug inconsistencies.
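The collision detector described above amounts to an inverted index from strings to the categories that claim them. A minimal sketch, with invented category contents:

```python
# Sketch of a collision detector: warn when the same literal string
# has been assigned to more than one category. The categories below
# are illustrative, not the project's actual inventory.
from collections import defaultdict

def find_collisions(categories):
    """Map each string to the sorted list of categories that claim it,
    keeping only strings claimed by two or more categories."""
    owners = defaultdict(set)
    for name, strings in categories.items():
        for s in strings:
            owners[s].add(name)
    return {s: sorted(cats) for s, cats in owners.items() if len(cats) > 1}

categories = {
    "Negativity": ["never", "cannot"],
    "Denial": ["cannot", "deny"],
}
for s, cats in find_collisions(categories).items():
    print(f"collision: {s!r} appears in {cats}")
```

Each reported collision flags either an ambiguity to resolve (the string genuinely belongs to one category) or an inconsistency introduced during coding.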
The visualization environment we taught with [2] also became a centerpiece in
one of our graduate writing courses. As part of their continuing training in close
reading, students were asked to keep semester logs of the matched strings they
found in their own writing and in the writing of their peers. They were asked to
keep systematic notes about whether the strings matched by the software were