we repeated this process of adding, testing, and differentiating strings of English
over thousands of texts. We stayed with this process until we had nearly 150
categories that seemed robust and stable and that could differentiate, in princi-
ple, millions of strings. As our coding of strings evolved, we were able to derive
formal decision criteria for classifying strings into one of 18 overall dimensions.
The string matcher can match any literal string of English of any length. For
efficiency of coding, the software allowed us to run the string matcher on up to
500 texts at a time and over any number of user-defined categories of different
strings. When the string matcher found a matching string in any of the target
texts, it tagged it by name and color. The visualizer made it relatively easy for
the research team to study the performance of the string-matcher and to improve
it rapidly based on errors in its performance. The visualizer made it possible to
build a very large and consistently classified inventory of priming strings in a
relatively short amount of time.
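The matching-and-tagging loop described above can be sketched roughly as follows. This is a minimal illustration, not the project's software; the category names and example strings are invented, and the real inventory held nearly 150 categories.

```python
# Minimal sketch of a literal string matcher over user-defined
# categories. Categories and strings here are illustrative only.
import re

categories = {
    "FirstPerson": ["I think", "in my view"],
    "Negativity": ["never", "cannot"],
}

def tag_strings(text, categories):
    """Return (start, end, category, matched_string) tuples for every
    literal category string found in the text, sorted by position."""
    tags = []
    for name, strings in categories.items():
        for s in strings:
            for m in re.finditer(re.escape(s), text):
                tags.append((m.start(), m.end(), name, s))
    return sorted(tags)

sample = "I think we cannot stop now; in my view we never should."
for start, end, name, s in tag_strings(sample, categories):
    print(f"{name}: {s!r} at offset {start}")
```

In the actual system each match was also rendered by name and color in the visualizer, so the research team could audit matches across up to 500 texts at a time.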
Where did we find the speech and texts to look for priming strings? We sam-
pled the Lincoln/Douglas debates [13], texts associated with description, nar-
rative, exposition, reporting, quotation, dialog, and conversational interaction.
We also relied on three “seed” text collections. The first was a 120-text digital
archive of short stories and fictional works. The second was a database of 45
electronic documents associated with a software engineering project, including
proposals to the client, software design specifications, meeting minutes within
the design team, meeting minutes between the design team and the client team,
software documentation, focus group reports, public relations announcements,
and feature interviews. We constructed a third archive from the Internet: the
Federalist papers, the Presidential Inaugurals, the journals of Lewis and Clark,
song lyrics from rapsters and rockers, the clips of various syndicated newspaper
columnists, the Web-page welcomes of 30 university presidents, Aesop's fables and
the Brothers Grimm, the writings of Malcolm X, the 100 great speeches of the
20th century, 10 years of newspaper reporting on the Exxon Valdez disaster, and
movie reviews. We sampled 200 texts from this collection and made sure that
we had multiple instances of each type of writing so that each type could be
divided into training and test runs as we cycled through test and improvement
cycles. On a weekly basis over a three-year period, we also coded strings from the
New Yorker magazine and from the editorials, features, and news pages of The
New York Times. To capture data from speech, we spent 2 to 4 hours every
week coding the strings we heard over radio stations focused on news, talk, or
sports. The visualization environment allowed us to visually inspect and test new
samples in our archive. As a further quality control, we built a collision detector
that warned us if we assigned the same string to multiple categories. This
helped us locate ambiguities in the string data and debug inconsistencies.
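The collision detector described above amounts to an inverted index from strings to the categories that claim them. A minimal sketch, with invented category contents:

```python
# Sketch of a collision detector: warn when the same literal string
# has been assigned to more than one category. The categories below
# are illustrative, not the project's actual inventory.
from collections import defaultdict

def find_collisions(categories):
    """Map each string to the sorted list of categories that claim it,
    keeping only strings claimed by two or more categories."""
    owners = defaultdict(set)
    for name, strings in categories.items():
        for s in strings:
            owners[s].add(name)
    return {s: sorted(cats) for s, cats in owners.items() if len(cats) > 1}

categories = {
    "Negativity": ["never", "cannot"],
    "Denial": ["cannot", "deny"],
}
for s, cats in find_collisions(categories).items():
    print(f"collision: {s!r} appears in {cats}")
```

Each reported collision flags either an ambiguity to resolve (the string genuinely belongs to one category) or an inconsistency introduced during coding.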
The visualization environment we taught with [2] also became a centerpiece in
one of our graduate writing courses. As part of their continuing training in close
reading, students were asked to keep semester logs of the matched strings they
found in their own writing and in the writing of their peers. They were asked to
keep systematic notes about whether the strings matched by the software were