Then Came the Bots

It was incredible what the Wikipedia community achieved with individual volunteer editors. But enforcing any kind of uniformity (making changes across all the animal-related articles on the site, for example) was a much harder slog.

As Wikipedia got bigger, it became even harder to coordinate going through thousands of articles to make sure each one was consistent with the others. Is the spelling correct? Is punctuation used the same way? Do all the articles about animals have genus and species defined?

So while Wikipedia didn’t depend on users having advanced computer skills, people who knew how to write computer programs had a major role to play. And perhaps that’s one of the most brilliant parts of Wikipedia’s culture. It may have been inspired by earlier online communities, but it remained an inclusive, human-oriented endeavor, allowing the tech elite and the tech averse to work side by side toward the same goal of building an encyclopedia. There was no better example of this than when software "robots" came onto the scene after Wikipedia got too big to maintain simply by hand.

In October 2002 Derek Ramsey was at the center of the most controversial move in Wikipedia history. Wikipedia had been growing slowly, and its English-language edition was starting to gather impressive attention. It had roughly 50,000 English articles, while other languages were just getting off the ground; the nascent German, French, Dutch, Polish, and Esperanto communities had just a few thousand articles each.


Ramsey was a skinny computer science student, recently graduated from the Rochester Institute of Technology when he discovered Wikipedia in September 2002. Always interested in mathematics and statistics, he noticed that Wikipedia had a smattering of articles about big cities and towns in the United States, but it was by no means comprehensive or complete. He signed up as User:Ram-man, and started work on the articles related to geography, something he had always loved.

Being a numbers wonk, he saw that the United States Census data was on the Internet for anyone to download. Performed once every ten years, the U.S. Census attempts to record every single living person in the United States for the purpose of determining how representation for each state is divided in Congress. Along the way, census takers capture detailed statistics for individuals in each town, city, and village. Ramsey thought the publicly available census data would certainly be valuable to help fill in Wikipedia articles.

Fortunately, copyright issues about the numbers and statistics were not a problem. Works by the United States government are not protected by copyright, and are considered public domain. In the early days of the Internet, before Wikipedia was created, The CIA World Fact Book was one of the most popular references on the Internet. Despite its ominous title, the collection of detailed country profiles by the top U.S. intelligence organization was public domain and could be freely copied and published on the Internet. The same was true of the census data from the year 2000, which could be downloaded from the U.S. government Web site.

When Ramsey discovered Wikipedia, he was a new graduate, recently married but unemployed. "After the dot-com bubble bursting and September 11," he says, "software engineering jobs were more scarce, so I didn’t find a job until November 2002." In the intervening months, he took the time to experiment with Wikipedia. After dabbling with a handful of articles, he saw the power of creating pages. But being a computer scientist, he was irked by the haphazard nature of Wikipedia.

"I discovered that most of the cities that I wanted to work on did not exist.

I didn’t want to create an article with one sentence with some little bit of trivia. I also discovered that many other people were shy about creating such articles. I wanted to create an article on every U.S. city and county so that people would have stubs to work from and not feel daunted by article creation."

Ramsey thought, why not insert all the census data into Wikipedia? A visit to the Census Bureau’s Web site showed a bounty of statistics and geographic data. But it was like a jigsaw puzzle. It contained a hodgepodge of information and uncorrelated data in different formats. Seeing it as a challenge to get it unified and inserted into Wikipedia, Ramsey plowed in and got to work.

He spent hours sifting through the numbers and cross-referencing the information from multiple databases. The job wasn’t without "contradictions and difficulties." Longitude and latitude coordinates, postal zip codes, and Federal Information Processing Standard codes made up a maze of numbers that needed to be sifted and correlated. After dozens of iterations and using his computer programming skills, Ramsey eventually generated a unified database that could be processed and systematically inserted into Wikipedia.

He started with the 3,000 counties in the United States and inserted each one by hand, manually copying and pasting the prepared text into newly created Wikipedia articles.

"I was unemployed, so while I was not job hunting, I was working on Wikipedia during the day, for many many hours."

Finishing the 3,000 entries kicked off something in Ramsey’s pleasure center. He had his first whiff of "wiki-crack," the irreverent jargon Wikipedians have used to describe their addiction. So he set out on the next task—adding 33,832 city articles to Wikipedia. The problem was, at the same rate of entry, he calculated it would take months to hand-edit and create all those articles.

He saw that other Wikipedians had instead crafted software programs to act like human editors and insert data into Wikipedia. These "software robots," or bots, mimicked what a human editor would do but never tired or asked for a break. Bots were usually written in a simple scripting language like Perl or PHP (the latter being the language Wikipedia’s own software is written in), and they had to be well tested lest they wreak havoc on the articles.

Until then, these robots had done rather small, repetitive tasks like fixing punctuation or reformatting pages, things that were easily interpreted as being useful to Wikipedia.

Ramsey thought, why not use this method to insert the census data into Wikipedia? Putting his programming skills to work, and reviewing the work of previous bot creators, he created his own version to do the job. His bot did exactly what a human would do: create an article, load numbers from a database, copy the prepared text in, save the article, and go on to the next one. But it wouldn’t get tired or bored and, as a result, wouldn’t make sloppy mistakes.
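The loop is simple enough to sketch. What follows is a hypothetical illustration in modern Python rather than the Perl or PHP of 2002; the input file census_2000.csv, the column names, and the stub wording are assumptions made for this sketch, not Ramsey’s actual code, and it writes the generated wikitext to local files instead of saving pages on the live site.

# A minimal, hypothetical sketch of the Rambot loop, not Ramsey's actual code.
# It reads prepared census rows from a CSV file and emits one article stub per row.
import csv

STUB_TEMPLATE = (
    "{name} is a town located in {county}, {state}. "
    "As of the 2000 census, the population of the town is {population}.\n\n"
    "== Geography ==\n"
    "The town has a total area of {area_km2} km2 ({area_mi2} mi2).\n"
)

def make_stub(row: dict) -> str:
    # Load the numbers for one place and copy them into the prepared text.
    return STUB_TEMPLATE.format(**row)

def run(csv_path: str) -> None:
    # The real bot saved each page through Wikipedia's edit form; this sketch
    # just writes the wikitext to local files, one per article, and moves on.
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            title = f"{row['name']}, {row['state']}"
            with open(title.replace(" ", "_") + ".wiki", "w") as out:
                out.write(make_stub(row))

if __name__ == "__main__":
    run("census_2000.csv")  # assumed input file of cleaned, unified census data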

After some "massaging" of the census data so it was all consistently formatted and bot-friendly, Ramsey was ready to fire. This bot was going to be a bit different, though. In an English Wikipedia with just over 50,000 articles, he was about to push the button to add 33,832 more, all in one shot. He would instantly be responsible for 40 percent of all the Wikipedia articles.

Such mass creation of articles had never been done before. He wasn’t sure how the community would react. But in the spirit of "Be bold," one of Wikipedia’s core mantras, he hit the "Start" button.

From October 19 to October 25, the bot operated under the auspices of his User:Ram-man account as it plodded through the list, starting with [[Autaugaville, Alabama]] and working tirelessly for a whole week to finish. The articles all followed a similar format, taking raw numbers and putting them into slightly more palatable human-readable prose:

Autaugaville is a town located in Autauga County, Alabama. As of the 2000 census, the population of the town is 820.

GEOGRAPHY

The town has a total area of 20.5 km² (7.9 mi²). 20.0 km² (7.7 mi²) of it is land and 0.4 km² (0.2 mi²) of it is water. The total area is 2.15% water.

DEMOGRAPHICS

As of 2000, there are 820 people, 316 households, and 219 families residing in the town. The population density is 41.0/km² (106.1/mi²). There are 384 housing units at an average density of 19.2 persons/km² (49.7 persons/mi²). The racial makeup of the town is 32.32% White, 65.98% African American, 0.24% Native American, 0.00% Asian, 0.00% Pacific Islander, 0.24% from other races, and 1.22% from two or more races. 0.98% of the population are Hispanic or Latino of any race.

There are 316 households out of which 34.5% have children under the age of 18 living with them, 39.6% are married couples living together, 25.0% have a woman whose husband does not live with her, and 30.4% are non-families. 28.5% of all households are made up of individuals and 13.6% have someone living alone who is 65 years of age or older. The average household size is 2.59 and the average family size is 3.18.

In the town the population is spread out with 31.1% under the age of 18, 8.9% from 18 to 24, 26.5% from 25 to 44, 20.2% from 45 to 64, and 13.3% who are 65 years of age or older. The median age is 33 years. For every 100 females there are 86.4 males. For every 100 females age 18 and over, there are 78.8 males.

The median income for a household in the town is $22,563, and the median income for a family is $35,417. Males have a median income of $29,688 versus $19,821 for females. The per capita income for the town is $12,586. 27.1% of the population and 27.4% of families are below the poverty line. Out of the total people living in poverty, 31.2% are under the age of 18 and 23.2% are 65 or older.

As it chugged along, people started to notice the gradual accumulation of articles, almost like a slowly rising flood within Wikipedia. Some thought it was a great deed, adding the crystal seeds needed to spur more activity for individual towns.

But it wasn’t all a warm reception for Ram-man.

Others viewed his work as an abomination—an unintelligent automaton systematically spewing rote text, fouling the collection of articles. Wikipedia was supposed to be a project started by humans and controlled by humans. Was an article where every other word was a number or a statistic a well-crafted start or simply a data dump?

There was no doubting the good intentions of Ramsey, a convivial programming whiz and part-time Church of the Brethren preacher. But debate would brew in the community about this massive bump in article count. Was it healthy to preen about the number of entries, knowing few of the new articles had been seen by human eyes or would be edited anytime soon? Some were skeptical about the value of entries on tiny towns of only a hundred or so people.

"According to [[Wikipedia:What Wikipedia is not]], Wikipedia articles are not ‘Mere collections of public domain or other source material.’ This article is a mere collection of the US census information. No links to this page, except the county page. I believe the demographics information to be useful, however, without some history, and intelligent writing to go along with it, it is quite useless," said user David Grant.36

Even though he faced vocal criticism for his mass addition, and continued to face it for some time, on balance Ramsey has no regrets.

"The rambot article spurned lots of policy discussions about what Wikipedia was: Should certain stubs be allowed? What types of articles are acceptable? Is a town of 1 person notable? Is a ghost town of 0 people notable?"

Subsequent discussions to try to delete or prune back the "Census articles" were heated but ultimately unsuccessful. The concept started to grow on people, and the novelty of finding one’s previously insignificant hometown in Wikipedia likely gave the project a boost as well. In the end, the majority of Wikipedians found the articles to be a huge step forward, providing the starter seeds for more activity.

"The point is that I wouldn’t have bothered to write any [of] my contributions, and probably many other users wouldn’t either, if Rambot hadn’t given me a starting point and some organization," said User:Meelar.

Ramsey’s 33,832-article addition, causing a 60 percent growth in one week, was by far the largest bump Wikipedia had seen before or has seen since. Historical charts graphing Wikipedia’s growth always have a distinctive "Rambot spike" showing the one-week leap that English Wikipedia undertook in 2002.

The benefits didn’t come without some confusion. Ram-man’s human edits were lumped together with edits made by his software robot. Because other Wikipedians could not tell which was which, they really weren’t sure whether to criticize the person or the bot. Also, when people were reading the Recent Changes list to track community activity, it was completely flooded by Ram-man’s bot edits.

This spawned a new policy as a way to distinguish between humanity and the automatons: Special "bot accounts" would be registered to do bot actions. It was a good idea, as it would make it easier to identify, filter, and undo the mass edits.

Between them, Ram-man and Rambot chalked up more than 100,000 edits by the end of 2004. Ramsey’s additions and subsequent follow-up additions made him the top editor in Wikipedia by far.

Rambot has inspired many other bots, not just to add articles but also to help with mundane, repetitive tasks. Bots have also been modified not simply to work on their own in isolation, but to be "manually assisted" by humans. Spelling is a good example of a task where the community of people managing bots in Wikipedia stated that bots should not run on their own.


"There should be no bots that attempt to fix spelling mistakes in an unattended fashion. It is not technically possible to create such a bot that would not make mistakes, as there will always be places where non-standard spellings are in fact intended."37

Spellbots were among the first bots developed to check articles automatically en masse. Volunteers were asked to help monitor the spellbots’ alarms and, by hand, approve or deny their spelling recommendations.

In practice, bot-assisted editing is somewhat like a hamster feeder, with an endless supply of food (or in this case, spelling mistakes) and Wikipedians scurrying about, grabbing pieces and processing each one.

One user, Lupin, programmed a live spellchecker that ran over every new article saved in Wikipedia. Users could volunteer to watch the output and correct any mistakes that caught their eye. The bot output looked like this:

Josiah Tongogara matched consistant . . .

Dana International matched wich . . .

Clicking on the errant word would allow one to correct the mistake manually.
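The mechanics behind such assisted checking are straightforward. The sketch below is a hypothetical Python fragment, not Lupin’s actual tool: it compares the words of a newly saved article against a small list of known misspellings and reports each match for a volunteer to review by hand.

# Hypothetical sketch of a manually assisted spellcheck: the script only flags
# suspect words; a human decides whether each match is really a mistake.
import re

# A tiny sample list of known misspellings (real lists held thousands of entries).
KNOWN_MISSPELLINGS = {"consistant": "consistent", "wich": "which"}

def flag_misspellings(title: str, text: str) -> None:
    for word in re.findall(r"[A-Za-z]+", text):
        suggestion = KNOWN_MISSPELLINGS.get(word.lower())
        if suggestion:
            # Report the match; a volunteer clicks through and fixes it by hand.
            print(f"{title} matched {word} (suggest: {suggestion}) ...")

flag_misspellings("Josiah Tongogara", "His record was consistant across matches.")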

It’s hard to imagine that Wikipedia could have scaled past 100,000 articles without the assistance of bots automating the tasks of filtering and sorting, and assisting the human editors.

Similarly, it’s hard to imagine today’s massive auto industry still requiring hand-assembly of autos as Ford did with the Model T. Repetitive tasks left to robots (or software robots in this case) allow human beings to do what they’re good at—decision making, redesigning, and adding new features.

The side effect, or piranha effect, was that Rambot’s additions did not just sit there gathering digital dust, entertaining occasional visitors. The basic county and town articles inspired others.
