personal and sensitive information about the students, and attempts to completely anonymize large
datasets have fallen short in the past (such as the AOL search data released in 2006 and the Netflix
dataset in 2008). So I decided to investigate.
I downloaded the publicly available codebook of the dataset (gaining access to the data itself required
approval by the researchers) and also started examining various articles and public comments made
about the research project. An examination of the codebook revealed the source was a private,
coeducational institution, whose class of 2009 initially had 1,640 students in it. Elsewhere, the source was
described as a “New England” school. A search through an online college database revealed only seven
private, coed colleges in New England states (Connecticut, Maine, Massachusetts, New Hampshire,
Rhode Island, Vermont) with total undergraduate populations between 5,000 and 7,500 students (a
likely range if there were 1,640 in the 2006 freshman class): Tufts University, Suffolk University, Yale
University, University of Hartford, Quinnipiac University, Brown University, and Harvard College.
The codebook also listed the majors represented in the dataset, which included unique descriptors,
such as Near Eastern Languages and Civilizations, Studies of Women, Gender and Sexuality, and
Organismic and Evolutionary Biology. A quick search revealed that only Harvard provides these degree
programs. The identification of Harvard College was further confirmed after analysis of a June 2008
video presentation by one of the researchers, where he noted that “midway through the freshman year,
students have to pick between one and seven best friends” that they will essentially live with for the
rest of their undergraduate career. This describes the unique method for determining undergraduate
housing at Harvard: all freshmen who complete the fall term enter into a lottery, where they can
designate a “blocking group” of between two and eight students with whom they would like to be
housed in close proximity. I was able to confirm this, again, through a simple Web search.
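
The elimination step described above, keeping only the candidate schools that offer every distinctive major listed in the codebook, amounts to a subset test. The following is a minimal sketch under stated assumptions: the function name `schools_offering_all`, the placeholder school names, and the toy catalogs are hypothetical and do not reproduce any real course catalog or the actual search that was performed.

```python
def schools_offering_all(catalogs, required_majors):
    """Return the sorted names of schools whose catalog contains every required major."""
    required = set(required_majors)
    # `required <= majors` is Python's subset test on sets.
    return sorted(
        name for name, majors in catalogs.items() if required <= set(majors)
    )


# Hypothetical toy catalogs; the school names are placeholders, not real data.
catalogs = {
    "School A": {
        "Economics",
        "Near Eastern Languages and Civilizations",
        "Organismic and Evolutionary Biology",
    },
    "School B": {"Economics", "Biology"},
    "School C": {"History", "Near Eastern Languages and Civilizations"},
}

# Distinctive descriptors of the kind the codebook listed.
distinctive = [
    "Near Eastern Languages and Civilizations",
    "Organismic and Evolutionary Biology",
]

print(schools_offering_all(catalogs, distinctive))  # ['School A']
```

The more unusual the descriptors, the faster the candidate list shrinks, which is why a handful of distinctive major names was enough to single out one institution.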
The announcement of this likely identification of the source of the T3 dataset did not prompt a public
reply by the research team, but within a week of the discovery the dataset was pulled from the publicly
available repository.
Why does it matter that you were able to determine the subjects of the T3 study were
Harvard students?
There are two primary concerns. First, there is the issue of possibly being able to identify particular
subjects in the dataset. The researchers took care to remove obvious identifiable data (names, email
addresses, etc.), but now that the source of the dataset had been determined, it might be easier to
identify unique individuals. For example, the codebook reveals that there is only one person in the
dataset from each of the states of Delaware, Louisiana, Mississippi, Montana, and Wyoming. Some
time in front of a search engine might reveal the identity of the one student whom the state of Delaware
sent to Harvard in 2006. Once we've identified that student, we can connect her with her personal
data elements in the dataset. In short, the privacy of the subjects in the database is at risk.
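
The uniqueness risk described above can be checked mechanically before a dataset is released. The following is a minimal sketch, assuming a hypothetical list of records with a `home_state` field; the field name, the function `singleton_values`, and the sample rows are illustrative only and are not taken from the T3 dataset or its codebook.

```python
from collections import Counter


def singleton_values(records, field):
    """Return the sorted values of `field` that occur in exactly one record."""
    counts = Counter(record[field] for record in records)
    return sorted(value for value, n in counts.items() if n == 1)


# Hypothetical toy records; not real T3 data.
records = [
    {"id": 1, "home_state": "Massachusetts"},
    {"id": 2, "home_state": "Massachusetts"},
    {"id": 3, "home_state": "Delaware"},
    {"id": 4, "home_state": "New York"},
    {"id": 5, "home_state": "New York"},
    {"id": 6, "home_state": "Wyoming"},
]

# Any value printed here identifies a single record once the source is known.
print(singleton_values(records, "home_state"))  # ['Delaware', 'Wyoming']
```

Any value that appears only once behaves like a name as soon as the population is known, so a check like this is a reasonable first screen, though it does not catch combinations of fields that are jointly unique.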
My other concern is actually greater: that the researchers felt their methodology was sufficient. There
were a number of good-faith steps taken by the research team, but each fell short. The research team has
defended itself by noting it only gathered Facebook information that was already publicly accessible.
However, the team utilized Harvard graduate students to access and retrieve the profile data. At the
time of the study, it was possible for Facebook users to restrict access to their profiles to only people
within their home university. Thus it is entirely possible that the research team had privileged access
to a profile by virtue of being within the Harvard network, while the general public would have been
locked out by the user's privacy settings. Researchers must avoid such cavalier positions: just because
something happens to be accessible on a social media site does not mean that it is free for the taking,
no questions asked.
 