confident it would work. I was betting a lot on it. We had time. We
had resources. We had done what we thought would work, and it still
could have broken. Something could have happened.”
Debate over the value of “domain knowledge” has long polarized the
data community. Much of the promise of unsupervised learning, after
all, is overcoming a crippling dependence on our wonted categories
of social and scientific analysis, as seen in one of many celebrations
of the Obama analytics team. Daniel Wagner, the 29-year-old chief
analytics officer, said:
The notion of a campaign looking for groups such as “soccer
moms” or “waitress moms” to convert is outdated. Campaigns can
now pinpoint individual swing voters. White suburban women?
They're not all the same. The Latino community is very diverse
with very different interests. What the data permits you to do is to
figure out that diversity.
In productive tension with this escape from deadening classifications,
however, the movement to revalidate domain expertise within statistics
seems about as old as formalized data mining.
In a now-infamous Wall Street Journal article, Peggy Noonan mocked
the job ad for the Obama analytics department: “It read like politics
as done by Martians.” The campaign was simply insufficiently human,
with its war room both “high-tech and bloodless.” It went unmentioned
that the contemporaneous Romney ads read similarly.
Data science rests on algorithms but does not reduce to those
algorithms. The use of those algorithms rests fundamentally on what
sociologists of science call “tacit knowledge”: practical knowledge not
easily reducible to articulated rules, or perhaps impossible to reduce
to rules at all. Using algorithms well is fundamentally a very human
endeavor, something not particularly algorithmic.
No warning to young data padawans is as central as the many dangers
of overfitting: mistaking noise for signal in a given training set, or,
alternatively, learning a training set so closely that the model fails
to generalize properly. Avoiding overfitting requires a reflective use
of algorithms. Algorithms are enabling tools requiring us to reflect
more, not less.
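Overfitting is easiest to see in miniature. The sketch below is a
hypothetical illustration in Python with NumPy; the noisy sine curve,
sample sizes, and polynomial degrees are illustrative choices, not
drawn from the text. It compares training error against error on
held-out data as model complexity grows.

    import numpy as np

    rng = np.random.default_rng(0)

    # Small noisy training sample and a larger held-out sample,
    # both drawn from the same underlying curve.
    x_train = rng.uniform(0.0, 1.0, 15)
    y_train = np.sin(2 * np.pi * x_train) + rng.normal(0.0, 0.2, 15)
    x_test = rng.uniform(0.0, 1.0, 200)
    y_test = np.sin(2 * np.pi * x_test) + rng.normal(0.0, 0.2, 200)

    # Fit polynomials of increasing degree (a degree this high on so
    # few points may trigger a NumPy conditioning warning; it still runs).
    for degree in (1, 3, 12):
        coeffs = np.polyfit(x_train, y_train, degree)
        train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
        test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
        print(f"degree {degree:2d}: "
              f"train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")

The high-degree fit drives training error toward zero while held-out
error climbs: the model has memorized the noise, exactly the mistaking
of noise for signal the warning names.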
In 1997 Peter Huber explained, “The problem, as I see it, is not one
of replacing human ingenuity by machine intelligence, but one of
assisting human ingenuity by all conceivable tools of computer science
and artificial intelligence, in particular aiding with the
improvisation of search tools and with keeping track of the progress
of an analysis.” The word “improvisation” is just right in pointing to
mastery of tools, contextual reasoning, and the virtue of avoiding
rote activity.
 