for surgeons based on just abdominal pain. There are on the order of 20 dif-
ferent basic diagnoses you can make based on abdominal pain. Some of these
diagnoses require very quick surgery, appendicitis for example. And so I
was given a fairly large data set for the time, with thousands of samples, with
basic descriptions of patients, with missing values, and things like that, which
you would expect.
The people I talked to who had collected this data set had tried naïve Bayes
and similar approaches on it. Neural nets as a standard tool didn't really
exist yet, but I basically tried this newfangled thing on it, back propagation, and I got
some pretty decent results. This helped me come up with the idea of tailoring
the architecture of the system so that it would be able to identify syndromes
and things like this, which are collections of symptoms, so as to reduce the
number of free parameters in the system, because we knew, even back then in
1986, that overfitting was a big issue.
Gutierrez: Was there a specific aha! moment when you grasped the power
of data?
LeCun: It was never about data for me. For me, data was and is a means to
an end. For me, it's always been about the power of the model you can train,
and so it's about learning algorithms. The wide availability of data came way,
way, way later—like 20 years after I started working on these questions. We
started having large data sets or decent-sized data sets for things like handwrit-
ing recognition or speech recognition in the 1990s. In fact, I published one of
those data sets—the MNIST data set, which is used very frequently for hand-
writing recognition. Now it's not considered big at all, but at the time it was.
The availability of data sets so large that you don't even have time to look at
any piece of data more than once because you have streaming data coming at
you is a very recent phenomenon. A lot of the methods that I am interested
in happen to scale very well in those situations, because I have always been
a believer in things like stochastic gradient descent and similar techniques.
These are things people use now after a hiatus of 10 years. People used other
methods that didn't scale very well because they weren't confronted with this
flow of data, and now that they have data of this size, they're coming back
to these techniques.
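The scaling property LeCun describes can be seen in a minimal sketch of one-pass stochastic gradient descent: each sample is used once as it streams by, so memory stays constant no matter how much data arrives. This is an illustrative toy (1-D linear regression with a made-up synthetic stream and an arbitrary learning rate), not a reconstruction of any system he built.

```python
import random

def sgd_linear_stream(stream, lr=0.01):
    """One-pass SGD for 1-D linear regression y ~ w*x + b.

    Each (x, y) pair is seen exactly once, as in a streaming
    setting, so memory use is constant regardless of data size.
    """
    w, b = 0.0, 0.0
    for x, y in stream:
        err = (w * x + b) - y   # prediction error on this sample
        w -= lr * err * x       # gradient step on the weight
        b -= lr * err           # gradient step on the bias
    return w, b

# Synthetic stream: y = 3x + 1 plus a little Gaussian noise.
random.seed(0)
stream = ((x, 3 * x + 1 + random.gauss(0, 0.1))
          for x in (random.uniform(-1, 1) for _ in range(50_000)))

w, b = sgd_linear_stream(stream, lr=0.05)
```

A batch method would need the whole data set in hand before it could take a single step; here the estimates of `w` and `b` improve continuously as the stream flows past, which is exactly why this family of methods suits data too large to revisit.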
Gutierrez: What conferences, papers, or books in AI research would you
recommend to someone just starting out?
LeCun: There are two different things I am interested in. One is AI or ambi-
tious machine learning, and the other one is what I'll call data science. However,
it's not the industry meaning of data science. What I mean by data science in
this context is really the general problem of extracting knowledge from data,
whether that is done automatically or semiautomatically, and whether we're
talking about the methods or the tools or the infrastructure, and whether
the data has to do with things like business or science or social science