Eric Jonas - Data Scientists at Work

Jonas: If I'm not familiar with data, then I generally don't even start. I recently

met Winfried Denk, who invented two-photon microscopy and is a very smart

applied-physicist guy who's received many, many, many awards. His comment

to me in this area was that the number-one thing you have to be able to do

is actually know what questions to ask. And so I try not to get involved in

projects where I don't know what the right questions are. And then generally,

if I know the questions, I understand the data well enough to then start

thinking about the modeling. The nice thing about modeling is that you can

fairly rapidly turn around and try a bunch of different things. But if you haven't

even looked at the data and done the most basic things, then it's very easy to

be led astray.

Gutierrez: How do you look at the data?

Jonas: Matplotlib in Python. I make a bunch of initial plots and then play

around with the data. A lot of the data I work with looks very different from

the kinds of data that show up more on the industry side of things. No one in

science really uses a relational database, because we either have time series,

or graphs, or images, or all these weird things. Rarely do we get relational

facts. So I don't end up using SQL that much. It's much more about writing a

bunch of custom scripts to parse through 100 gigabytes of time-series data

and look at different spectral bands or something similar.
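
[Editor's sketch: the kind of first-look script described above might resemble the following. It is a minimal illustration, not Jonas's actual code. The synthetic signal, the band edges, and the function names `band_power`, `make_synthetic_series`, and `plot_overview` are all assumptions made for the example; a real script would read the recorded time series from disk in chunks instead of generating one.]

```python
import numpy as np


def band_power(signal, sample_rate, f_lo, f_hi):
    """Mean periodogram power of `signal` in the band [f_lo, f_hi) Hz."""
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    power = np.abs(np.fft.rfft(signal)) ** 2 / len(signal)
    mask = (freqs >= f_lo) & (freqs < f_hi)
    return power[mask].mean()


def make_synthetic_series(sample_rate=1000, seconds=2.0, seed=0):
    """Hypothetical stand-in for real data: 10 Hz and 60 Hz tones plus noise."""
    rng = np.random.default_rng(seed)
    t = np.arange(int(sample_rate * seconds)) / sample_rate
    x = (np.sin(2 * np.pi * 10 * t)
         + 0.5 * np.sin(2 * np.pi * 60 * t)
         + 0.1 * rng.standard_normal(t.size))
    return t, x


def plot_overview(t, x, sample_rate, path="overview.png"):
    """Save a two-panel first look: raw trace on top, power spectrum below."""
    import matplotlib
    matplotlib.use("Agg")  # render to a file; no display needed
    import matplotlib.pyplot as plt

    freqs = np.fft.rfftfreq(len(x), d=1.0 / sample_rate)
    power = np.abs(np.fft.rfft(x)) ** 2 / len(x)

    fig, (ax_time, ax_freq) = plt.subplots(2, 1, figsize=(8, 6))
    ax_time.plot(t, x, linewidth=0.5)
    ax_time.set_xlabel("time (s)")
    ax_time.set_ylabel("amplitude")
    ax_freq.semilogy(freqs, power)
    ax_freq.set_xlabel("frequency (Hz)")
    ax_freq.set_ylabel("power")
    fig.tight_layout()
    fig.savefig(path)
    plt.close(fig)


if __name__ == "__main__":
    rate = 1000
    t, x = make_synthetic_series(sample_rate=rate)
    # Band edges below are arbitrary choices for this illustration.
    for lo, hi in [(8, 12), (55, 65), (200, 300)]:
        print(f"{lo}-{hi} Hz: {band_power(x, rate, lo, hi):.4f}")
    plot_overview(t, x, rate)
```

[Plotting the raw trace next to the spectrum is one way to do "the most basic things" before any modeling: the band containing a tone should show clearly more power than a band containing only noise.]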

Gutierrez: What do you look for in other people's work?

Jonas: On the research side, my answer is different from many of the people

I work with and other people in the field. One of my colleagues told me that

I read more papers than anyone they know. I don't actually really read most of

the papers. I read the title and the abstract, look at the figures, and then move

on. For example, when I evaluate machine learning papers, what I am looking

to find out is whether the technique worked or not. This is something that

the world needs to know—most papers don't actually tell you whether the

thing worked. It's really infuriating because most papers will show five dataset

examples and then show that they're slightly better on two different metrics

when comparing against something from 20 years ago. In academia, it's fine. In

industry, it's infuriating, because you need to know what actually works and

what doesn't.

So a lot of what I look for is: “Do I think that their approach was valid? Do I

know them?” The degree to which I will read papers from people I know and

trust is far higher than for those whom I don't know. People complain that

it's hard for new people to break into fields. Well, that's partly because at any

given time, 99 percent of the time people are all new and they're cranks. So a

lot of it is: “Do I find the structure of this model to be interesting? Do I think

they did inference properly? Did they ask the basic questions? Do I believe

those results? Is the answer something that I would have believed before
