Caitlin Smallwood - Data Scientists at Work


out as the commonalities across all those approaches. As we talked about

earlier, experience with different models, different data sets, and wrinkles with

data sets is hugely important.

Gutierrez: How does someone develop the skill to know how to choose

the right technique to apply to a problem?

Smallwood: It's about trying a lot of the different techniques and learning

some of the common pitfalls that you would come across with the different

techniques. I also think there's a lot to be said for working in a collaborative

environment where you can show your approach to someone else and

hear their feedback on questions like: Why was that a good idea? Why was

that not a good idea?

It's hard to know if you're working in isolation on a model. You would have a

hard time knowing whether you built the right kind of model or not, because

the model will output something regardless of whether you modeled it correctly.

If you're cocky or full of ego, you'll just believe you did the right thing and not

stop to think about whether you actually did the right thing. It comes back

to being egoless and open-minded. So I think it's really hard to learn how to

choose the right technique to apply to a problem without getting feedback

from multiple people in the space who have experience as well. The more

people whom you can get feedback from over time, the better. I really think

that's a great way to progress.

Gutierrez: What advice is helpful for people moving into the field?

Smallwood: I would say to always bite the bullet with regard to understanding

the basics of the data first before you do anything else, even though it's

not sexy and not as fun. In other words, put effort into understanding how the

data is captured, understand exactly how each data field is defined, and understand

when data is missing. If the data is missing, does that mean something in

and of itself? Is it missing only in certain situations? These little, teeny nuanced

data gotchas will really get you. They really will.
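
The question of whether data is "missing only in certain situations" can be checked directly before any modeling. The sketch below is one minimal way to do it, not a method described in the interview: the function name `missing_rate_by_group`, the field names `device` and `rating`, and the sample records are all hypothetical, and a missing value is assumed to be represented as `None`.

```python
from collections import defaultdict


def missing_rate_by_group(records, field, group_field):
    """Return {group value: fraction of records in that group where `field` is missing}."""
    totals = defaultdict(int)
    missing = defaultdict(int)
    for row in records:
        group = row.get(group_field)
        totals[group] += 1
        # Assumption: a missing value is stored as None (or the key is absent).
        if row.get(field) is None:
            missing[group] += 1
    return {group: missing[group] / totals[group] for group in totals}


# Hypothetical records, invented purely for illustration.
records = [
    {"device": "tv", "rating": 4},
    {"device": "tv", "rating": None},
    {"device": "tv", "rating": None},
    {"device": "phone", "rating": 5},
    {"device": "phone", "rating": 3},
]

rates = missing_rate_by_group(records, "rating", "device")
for device, rate in rates.items():
    print(f"{device}: {rate:.0%} of ratings missing")
```

If the missing rate differs sharply between groups, as it does in this toy data, the missingness itself probably carries information and is worth understanding before choosing a model.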

You can use the most sophisticated algorithm under the sun, but it's the same

old junk-in-junk-out thing. You cannot turn a blind eye to the raw data, no

matter how excited you are to get to the fun part of the modeling. Dot your

i's, cross your t's, and check everything you can about the underlying data

before you go down the path of developing a model.

Another thing I've learned over time is that a mix of algorithms is almost

always better than one single algorithm in the context of a system, because different

techniques exploit different aspects of the patterns in the data, especially

in complex large data sets. So while you can take one particular algorithm and

iterate and iterate to make it better, I have almost always seen that a combination

of algorithms tends to do better than just one algorithm.
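
As a toy illustration of why a mix can beat a single algorithm, the sketch below averages the predictions of two hypothetical models whose errors fall on different examples. The numbers are invented for illustration, simple averaging is only one of many ways to combine models, and a blend is not guaranteed to improve on its components in general.

```python
def mean_squared_error(truth, predictions):
    """Average of squared differences between true values and predictions."""
    return sum((t - p) ** 2 for t, p in zip(truth, predictions)) / len(truth)


# Hypothetical true values and predictions from two made-up models.
truth = [3, 5, 7, 9]
model_a = [4, 5, 6, 10]  # wrong on the 1st, 3rd, and 4th examples
model_b = [2, 6, 8, 9]   # wrong on the 1st, 2nd, and 3rd examples

# Simple blend: average the two models' predictions example by example.
blend = [(a + b) / 2 for a, b in zip(model_a, model_b)]

mse_a = mean_squared_error(truth, model_a)
mse_b = mean_squared_error(truth, model_b)
mse_blend = mean_squared_error(truth, blend)

print(f"model A: {mse_a}, model B: {mse_b}, blend: {mse_blend}")
```

Here the blend has a lower error than either model because the two models err in different directions on different examples, so their mistakes partly cancel.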
