I like to explain it like matchmaking. First, there is the problem you are trying
to solve. Second, there is the data that you may or may not have or want to
get. And lastly, there is the algorithm. My primary challenge as a data scientist
is to use the right algorithm to connect the right data to the problem you
actually want solved. However, trying to match these three things up may
also reveal that the problem cannot be addressed with any algorithm
I am aware of. It may also reveal that we have the wrong data. And finally,
there is still the question of whether the problem we are solving is relevant,
well specified, and the right problem to work on in the first place. The
best way to perform this iteration between problem, data, and algorithm is
to have a team of business people and data scientists working
together. The data and algorithm knowledge resides with the data scientists,
but to be able to really connect it to what business problems you want to
solve, it is great if you can bring the data scientists into the room and have
them be part of the discussion from the start.
An example comes to my mind. Recently somebody came to me and asked,
“What is the average age of the cookies that we are seeing?” which on the
surface sounds like a meaningful question, except that it is not actually a very
meaningful question. To answer this question, I can come up with any number
between one hour and three months. Not only that, each of those answers
would be justified. The reason there is so much spread is that if somebody has
third-party cookies disabled on their computer, it looks as if the cookie lives
for zero seconds: I write the cookie, and it just never comes back.
Now, the question is, do I count them or not? If they are part of the average,
we are now talking about a really long-tailed distribution with a huge spike
at zero. Averages are meaningless for a long-tailed distribution with a spike
somewhere. If I leave the zeros in, the answer is an hour. If you ask me what is
the average age of stable cookies that we see at least twice, then my answer
will jump from one hour to three months. So you can see that averages are
meaningless without more specific information.
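The effect described above can be sketched with a small Python example. All of the lifetimes below are invented for illustration and are not data from the interview; the names never_returned, stable, and summarize are hypothetical. The sketch only shows how a spike at zero pulls the mean of a long-tailed distribution toward zero, and how the answer changes once the question is narrowed to cookies seen at least twice.

```python
from statistics import mean, median

# Hypothetical cookie lifetimes in hours. Every number here is invented.
# Cookies that are written but never come back get a lifetime of 0.
never_returned = [0.0] * 9_990

# A small group of stable cookies (seen at least twice) with a long tail.
stable = [24.0, 72.0, 240.0, 720.0, 1440.0,
          2160.0, 2160.0, 2880.0, 4320.0, 4320.0]


def summarize(lifetimes):
    """Return the count, mean, and median of a list of lifetimes in hours."""
    return {
        "n": len(lifetimes),
        "mean_hours": mean(lifetimes),
        "median_hours": median(lifetimes),
    }


# Same data, two different questions, two very different "averages".
all_cookies = summarize(never_returned + stable)
stable_only = summarize(stable)

print("all cookies:   ", all_cookies)
print("stable cookies:", stable_only)
```

With these made-up numbers, the mean over all cookies is under two hours and the median is zero, while the mean over the stable cookies alone is roughly 76 days. Neither figure is wrong; they answer different questions, which is why the question has to be pinned down first.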
So when someone asks me questions like this, my response is to ask a series
of questions: Why do you want to know? What are you going to do with that
thing that I am telling you? What are you going to use it to do? What decision
or decisions are you going to make based on my answer? I am not going
to make a statement or answer the question until I understand what you are
doing.
Gutierrez: Is this part of the reason why it is helpful to have data scientists
as part of the conversation from the beginning?
 