This is where you typically start in a standard statistics class,
with a clean, orderly dataset. But it's not where you typically
start in the real world.
Once we have this clean dataset, we should do some exploratory data
analysis (EDA). In the course of doing EDA, we may realize that the
dataset isn't actually clean: it may contain duplicates, missing values,
absurd outliers, and data that was never logged or was logged
incorrectly. If that's the case, we may have to go back and collect more
data, or spend more time cleaning the dataset.
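The checks described above can be sketched in a few lines. This is only a minimal illustration using the Python standard library; the function name, the field names, the sample rows, and the outlier threshold are all assumptions made for the example, not part of any particular workflow.

```python
from collections import Counter
from statistics import median


def quick_data_checks(rows, numeric_field, outlier_factor=5.0):
    """Return (duplicates, rows_with_missing_values, outliers) for a list of dicts."""
    # Duplicates: any row whose full contents appear more than once.
    counts = Counter(tuple(sorted(row.items())) for row in rows)
    duplicates = [dict(key) for key, n in counts.items() if n > 1]

    # Missing values: any field that is None or an empty string.
    missing = [row for row in rows
               if any(v is None or v == "" for v in row.values())]

    # Outliers: values far from the median, measured in units of the
    # median absolute deviation (MAD). The threshold is an assumption.
    numeric_rows = [row for row in rows
                    if isinstance(row.get(numeric_field), (int, float))]
    outliers = []
    if numeric_rows:
        values = [row[numeric_field] for row in numeric_rows]
        center = median(values)
        mad = median(abs(v - center) for v in values)
        if mad > 0:
            outliers = [row for row in numeric_rows
                        if abs(row[numeric_field] - center) / mad > outlier_factor]
    return duplicates, missing, outliers


# Hypothetical sample data, invented for illustration.
rows = [
    {"user": "a", "age": 34},
    {"user": "b", "age": 29},
    {"user": "b", "age": 29},    # exact duplicate
    {"user": "c", "age": None},  # value that was never logged
    {"user": "d", "age": 31},
    {"user": "e", "age": 412},   # absurd outlier, probably logged incorrectly
]
duplicates, missing, outliers = quick_data_checks(rows, "age")
```

Each of the three lists points at rows worth a closer look; whether to drop, repair, or re-collect them is a judgment call that depends on the dataset.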
Next, we design our model to use some algorithm like k-nearest
neighbor (k-NN), linear regression, Naive Bayes, or something else.
The model we choose depends on the type of problem we're trying to
solve, of course, which could be a classification problem, a prediction
problem, or a basic description problem.
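To make one of these algorithms concrete, here is a minimal sketch of a k-NN classifier for a classification problem, written with the Python standard library. The function name, the toy points, and the "ham"/"spam" labels are hypothetical; a real project would more likely use an existing library implementation.

```python
from collections import Counter
from math import dist


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Sort training examples by Euclidean distance to the query point.
    neighbors = sorted(zip(train_points, train_labels),
                       key=lambda pair: dist(pair[0], query))
    # Take a majority vote among the k closest labels.
    votes = Counter(label for _, label in neighbors[:k])
    return votes.most_common(1)[0][0]


# Hypothetical two-feature training set, invented for illustration.
train_points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
train_labels = ["ham", "ham", "ham", "spam", "spam", "spam"]

near_first_cluster = knn_predict(train_points, train_labels, (2, 2))
near_second_cluster = knn_predict(train_points, train_labels, (8.5, 8.5))
```

The choice of k and of the distance function are modeling decisions; this sketch fixes them at k=3 and Euclidean distance purely for simplicity.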
We then can interpret, visualize, report, or communicate our results.
This could take the form of reporting the results up to our boss or
coworkers, or publishing a paper in a journal and going out and giving
academic talks about it.
Alternatively, our goal may be to build or prototype a “data product,”
such as a spam classifier, a search ranking algorithm, or a
recommendation system. The key point, and what makes data science
distinct from statistics, is that this data product gets incorporated
back into the real world: users interact with the product, that
interaction generates more data, and the result is a feedback loop.
This is very different from predicting the weather, say, where your
model doesn't influence the outcome at all. For example, you might
predict it will rain next week, and unless you have some powers we
don't know about, you're not going to cause it to rain. But if you instead
build a recommendation system that generates evidence that “lots of
people love this topic,” say, then you need to recognize that your own
system helped create that evidence. That is the feedback loop.
Take this loop into account in any analysis you do by adjusting for any
biases your model caused. Your models are not just predicting the
future, but causing it!
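A toy simulation can make this loop visible. The sketch below is deliberately simplified and deterministic, and every name and number in it is an assumption: two topics are equally appealing, the recommender always shows whichever topic has more logged clicks, and only the topic that is shown can collect new clicks.

```python
def simulate_feedback_loop(rounds=10, clicks_per_round=30):
    """Return logged click counts after repeatedly recommending the leader."""
    # topic_a starts with a one-click head start, which could be pure chance.
    clicks = {"topic_a": 1, "topic_b": 0}
    for _ in range(rounds):
        # The recommender shows the topic with the most logged clicks so far.
        recommended = max(clicks, key=clicks.get)
        # Both topics are assumed equally appealing, so the shown topic
        # collects the same number of clicks no matter which one it is.
        clicks[recommended] += clicks_per_round
    return clicks


logged_clicks = simulate_feedback_loop()
```

In this setup the logs end up suggesting that nearly everyone prefers topic_a, even though the two topics were defined to be equally appealing. The gap comes entirely from what the system chose to show, which is the kind of model-induced bias an analysis of such logs would need to adjust for.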
A data product that is productionized and that users interact with is
at one extreme and the weather is at the other, but regardless of the