But we always have more than one way of doing this translation: more than one possible model, more than one associated metric, and possibly more than one optimization. So the science in data science is, given raw data, constraints, and a problem statement, how to navigate that maze and make the best choices. Every design choice you make can be formulated as a hypothesis, which you then subject to rigorous testing and experimentation to either validate or refute.
This process, whereby one formulates a well-defined hypothesis and then tests it, might rise to the level of a science in certain cases. Specifically, the scientific method is adopted in data science as follows:
• Hold on to your existing best performer.
• Once you have a new idea to prototype, set up an experiment in which the two best models compete.
• Rinse and repeat (while not overfitting).
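The loop above can be sketched in code. This is a minimal illustration only: the one-feature toy dataset, the two threshold "models," and the accuracy metric are all assumptions made for the sketch, not anything from the text; a real experiment would use proper models and careful validation.

```python
import random

random.seed(42)

# Toy labeled data: feature x in [0, 1], true label 1 when x > 0.5,
# with 10% label noise.  All values here are illustrative.
def make_row():
    x = random.random()
    y = int(x > 0.5)
    if random.random() < 0.1:
        y = 1 - y  # noise: flip the label
    return x, y

rows = [make_row() for _ in range(300)]
train, holdout = rows[:200], rows[200:]

def accuracy(model, data):
    return sum(model(x) == y for x, y in data) / len(data)

# Champion: the existing best performer.
champion = lambda x: int(x > 0.5)
# Challenger: the new idea to prototype.
challenger = lambda x: int(x > 0.4)

# The experiment: both models compete on the same held-out data;
# keep whichever wins, then rinse and repeat with the next idea.
best = max([champion, challenger], key=lambda m: accuracy(m, holdout))
```

Holding out data the models never trained on is what keeps the "rinse and repeat" step from quietly overfitting.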
Classifiers
This section focuses on the process of choosing a classifier. Classification involves mapping your data points into a finite set of labels or the probability of a given label or labels. We've already seen some examples of classification algorithms, such as Naive Bayes and k-nearest neighbors (k-NN), in the previous chapters. Table 5-1 shows a few examples of when you'd want to use classification:
Table 5-1. Classifier example questions and answers

Question                                      Answer
"Will someone click on this ad?"              0 or 1 (no or yes)
"What number is this (image recognition)?"    0, 1, 2, etc.
"What is this news article about?"            "Sports"
"Is this spam?"                               0 or 1
"Is this pill good for headaches?"            0 or 1
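To make "mapping data points into labels, or the probability of a label" concrete, here is a tiny k-NN classifier in the spirit of the earlier chapters. The points, labels, and choice of k are illustrative assumptions, not data from the text.

```python
import math
from collections import Counter

# Toy training set: 2-D points with labels.  All values are made up
# purely for illustration.
points = [((1.0, 1.0), "spam"), ((1.2, 0.9), "spam"),
          ((4.0, 4.2), "ham"),  ((3.8, 4.0), "ham")]

def knn_predict(query, k=3):
    """Map a data point to a label: majority vote of the k nearest."""
    nearest = sorted(points, key=lambda p: math.dist(p[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def knn_proba(query, label, k=3):
    """Or map it to the probability of a given label: the fraction
    of the k nearest neighbors carrying that label."""
    nearest = sorted(points, key=lambda p: math.dist(p[0], query))[:k]
    return sum(lab == label for _, lab in nearest) / k
```

The two functions show the two outputs the text describes: a hard label, or a probability for a label.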
From now on, we'll talk about binary classification only (0 or 1). In this chapter we focus on logistic regression, but there are other classification algorithms available, including decision trees (which we'll cover in Chapter 7), random forests (Chapter 7), and support vector machines.
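As a preview of the binary (0 or 1) setting, here is a minimal from-scratch sketch of fitting a logistic regression by gradient descent on the log-loss. The one-feature data, learning rate, and iteration count are illustrative assumptions, not the chapter's method verbatim.

```python
import math

# Toy 1-D data: label flips from 0 to 1 somewhere between x=1.5 and x=2.0.
X = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5]
y = [0,   0,   0,   0,   1,   1,   1,   1]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Fit weight w and bias b by gradient descent on the mean log-loss.
w, b, lr = 0.0, 0.0, 0.5
for _ in range(2000):
    grad_w = sum((sigmoid(w * x + b) - t) * x for x, t in zip(X, y)) / len(X)
    grad_b = sum((sigmoid(w * x + b) - t) for x, t in zip(X, y)) / len(X)
    w -= lr * grad_w
    b -= lr * grad_b

# The model outputs a probability; thresholding at 0.5 gives the 0/1 label.
proba = lambda x: sigmoid(w * x + b)
predict = lambda x: int(proba(x) > 0.5)
```

Note that the model's raw output is a probability, which ties back to the earlier point that classifiers can emit either labels or label probabilities.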