p-values
In the context of regression, where you're trying to estimate coefficients (the βs), thinking in terms of p-values means assuming a null hypothesis that the βs are zero. For any given β, the p-value captures the probability of observing a test statistic (in this case the estimated β) at least as extreme as the one you got, under the null hypothesis. Specifically, if you have a low p-value, it is highly unlikely that you would observe such a test statistic if the null hypothesis actually held. This translates to meaning that (with some confidence) the coefficient is highly likely to be non-zero.
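For concreteness, here is a minimal sketch (with simulated data, not an example from this chapter) of reading coefficient p-values off an ordinary least squares fit using Python's statsmodels; the variable names and the true coefficients below are made up for illustration:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 100
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.5 * x1 + rng.normal(size=n)   # x2's true coefficient is zero
X = sm.add_constant(np.column_stack([x1, x2]))
results = sm.OLS(y, X).fit()
# Expect a tiny p-value for x1 (evidence its beta is non-zero) and a
# large p-value for x2 (consistent with the null hypothesis beta = 0).
print(results.pvalues)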
AIC (Akaike Information Criterion)
Given by the formula 2k − 2 ln(L), where k is the number of parameters in the model and ln(L) is the "maximized value of the log likelihood." The goal is to minimize AIC.
BIC (Bayesian Information Criterion)
Given by the formula k ln(n) − 2 ln(L), where k is the number of parameters in the model, n is the number of observations (data points, or users), and ln(L) is the maximized value of the log likelihood. The goal is to minimize BIC.
Entropy
This will be discussed more in "Embedded Methods: Decision Trees" on page 184.
In practice
As mentioned, stepwise regression explores a large space of possible models, so there is a danger of overfitting: it will often fit much better in-sample than it does on new, out-of-sample data.
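To make that concrete, here is a minimal forward-stepwise sketch scored by AIC; the simulated data, the greedy add-one-feature-at-a-time rule, and the stopping condition are illustrative assumptions rather than a definitive implementation:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 5))                              # 5 candidate features
y = 2.0 * X[:, 0] - 1.0 * X[:, 2] + rng.normal(size=n)   # only 2 matter

def fit_aic(cols):
    # Fit OLS on the given feature columns and return the model's AIC.
    design = sm.add_constant(X[:, cols]) if cols else np.ones((n, 1))
    return sm.OLS(y, design).fit().aic

selected, remaining = [], list(range(X.shape[1]))
current_aic = fit_aic(selected)
while remaining:
    # Try adding each remaining feature; keep the one that lowers AIC most.
    best_aic, best_j = min((fit_aic(selected + [j]), j) for j in remaining)
    if best_aic >= current_aic:
        break                      # no feature improves AIC; stop searching
    selected.append(best_j)
    remaining.remove(best_j)
    current_aic = best_aic

print("selected features:", selected, "AIC:", round(current_aic, 1))

Even with AIC's penalty term, greedily searching over many subsets can still latch onto spurious features, which is exactly the overfitting risk noted above.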
You don't have to retrain models at each step of these approaches,
because there are fancy ways to see how your objective function (aka
selection criterion) changes as you change the subset of features you
are trying out. These are called “finite differences” and rely essentially
on Taylor Series expansions of the objective function.
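As a toy illustration of that Taylor-series idea (not the actual bookkeeping a stepwise implementation would use), the change in a simple objective under a small parameter step can be approximated from a single slope estimate instead of a full re-evaluation:

def objective(beta):
    # A hypothetical one-parameter objective, e.g. a residual sum of squares
    return (beta - 3.0) ** 2 + 1.0

beta, h = 1.0, 0.1
exact_change = objective(beta + h) - objective(beta)
# First-order approximation: f(beta + h) - f(beta) is roughly h * f'(beta),
# with f'(beta) itself estimated by a finite difference.
slope = (objective(beta + 1e-6) - objective(beta)) / 1e-6
approx_change = h * slope
print(exact_change, approx_change)   # the two agree closely for small h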
One last word: if you have a domain expert on hand, don't go down the machine learning rabbit hole of feature selection until you've tapped into your expert's knowledge completely!