A CONJECTURE
A great deal of publicity has heralded the arrival of new and more
powerful data mining methods: neural networks, CART, and dozens of
unspecified proprietary algorithms. In our limited experience, none of
these has lived up to expectations; see a report of our tribulations in
Good [2001a, Section 7.6]. Most of the experts we've consulted have
attributed this failure to the small size of our test data set, 400
observations each with 30 variables. In fact, many publishers of data
mining software assert that their wares are designed solely for use with
terabytes of information.
This observation has led us to put our experience in the form of the
following conjecture.
If $m$ points are required to determine a univariate regression line with
sufficient precision, then it will take at least $m^n$ observations and
perhaps $n!\,m^n$ observations to appropriately characterize and evaluate
a model with $n$ variables.
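To get a feel for how quickly these bounds grow, the following minimal sketch evaluates both expressions from the conjecture. The function name conjectured_sample_sizes and the choice m = 10 are assumptions made purely for illustration; the text does not say how large m should be.

```python
from math import factorial


def conjectured_sample_sizes(m: int, n: int) -> tuple[int, int]:
    """Return the two observation counts named in the conjecture.

    m: points assumed sufficient for a univariate regression line
       (a hypothetical input; the text gives no value for it)
    n: number of variables in the model
    """
    lower = m ** n                 # "at least m^n observations"
    upper = factorial(n) * lower   # "perhaps n! m^n observations"
    return lower, upper


if __name__ == "__main__":
    m = 10  # illustrative assumption only
    for n in (1, 2, 3, 5, 10):
        lower, upper = conjectured_sample_sizes(m, n)
        print(f"n={n:2d}: at least {lower:,} and perhaps {upper:,} observations")
```

Even under the very generous assumption m = 2, a model with 30 variables would call for at least 2^30 (roughly a billion) observations if the conjecture holds, far more than the 400 in the test data set described above.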
BUILDING A SUCCESSFUL MODEL
“Rome was not built in one day,”[4] nor was any reliable model. The only
successful approach to modeling lies in a continuous cycle of hypothesis
formulation, data gathering, and hypothesis testing and estimation. How
you go about it will depend on whether you are new to the field, have a
small data set in hand and are willing and prepared to gather more until
the job is done, or have access to databases containing hundreds of
thousands of observations. The following prescription, while directly
applicable to the last case, can be readily modified to fit any situation.
1. A thorough literature search and an understanding of causal
mechanisms are essential prerequisites to any study. Don't let the
software do your thinking for you.
2. Using a subset of the data selected at random, see which variables
appear to be correlated with the dependent variable(s) of interest.
(As noted in this and the preceding chapter, two unrelated vari-
ables may appear to be correlated by chance alone or as a result of
confounding factors. For the same reasons, two closely related
factors may fail to exhibit a statistically significant correlation.)
3. Using a second, distinct subset of the data selected at random, see
which of the variables selected at the first stage still appear to be
correlated with the dependent variable(s) of interest. Alternatively,
use the bootstrap method described by Gong [1986] to see which
variables are consistently selected for inclusion in the model.
4. Limit attention to one or two of the most significant predictor
variables. Select a subset of the existing data which the remainder
[4] John Heywood, Proverbes, Part i, Chapter xi, 16th century.
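Steps 2 and 3 of the prescription can be sketched in code as follows. This is an illustration under stated assumptions, not the authors' procedure. The data are synthetic and merely sized like the 400-by-30 test set mentioned earlier. The cutoff |r| > 0.2, the half-and-half random split, and all function names are hypothetical choices. The bootstrap shown is a plain resampling of rows that counts how often each variable passes the screen; it is not claimed to reproduce the method of Gong [1986].

```python
import random
from math import sqrt


def pearson_r(x, y):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def screen(rows, y, idx, candidates, threshold):
    """Candidate columns whose |r| with y exceeds threshold on rows idx."""
    ys = [y[i] for i in idx]
    keep = set()
    for j in candidates:
        xs = [rows[i][j] for i in idx]
        if abs(pearson_r(xs, ys)) > threshold:
            keep.add(j)
    return keep


def two_stage_screen(rows, y, rng, threshold=0.2):
    """Step 2 on one random subset, then step 3 on a second, distinct one."""
    idx = list(range(len(y)))
    rng.shuffle(idx)
    half = len(idx) // 2
    first, second = idx[:half], idx[half:]
    stage1 = screen(rows, y, first, range(len(rows[0])), threshold)
    # Only variables that passed the first stage are re-examined.
    return screen(rows, y, second, stage1, threshold)


def bootstrap_selection_frequency(rows, y, rng, n_boot=100, threshold=0.2):
    """Fraction of bootstrap resamples in which each column passes the screen."""
    n, p = len(y), len(rows[0])
    counts = [0] * p
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # rows drawn with replacement
        for j in screen(rows, y, idx, range(p), threshold):
            counts[j] += 1
    return [c / n_boot for c in counts]


# Synthetic illustration: only columns 0 and 3 actually drive y.
rng = random.Random(0)
rows = [[rng.gauss(0, 1) for _ in range(30)] for _ in range(400)]
y = [2.0 * r[0] - 1.5 * r[3] + rng.gauss(0, 1) for r in rows]

survivors = two_stage_screen(rows, y, rng)
freq = bootstrap_selection_frequency(rows, y, rng)
print("survived both subsets:", sorted(survivors))
print("bootstrap selection frequency, first five columns:",
      [round(f, 2) for f in freq[:5]])
```

Requiring a variable to pass on two distinct subsets, or to be selected in most bootstrap resamples, guards against the chance correlations that the parenthetical note in step 2 warns about. It does nothing about confounding, which is why step 1 comes first.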