Multivariable Regression - Common Errors in Statistics

A CONJECTURE

A great deal of publicity has heralded the arrival of new and more powerful data mining methods: neural networks, CART, and dozens of unspecified proprietary algorithms. In our limited experience, none of these have lived up to expectations; see a report of our tribulations in Good [2001a, Section 7.6]. Most of the experts we've consulted have attributed this failure to the small size of our test data set, 400 observations each with 30 variables. In fact, many publishers of data mining software assert that their wares are designed solely for use with terabytes of information.

This observation has led us to put our experience in the form of the following conjecture.

If $m$ points are required to determine a univariate regression line with sufficient precision, then it will take at least $m^n$ observations and perhaps $n!\,m^n$ observations to appropriately characterize and evaluate a model with $n$ variables.
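To get a feel for how quickly these bounds grow, the short sketch below evaluates them for a few values of n. The function name and the choice of m = 10 are illustrative assumptions, not figures from the text.

```python
import math


def conjectured_sample_sizes(m, n):
    """Return the conjectured (lower, upper) observation counts.

    lower = m**n and upper = n! * m**n, following the conjecture above.
    m is the number of points needed for a univariate regression line;
    n is the number of variables in the model.
    """
    lower = m ** n
    upper = math.factorial(n) * lower
    return lower, upper


# m = 10 is an arbitrary illustrative value, not taken from the text.
for n in (1, 2, 3, 5):
    lower, upper = conjectured_sample_sizes(10, n)
    print(f"n={n}: at least {lower:,} and perhaps {upper:,} observations")
```

Even with m as small as 2, the lower bound for a model with 30 variables is 2^30, or more than a billion observations, far beyond the 400 observations in the test data set described above.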

BUILDING A SUCCESSFUL MODEL

“Rome was not built in one day,”[4] nor was any reliable model. The only successful approach to modeling lies in a continuous cycle of hypothesis formulation, data gathering, and hypothesis testing and estimation. How you go about it will depend on whether you are new to the field, have a small data set in hand, and are willing and prepared to gather more until the job is done, or you have access to databases containing hundreds of thousands of observations. The following prescription, while directly applicable to the latter case, can be readily modified to fit any situation.

1. A thorough literature search and an understanding of causal mechanisms are essential prerequisites to any study. Don't let the software do your thinking for you.

2. Using a subset of the data selected at random, see which variables appear to be correlated with the dependent variable(s) of interest. (As noted in this and the preceding chapter, two unrelated variables may appear to be correlated by chance alone or as a result of confounding factors. For the same reasons, two closely related factors may fail to exhibit a statistically significant correlation.)

3. Using a second, distinct subset of the data selected at random, see which of the variables selected at the first stage still appear to be correlated with the dependent variable(s) of interest. Alternatively, use the bootstrap method described by Gong [1986] to see which variables are consistently selected for inclusion in the model.

4. Limit attention to one or two of the most significant predictor variables. Select a subset of the existing data which the remainder

[4] John Heywood, Proverbes, Part i, Chapter xi, 16th century.
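Steps 2 and 3 of the prescription above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' procedure and not a reproduction of the method in Gong [1986]. The data are simulated, and the names `pearson`, `screen`, and the cutoff `THRESHOLD = 0.2` are hypothetical choices made for the example. A simple correlation cutoff stands in for whatever selection rule a real study would use.

```python
import math
import random


def pearson(x, y):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant column carries no information
    return sxy / math.sqrt(sxx * syy)


def screen(data, response, rows, candidates, threshold):
    """Keep candidates whose |r| with the response exceeds threshold on rows."""
    y = [data[response][i] for i in rows]
    kept = []
    for name in candidates:
        x = [data[name][i] for i in rows]
        if abs(pearson(x, y)) > threshold:
            kept.append(name)
    return kept


# Simulated stand-in data (an assumption): 400 observations, 30 predictors,
# of which only x0 and x1 actually influence y.
rng = random.Random(0)
N_OBS, N_VARS, THRESHOLD = 400, 30, 0.2
predictors = [f"x{j}" for j in range(N_VARS)]
data = {name: [rng.gauss(0, 1) for _ in range(N_OBS)] for name in predictors}
data["y"] = [2.0 * data["x0"][i] - 1.5 * data["x1"][i] + rng.gauss(0, 1)
             for i in range(N_OBS)]

# Step 2: screen on one random subset. Step 3: confirm on a distinct subset.
rows = list(range(N_OBS))
rng.shuffle(rows)
first_half, second_half = rows[:N_OBS // 2], rows[N_OBS // 2:]
stage1 = screen(data, "y", first_half, predictors, THRESHOLD)
confirmed = screen(data, "y", second_half, stage1, THRESHOLD)

# Alternative to step 3: repeat the screening on bootstrap resamples
# (rows drawn with replacement) and record how often each variable is kept.
N_BOOT = 100
counts = {name: 0 for name in predictors}
for _ in range(N_BOOT):
    resample = [rng.randrange(N_OBS) for _ in range(N_OBS)]
    for name in screen(data, "y", resample, predictors, THRESHOLD):
        counts[name] += 1
selection_freq = {name: counts[name] / N_BOOT for name in predictors}

print("stage 1:", stage1)
print("confirmed:", confirmed)
print("kept in >= 90% of resamples:",
      [name for name in predictors if selection_freq[name] >= 0.9])
```

A variable that passes the first screen by chance is unlikely to pass the second, independent screen as well, and is unlikely to be selected in most bootstrap resamples. That is the reason for requiring confirmation before a variable enters the model.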
