Univariate Regression - Common Errors in Statistics


the entire line to pass closely to the outlier. While a number of methods exist for detecting the most influential observations (see, for example, Mosteller and Tukey, 1977), influential does not automatically mean that the data point is in error. Measures of influence encourage review of data for exclusion. Statistics do not exclude data; analysts do. And they exclude data only when presented with firm evidence that the data are in error.

The problem of bad data is particularly acute in two instances:

1. When most of the data are at one end of the line, so that a few observations at the far end can have undue influence on the estimated model.

2. When there is no causal relationship between X and Y.
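
The first situation can be illustrated with a minimal sketch. The numbers below are made up purely for illustration, and `fit_line` is a hypothetical helper that implements ordinary least squares; it is not taken from the text. Ten observations sit at the low end of the X range and show no trend at all, yet a single observation at the far end is enough to give the fitted line a clearly positive slope.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Hypothetical data: ten points clustered at the low end, with no trend.
xs = [1, 2, 3, 4, 5, 1, 2, 3, 4, 5]
ys = [5, 4, 6, 5, 5, 5, 6, 4, 5, 5]
intercept_cluster, slope_cluster = fit_line(xs, ys)

# Add one (hypothetical) observation at the far end of the X range.
intercept_far, slope_far = fit_line(xs + [50], ys + [30])

print("slope without the far-end point:", slope_cluster)
print("slope with the far-end point:   ", round(slope_far, 3))
```

Whether the far-end point is an error or a genuine observation cannot be read off the fit; the sketch only shows how much weight one such point can carry.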

The Washington State Department of Social and Health Services extrapolates its audit results on the basis of a regression of over- and undercharges against the dollar amount of the claim. Because the frequency of errors depends on the amount of paperwork involved and not on the dollar amount of the claim, no linear relationship exists between overcharges and the amount of the claim. The slope of the regression line can vary widely from sample to sample; the removal or addition of a very few samples to the original audit can dramatically affect the amount claimed by the state in overcharges.
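
The instability described above can be reproduced in a small simulation. This is only a sketch on synthetic numbers: the claim amounts, the overcharge distribution, the audit size of 25, and the helper names `fit_slope` and `audit_slope` are all assumptions made for illustration and have nothing to do with the actual audit data. Overcharges are generated independently of the claim amount, so any slope an audit sample produces is noise.

```python
import random

def fit_slope(xs, ys):
    """Least-squares slope of y regressed on x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx

rng = random.Random(1)  # fixed seed so the run is repeatable

# Synthetic population of claims: the overcharge does NOT depend on the amount.
amounts = [rng.uniform(100, 10_000) for _ in range(5_000)]
overcharges = [rng.gauss(20, 200) for _ in amounts]

def audit_slope(sample_size):
    """Draw one audit sample without replacement and fit the regression slope."""
    idx = rng.sample(range(len(amounts)), sample_size)
    return fit_slope([amounts[i] for i in idx], [overcharges[i] for i in idx])

# Repeat the audit many times and look at how the fitted slope moves around.
slopes = [audit_slope(25) for _ in range(200)]
share_negative = sum(s < 0 for s in slopes) / len(slopes)

print("smallest slope:", min(slopes))
print("largest slope: ", max(slopes))
print("share of audits with a negative slope:", share_negative)
```

Because an extrapolated total scales with the fitted slope, a slope that changes sign from one audit sample to the next translates directly into an unstable dollar figure.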

We recommend the delete-one approach, in which the regression coefficients are recomputed repeatedly, deleting a single pair of observations from the original data set each time. These calculations provide confidence intervals for the estimates along with an estimate of the sensitivity of the regression to outliers. When the number of data pairs exceeds 100, a bootstrap might be used instead.
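
A minimal sketch of the delete-one computation follows. The data are hypothetical, and the helper names `fit_line`, `delete_one`, and `jackknife_se` are illustrative rather than drawn from the text. The standard error uses the usual jackknife formula, sqrt((n - 1)/n * sum of squared deviations of the delete-one estimates from their mean); turning it into a confidence interval requires a further distributional assumption that the sketch leaves to the analyst.

```python
import math

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def delete_one(xs, ys):
    """Refit the line n times, leaving out a different (x, y) pair each time."""
    fits = []
    for i in range(len(xs)):
        fits.append(fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:]))
    return fits

def jackknife_se(values):
    """Jackknife standard error computed from the delete-one estimates."""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt((n - 1) / n * sum((v - mean) ** 2 for v in values))

# Hypothetical data: seven pairs close to a line, plus one pair far from it.
xs = [1, 2, 3, 4, 5, 6, 7, 20]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 15.0]

_, full_slope = fit_line(xs, ys)
loo_slopes = [slope for _, slope in delete_one(xs, ys)]

# Precision: jackknife standard error of the slope.
slope_se = jackknife_se(loo_slopes)

# Sensitivity: which single deletion moves the slope the most?
changes = [abs(s - full_slope) for s in loo_slopes]
most_influential = changes.index(max(changes))

print("slope from all pairs:", round(full_slope, 3))
print("jackknife standard error:", round(slope_se, 3))
print("most influential pair:", (xs[most_influential], ys[most_influential]))
```

The pair flagged here is a candidate for review, not for automatic exclusion; as noted earlier, influence alone is not evidence that an observation is in error.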

To get an estimate of the precision of the estimates and the sensitivity of the regression equation to bad data, recompute the coefficients leaving out a different data pair each time.
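
For larger data sets, where a bootstrap is suggested in place of delete-one, one common variant resamples the (X, Y) pairs with replacement and refits the line each time. The sketch below assumes that variant along with a simple percentile interval; the text does not specify which bootstrap it has in mind. The data are synthetic, and the sample size of 150, the 1,000 resamples, and the helper name `fit_slope` are arbitrary illustrative choices.

```python
import random

def fit_slope(xs, ys):
    """Least-squares slope of y regressed on x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx

rng = random.Random(42)  # fixed seed so the run is repeatable

# Synthetic data: 150 pairs scattered about the line y = 2 + 0.5 x.
n = 150
xs = [rng.uniform(0, 10) for _ in range(n)]
ys = [2.0 + 0.5 * x + rng.gauss(0, 1) for x in xs]

full_slope = fit_slope(xs, ys)

# Pairs bootstrap: resample n (x, y) pairs with replacement, then refit.
n_boot = 1000
boot_slopes = []
for _ in range(n_boot):
    idx = [rng.randrange(n) for _ in range(n)]
    boot_slopes.append(fit_slope([xs[i] for i in idx], [ys[i] for i in idx]))

# Percentile interval: middle 95% of the bootstrap slopes.
boot_slopes.sort()
lower = boot_slopes[n_boot * 25 // 1000]
upper = boot_slopes[n_boot * 975 // 1000 - 1]

print("slope from all pairs:", round(full_slope, 3))
print("95% percentile interval:", (round(lower, 3), round(upper, 3)))
```

Resampling whole pairs keeps each Y attached to its own X, so the procedure does not rely on the residuals having constant variance.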

Convenience

More often than we would like to admit, the variables and data that go into our models are chosen for us. We cannot directly measure the variables we are interested in, so we make do with surrogates. But such surrogates may or may not be directly related to the variables of interest. Lack of funds and/or the necessary instrumentation limits the range over which observations can be made. Our census overlooks the homeless, the uncooperative, and the less luminous. (See, for example, City of New York v. Dept of Commerce,² Disney [1976], and Bothun [1998, Chapter 6].)

² 822 F. Supp. 906 (E.D.N.Y., 1993).
