the entire line to pass closely to the outlier. While a number of methods
exist for detecting the most influential observations (see, for example,
Mosteller and Tukey, 1977), influential does not automatically mean that
the data point is in error. Measures of influence encourage review of data
for exclusion. Statistics do not exclude data; analysts do. And they exclude
data only when presented with firm evidence that the data are in error.
The problem of bad data is particularly acute in two instances:
1. When most of the data are at one end of the line, so that a few
observations at the far end can have undue influence on the
estimated model.
2. When there is no causal relationship between X and Y.
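The first case can be made concrete with leverage, a standard diagnostic that the text does not name and that is used here as an assumption. For a straight-line fit, the leverage of observation i is 1/n + (x_i - mean)^2 / S_xx, so it depends only on how far x_i lies from the bulk of the X values. The data below are made up purely for illustration.

```python
def leverages(x):
    """Leverage of each observation in a straight-line fit of y on x.

    Uses the simple-regression formula h_i = 1/n + (x_i - mean)^2 / S_xx.
    """
    n = len(x)
    x_bar = sum(x) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    return [1.0 / n + (xi - x_bar) ** 2 / s_xx for xi in x]


# Made-up predictor values: nine bunched at the low end, one far away.
x = [1, 2, 2, 3, 3, 3, 4, 4, 5, 40]
h = leverages(x)

for xi, hi in zip(x, h):
    print(f"x = {xi:>2}  leverage = {hi:.3f}")
```

In a straight-line fit the leverages always sum to 2, the number of fitted coefficients, so a single value close to 1 means that one observation very nearly fixes the line by itself. High leverage flags a point for review; it does not show that the point is in error.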
The Washington State Department of Social and Health Services extrapolates
its audit results on the basis of a regression of over- and undercharges
against the dollar amount of the claim. Because the frequency of errors
depends on the amount of paperwork involved and not on the dollar amount
of the claim, no linear relationship exists between overcharges and the
amount of the claim. The slope of the regression line can vary widely from
sample to sample; the removal or addition of a very few samples to the
original audit can dramatically affect the amount claimed by the state in
overcharges.
Recommended is the delete-one approach, in which the regression
coefficients are recomputed repeatedly, deleting a single pair of
observations from the original data set each time. These calculations
provide confidence intervals for the estimates along with an estimate of
the sensitivity of the regression to outliers. When the number of data
pairs exceeds 100, a bootstrap might be used instead.
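For larger data sets, the bootstrap alternative might look like the following minimal sketch. It resamples data pairs with replacement and reports a percentile interval for the slope. The pair-resampling scheme, the percentile interval, the function names, and the simulated data are all assumptions made for illustration; the text does not specify which bootstrap variant is intended.

```python
import random


def fit_line(x, y):
    """Least-squares intercept and slope for a straight-line fit."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = s_xy / s_xx
    return y_bar - slope * x_bar, slope


def bootstrap_slopes(x, y, n_boot=1000, seed=0):
    """Refit the line on n_boot resamples of the (x, y) pairs."""
    rng = random.Random(seed)
    n = len(x)
    slopes = []
    while len(slopes) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        xs = [x[i] for i in idx]
        if max(xs) == min(xs):
            continue  # degenerate resample: slope is undefined
        ys = [y[i] for i in idx]
        slopes.append(fit_line(xs, ys)[1])
    return slopes


def percentile_interval(values, level=0.95):
    """Rough percentile interval taken from the sorted bootstrap values."""
    ordered = sorted(values)
    k = int((1 - level) / 2 * len(ordered))
    return ordered[k], ordered[len(ordered) - 1 - k]


# Simulated data (an assumption): 200 pairs scattered about y = 1 + 2x.
rng = random.Random(42)
x = [rng.uniform(0, 10) for _ in range(200)]
y = [1 + 2 * xi + rng.gauss(0, 1) for xi in x]

slopes = bootstrap_slopes(x, y)
low, high = percentile_interval(slopes)
print(f"approximate 95% interval for the slope: ({low:.3f}, {high:.3f})")
```

Resampling whole pairs, rather than residuals, keeps each X with its own Y and so does not rely on the straight-line model being correct.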
To get an estimate of the precision of the estimates and the sensitivity of
the regression equation to bad data, recompute the coefficients leaving out
a different data pair each time.
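The delete-one recomputation can be sketched as follows. The data are invented for illustration, with one pair lying far from the rest. Summarizing the refits with the jackknife standard error is an assumption about how the precision estimate would be formed, not a procedure stated in the text, and the function names are hypothetical.

```python
import math


def fit_line(x, y):
    """Least-squares intercept and slope for a straight-line fit."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = s_xy / s_xx
    return y_bar - slope * x_bar, slope


def delete_one(x, y):
    """Refit the line n times, leaving out a different pair each time."""
    return [fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
            for i in range(len(x))]


def jackknife_se(values):
    """Jackknife standard error computed from the delete-one estimates."""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt((n - 1) / n * sum((v - mean) ** 2 for v in values))


# Made-up data: nine pairs near y = 2x, plus one distant, discordant pair.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]
y = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1, 18.0, 20.0]

_, full_slope = fit_line(x, y)
slopes = [slope for _, slope in delete_one(x, y)]

# The pair whose removal moves the slope the most is the one to review.
shifts = [abs(s - full_slope) for s in slopes]
most_influential = shifts.index(max(shifts))

print(f"slope from all pairs: {full_slope:.3f}")
print(f"delete-one slopes range from {min(slopes):.3f} to {max(slopes):.3f}")
print(f"jackknife standard error of the slope: {jackknife_se(slopes):.3f}")
print(f"pair to review: index {most_influential}, x = {x[most_influential]}")
```

A wide spread among the delete-one slopes is the warning sign: it shows that the fitted line depends heavily on one or two pairs. Whether such a pair should be excluded remains a decision for the analyst.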
Convenience
More often than we would like to admit, the variables and data that go
into our models are chosen for us. We cannot directly measure the
variables we are interested in, so we make do with surrogates. But such
surrogates may or may not be directly related to the variables of
interest. Lack of funds and/or the necessary instrumentation limit the
range over which observations can be made. Our census overlooks the
homeless, the uncooperative, and the less luminous. (See, for example,
City of New York v. Dept of Commerce,² Disney [1976], and Bothun [1998,
Chapter 6].)
² 822 F. Supp. 906 (E.D.N.Y., 1993).