Figure 6.2: Dataset Scatter Plots (X vs Y)
The value of R² is 0.67, the intercept is 3.00 and the coefficient (slope) of x is
0.50. Yet, when one compares scatter plots of the four datasets, each set has its
own unique characteristics (see Figure 6.2).
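The shared summary statistics can be checked with a short sketch. It assumes the four datasets are the values published by Anscombe (1973), which are not reproduced in this text; the names `QUARTET` and `fit_line` are illustrative only. The fit is an ordinary least-squares line computed with the standard library.

```python
# Anscombe's quartet, using the values published in Anscombe (1973).
# Assumption: these are the datasets plotted in Figure 6.2.
X_ABC = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "a": (X_ABC, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "b": (X_ABC, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "c": (X_ABC, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "d": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
          [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def fit_line(xs, ys):
    """Least-squares fit of y = intercept + slope * x.

    Returns (slope, intercept, r_squared).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy ** 2 / (sxx * syy)  # squared Pearson correlation
    return slope, intercept, r_squared


results = {name: fit_line(xs, ys) for name, (xs, ys) in QUARTET.items()}
for name, (slope, intercept, r_squared) in results.items():
    print(f"({name}) slope={slope:.2f} intercept={intercept:.2f} R^2={r_squared:.2f}")
```

To two decimal places the four rows agree, which is why the summary statistics alone cannot distinguish the datasets.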
Dataset (a) appears to contain points scattered around an upward sloping
imaginary line. This is what one would typically expect to find in a dataset with
a linear relationship between variables. In dataset (b), the points appear to
fit a curve perfectly. This indicates that a linear model is not the best fit for
the data and that R² may be improved by fitting a curvilinear model. Dataset
(c) looks like it should generate a perfect linear fit except for the one outlying
point. If the outlier is removed, R² should be 1.0. Dataset (d) does not appear to
have a distribution that would support a fit to any kind of model. All but one of
the x values are 8. It is this single outlier that dictates the slope and intercept of
the fitted line. If, for example, it had a y value of 2.5 instead of 12.5, the slope
would be negative instead of positive as the fitted line would pass through this
point no matter where it appeared on the Y axis.
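The claims about datasets (c) and (d) can be illustrated with a minimal sketch. It again assumes the values published by Anscombe (1973); `fit_line` is a hypothetical helper, and the outlier in (c) is identified here simply as the point with the largest residual from the initial fit, which is one possible choice rather than a prescribed method.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = intercept + slope * x; returns (slope, intercept, r_squared)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x, sxy ** 2 / (sxx * syy)


# Dataset (c): values from Anscombe (1973), assumed to match Figure 6.2.
x_c = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y_c = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]

slope_c, intercept_c, r2_c_all = fit_line(x_c, y_c)
# Treat the point with the largest absolute residual as the outlier.
residuals = [abs(y - (intercept_c + slope_c * x)) for x, y in zip(x_c, y_c)]
outlier_index = residuals.index(max(residuals))
x_c_trimmed = [x for i, x in enumerate(x_c) if i != outlier_index]
y_c_trimmed = [y for i, y in enumerate(y_c) if i != outlier_index]
_, _, r2_c_trimmed = fit_line(x_c_trimmed, y_c_trimmed)
print(f"(c) R^2 with outlier: {r2_c_all:.3f}, without: {r2_c_trimmed:.5f}")

# Dataset (d): every x is 8 except the single point at x = 19.
x_d = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
y_d = [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]

slope_d, _, _ = fit_line(x_d, y_d)
# Move the outlier from y = 12.5 to y = 2.5, as in the example in the text.
y_d_moved = [2.5 if x == 19 else y for x, y in zip(x_d, y_d)]
slope_d_moved, intercept_d_moved, _ = fit_line(x_d, y_d_moved)
predicted_at_outlier = intercept_d_moved + slope_d_moved * 19
print(f"(d) slope original: {slope_d:.3f}, with outlier moved: {slope_d_moved:.3f}")
print(f"(d) fitted value at x=19 after the move: {predicted_at_outlier:.3f}")
```

Because dataset (d) has only two distinct x values, the least-squares line connects the mean of the x = 8 cluster to the lone point at x = 19, so the fitted value at x = 19 tracks wherever that point is placed.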
Anscombe's examples highlight the need to explore and understand the
nature of the data before choosing and applying a modeling technique. Dataset
(b) calls for a non-linear model; in dataset (c), outliers should be
removed before processing; and since dataset (d) does not suggest any kind of