been taken. When a reading of 98.6 was taken, nurses recording the
temperature frequently bumped it up or down a tenth of a degree to give
the appearance that it had actually been taken, while keeping it well within the
“normal” range.
If this data is to be used in a data mining application, depending on the
methodology applied, it may be prudent to reassign some of the 98.5 and 98.7
temperatures back to 98.6. Algorithms such as artificial neural networks will
pick up on this distribution anomaly, which can distort the descriptive abilities
of the resulting models.
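The reassignment described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `reassign_to_normal`, the `fraction` parameter, and the sample readings are all hypothetical, and the text does not say how many readings should be moved back. That share is left for the analyst to choose.

```python
import random

def reassign_to_normal(readings, fraction, seed=0):
    """Move a share of the 98.5 and 98.7 readings back to 98.6.

    `fraction` (0 to 1) is an analyst-chosen assumption, not a value
    taken from the text. `seed` makes the random choice repeatable.
    """
    rng = random.Random(seed)
    # Positions of readings one tenth of a degree away from 98.6.
    candidates = [i for i, t in enumerate(readings)
                  if round(t, 1) in (98.5, 98.7)]
    n_move = round(len(candidates) * fraction)
    chosen = set(rng.sample(candidates, n_move))
    # Chosen readings become 98.6; everything else is left untouched.
    return [98.6 if i in chosen else t for i, t in enumerate(readings)]

# Hypothetical readings showing a dip at 98.6 and excess at 98.5 and 98.7.
readings = [98.5] * 40 + [98.6] * 10 + [98.7] * 40 + [99.1] * 5 + [97.9] * 5
adjusted = reassign_to_normal(readings, fraction=0.5)
print(adjusted.count(98.6))  # 50
```

Half of the 80 neighboring readings are moved, so the count at 98.6 rises from 10 to 50 while readings outside the 98.5 to 98.7 band are unchanged.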
Pattern checks
With respect to patterns, the search is for outlying observations that do not
conform to the typical relationship between two or more attributes. The
correlation matrix coupled with the scatter plot is an effective combination
of viewers in the visual search for these outliers. Begin with an exploration of a
synthetic dataset.
View Table6.csv in a correlation matrix.
The correlation matrix is a good starting point in the identification of attribute
relationships. Using the correlation matrix, we locate pairs of correlated
attributes; then, in a synchronized scatter plot, we evaluate the relationship.
In Table6 we see just one candidate: var1 versus var2.
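For readers working outside a visual tool, the same screening step can be approximated in code. The sketch below computes Pearson correlations for every attribute pair and keeps those above a cutoff. The data is a hypothetical stand-in, not the contents of Table6.csv, and the 0.7 threshold and the name `correlated_pairs` are assumptions made for illustration.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def correlated_pairs(columns, threshold=0.7):
    """Return (name_a, name_b, r) for pairs with |r| at or above threshold."""
    names = list(columns)
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson(columns[a], columns[b])
            if abs(r) >= threshold:
                pairs.append((a, b, round(r, 3)))
    return pairs

# Hypothetical stand-in data: var2 follows var1, var3 is unrelated.
columns = {
    "var1": [1, 2, 3, 4, 5, 6, 7, 8],
    "var2": [2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1],
    "var3": [5, 1, 4, 8, 2, 7, 3, 6],
}
candidates = correlated_pairs(columns)
print(candidates)
```

Only the var1 and var2 pair passes the cutoff here, which mirrors the single candidate described above. Each candidate pair would then be examined in a scatter plot.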
View Table6.csv in a Scatter Plot.
In the correlation matrix, click on the var1-var2 cell to bring it up in the
scatter plot. (See Figure 3.5.)
Do you see the outlier? Notice that if one were to look independently at the
distributions of var1 and var2, the outlier would not be found. It is hidden within
the single-dimension distribution of each. The outlier only surfaces when the
two-dimensional plot pairing both attributes is examined.
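A small numeric sketch makes the same point without a plot. The data below is hypothetical (it is not Table6.csv): var2 tracks var1 closely except for one observation. Standardized scores on each attribute alone look unremarkable, while the standardized residual from a least-squares line through the pair singles out the odd point. Using regression residuals is one possible way to quantify what the scatter plot shows visually; it is an assumption of this example, not a method named in the text.

```python
import statistics

# Hypothetical synthetic data; the last observation (3, 8.0) breaks the pattern.
var1 = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 3]
var2 = [1.1, 1.9, 3.2, 3.9, 5.1, 6.0, 6.8, 8.1, 9.0, 9.9, 8.0]

def z_scores(values):
    """Standardize values using the sample mean and standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]

def residuals(x, y):
    """Residuals of y from the least-squares line fitted on x."""
    mx, my = statistics.mean(x), statistics.mean(y)
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    intercept = my - slope * mx
    return [b - (intercept + slope * a) for a, b in zip(x, y)]

# Single-attribute view: the largest absolute z-score on either attribute.
max_marginal = max(
    max(abs(z) for z in z_scores(var1)),
    max(abs(z) for z in z_scores(var2)),
)

# Two-attribute view: standardized residuals from the fitted line.
res_z = z_scores(residuals(var1, var2))
outlier_index = max(range(len(res_z)), key=lambda i: abs(res_z[i]))

print(round(max_marginal, 2), outlier_index, round(res_z[outlier_index], 2))
```

In this data no single-attribute score reaches 2 standard deviations, yet the last observation sits nearly 3 standard deviations from the fitted line. Both of its values fall inside the ordinary range of their attribute; only their combination is unusual.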
As a matter of practice, in conducting pattern checks, open the dataset in
question in both a correlation matrix and a synchronized scatter plot.
Systematically click from cell to cell in the correlation matrix on attribute
combinations with potentially meaningful correlations, examining the corresponding
scatter plot as you progress.
Visually locating outliers based on multi-attribute relationships is also
possible in the parallel plot.
View Table6.csv in a parallel plot.
 