The values of var2, var3, and var4 appear similar between the filter group
containing the bulk of the data and the group containing the 19 observations.
For var5, the difference is greater. This tutorial does not pursue the reason
for that difference, but a complete analysis would need to investigate the
relationship between var5 values and the 19 observations that fall outside
the expected distribution with respect to var1.
For an example of sub-population-based outliers, let's return to the iris
dataset.
View the iris.csv dataset, introduced in Chapter 2, in a parallel plot.
With respect to PetalLength and PetalWidth, a sub-population of observations
is visible. This sub-population was previously found to be explained by
Variety: they are all Setosa.
Create three filter groups - one for each variety.
One at a time, examine the three groups by hiding the other two.
When looking at just the Setosa variety, a potential outlier is visible near the
bottom of the SepalWidth axis. It should be further investigated.
This last example illustrates the need to separately evaluate sub-populations
within a dataset when those sub-populations are known a priori. In VisMiner,
the task is facilitated using the filter capabilities of the parallel plot.
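Outside VisMiner, the same per-group check can be approximated in a few lines
of code. The following is a minimal sketch, not the book's method: it swaps the
visual inspection for a simple interquartile-range (IQR) fence computed
separately within each group. The function names load_rows and
flag_group_outliers and the fence multiplier k are illustrative assumptions;
the column names Variety and SepalWidth are taken from the text and are assumed
to match the headers in iris.csv.

```python
import csv
import statistics
from collections import defaultdict


def load_rows(path):
    """Read a CSV file into a list of dicts, one per observation."""
    with open(path, newline="") as handle:
        return list(csv.DictReader(handle))


def flag_group_outliers(rows, group_col, value_col, k=1.5):
    """Return rows whose value lies outside the IQR fences of their own group.

    Hypothetical helper: k=1.5 is the conventional fence multiplier,
    chosen here as an assumption rather than taken from the text.
    """
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_col]].append(row)

    flagged = []
    for members in groups.values():
        values = [float(r[value_col]) for r in members]
        if len(values) < 4:
            continue  # too few points to estimate quartiles sensibly
        q1, _, q3 = statistics.quantiles(values, n=4)
        spread = q3 - q1
        low, high = q1 - k * spread, q3 + k * spread
        flagged.extend(
            r for r in members if not low <= float(r[value_col]) <= high
        )
    return flagged
```

A call such as flag_group_outliers(load_rows("iris.csv"), "Variety",
"SepalWidth") would list the candidates within each variety. Because the fences
are computed per group, a value that is ordinary for one variety can still be
flagged in another, which is the point of evaluating sub-populations
separately.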
Computed checks
When columns in a dataset are derived from other columns, the validity of
all involved columns can be checked by recomputing the derivation. When a
check fails, however, the identity of the offending column may not be obvious.
Player statistics released by Major League Baseball provide an example. The
dataset mlbBatters2011.csv contains the end-of-season batting statistics for
the 2011 season. The dataset as released was valid. However, in the copy used
here, one observation has been altered to illustrate the process.
Open the dataset mlbBatters2011.csv and view the Summary Statistics.
A domain expert familiar with baseball statistics would identify the following
derivations:
BattingAvg = Hits / AtBat
TotalBases = Hits + 3 × HomeRuns + 2 × Triples + Doubles
Slugging = TotalBases / AtBat
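These derivations translate directly into a row-by-row consistency check. The
sketch below is one possible way to do this outside VisMiner, under stated
assumptions: the function name failed_checks is hypothetical, the column names
are assumed to match the headers in mlbBatters2011.csv, and the tolerance of
0.001 is an assumed allowance for averages published to three decimal places.

```python
def failed_checks(row, tol=0.001):
    """Return the names of the derivations that one observation violates.

    row is a mapping of column name to value (for example, a dict from
    csv.DictReader). tol is an assumed rounding allowance, not a value
    given in the text.
    """
    hits = float(row["Hits"])
    at_bat = float(row["AtBat"])
    total_bases = float(row["TotalBases"])

    # TotalBases = Hits + 3 * HomeRuns + 2 * Triples + Doubles
    expected_total_bases = (
        hits
        + 3 * float(row["HomeRuns"])
        + 2 * float(row["Triples"])
        + float(row["Doubles"])
    )

    failures = []
    if abs(total_bases - expected_total_bases) > 1e-9:
        failures.append("TotalBases")
    if at_bat > 0:  # the ratios are undefined for a player with no at-bats
        if abs(float(row["BattingAvg"]) - hits / at_bat) > tol:
            failures.append("BattingAvg")
        if abs(float(row["Slugging"]) - total_bases / at_bat) > tol:
            failures.append("Slugging")
    return failures
```

A failed derivation implicates every column that appears in it, not only the
column on the left-hand side. If both the BattingAvg and TotalBases checks fail
for the same observation while Slugging passes, the shared input Hits becomes
the most likely suspect, although the check alone cannot prove it.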