To confirm the validity of column values involved in the computation of
TotalBases:
Right-click on mlbBatters2011.csv; select “Create derived dataset”.
Name the new dataset mlbCheck.
Click “Select All”.
Define a new column named “TotalBasesChk” with a formula of:
Hits + 3 * HomeRuns + 2 * Triples + Doubles
The new column uses the four inputs (Hits, HomeRuns, Triples, and
Doubles) to recompute TotalBases. To avoid a conflict with the existing
TotalBases column, it is named “TotalBasesChk”.
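The formula can be checked against the usual definition of total bases. The short derivation below is a sketch that assumes Hits counts singles, doubles, triples, and home runs together; the symbol Singles is introduced only for this illustration and is not a column named in the text.

```latex
\begin{align*}
\text{TotalBases} &= \text{Singles} + 2\,\text{Doubles} + 3\,\text{Triples} + 4\,\text{HomeRuns} \\
\text{Hits} &= \text{Singles} + \text{Doubles} + \text{Triples} + \text{HomeRuns} \\
\text{TotalBases} &= \text{Hits} + \text{Doubles} + 2\,\text{Triples} + 3\,\text{HomeRuns}
\end{align*}
```

Substituting Singles from the second line into the first gives the third line, which matches the TotalBasesChk formula term by term.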
Define a new column named “TotalBasesDiff” with a formula of:
TotalBases - (Hits + 3 * HomeRuns + 2 * Triples + Doubles)
If all entries are valid, the values of the existing TotalBases column and those
of the computed TotalBasesChk should be identical. Thus the difference
between the two (TotalBasesDiff) should be zero.
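Outside the tool, the same two derived columns can be sketched in a few lines of Python. This is only an illustration: the sample rows are invented (they are not taken from mlbBatters2011.csv), and the function name add_check_columns is hypothetical.

```python
# Hypothetical sample rows: the numbers are invented for illustration
# and are not taken from mlbBatters2011.csv.
sample_rows = [
    {"Hits": 150, "Doubles": 30, "Triples": 5, "HomeRuns": 20, "TotalBases": 250},
    {"Hits": 100, "Doubles": 20, "Triples": 2, "HomeRuns": 10, "TotalBases": 154},
    {"Hits": 120, "Doubles": 25, "Triples": 1, "HomeRuns": 15, "TotalBases": 190},
]


def add_check_columns(rows):
    """Return copies of the rows with TotalBasesChk and TotalBasesDiff added."""
    derived = []
    for row in rows:
        # Recompute total bases from the four input columns.
        check = (row["Hits"] + 3 * row["HomeRuns"]
                 + 2 * row["Triples"] + row["Doubles"])
        new_row = dict(row)  # copy so the input rows are left unchanged
        new_row["TotalBasesChk"] = check
        # Recorded value minus recomputed value; zero when the row is consistent.
        new_row["TotalBasesDiff"] = row["TotalBases"] - check
        derived.append(new_row)
    return derived


derived = add_check_columns(sample_rows)
for row in derived:
    print(row["TotalBases"], row["TotalBasesChk"], row["TotalBasesDiff"])
```

In this invented sample the first two rows are consistent, while the third records 190 against a recomputed 192, so its TotalBasesDiff is -2.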
Click “Create” to create the derived set.
View mlbCheck in a parallel plot.
Look at the connecting lines between TotalBases and TotalBasesChk. When
all are horizontal, it is a good indication that the recorded values are correct.
A stronger statement of correctness can be made when all connecting line
segments are horizontal, the minimum values match, and the maximum
values match.
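One way to see why the minimum and maximum matter is sketched below. It assumes, as is common for parallel plots but not stated in the text, that each axis is scaled independently from its own minimum to its own maximum. Under that assumption horizontal segments show only that two columns have the same relative positions. The function names and the sample values are hypothetical.

```python
from fractions import Fraction


def axis_positions(values):
    """Relative position of each value on an axis scaled from min to max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column collapses to a single point on its axis.
        return [Fraction(0) for _ in values]
    return [Fraction(v - lo, hi - lo) for v in values]


def plot_check(recorded, computed):
    """Report the three visual conditions for two integer columns."""
    return {
        # Equal relative positions correspond to horizontal connecting segments.
        "horizontal": axis_positions(recorded) == axis_positions(computed),
        "min_match": min(recorded) == min(computed),
        "max_match": max(recorded) == max(computed),
    }


# Invented values: the second column is the first shifted by 100, so the
# segments would look horizontal although no pair of values is equal.
shifted = plot_check([10, 20, 30], [110, 120, 130])
identical = plot_check([10, 20, 30], [10, 20, 30])
print(shifted)
print(identical)
```

When all three conditions hold, both axes share the same scale, so equal positions imply equal values.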
If all the data is valid, the difference between the recorded value and the
computed value should be zero. In the plot, focus attention on the column
TotalBasesDiff: the difference between the recorded value (TotalBases) and
the computed value (TotalBasesChk). To reduce clutter, all but the relevant
columns (Doubles, Hits, HomeRuns, Triples, TotalBases, and TotalBasesChk)
may be hidden.
Looking at the column TotalBasesDiff, all but one of the differences are
zero. In the parallel plot, that lone value stands out as an outlier. The difference
cannot be explained by rounding, since the computation includes integer
addition and multiplication only. There is definitely an error in at least one
of the involved column values.
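The row behind such an outlier can also be located programmatically. The sketch below reads CSV text and lists every row whose difference is non-zero. The CSV content is invented for illustration, and the Player column and the function name nonzero_differences are assumptions; only the five numeric column names come from the text.

```python
import csv
import io

# Invented CSV content; the Player column is a hypothetical identifier.
CSV_TEXT = """Player,Hits,Doubles,Triples,HomeRuns,TotalBases
Batter A,150,30,5,20,250
Batter B,100,20,2,10,154
Batter C,120,25,1,15,190
"""


def nonzero_differences(csv_file):
    """Return (player, difference) pairs for rows that fail the check."""
    outliers = []
    for row in csv.DictReader(csv_file):
        # Integer arithmetic only, so rounding cannot produce a non-zero result.
        check = (int(row["Hits"]) + 3 * int(row["HomeRuns"])
                 + 2 * int(row["Triples"]) + int(row["Doubles"]))
        diff = int(row["TotalBases"]) - check
        if diff != 0:
            outliers.append((row["Player"], diff))
    return outliers


outliers = nonzero_differences(io.StringIO(CSV_TEXT))
print(outliers)
```

A non-zero entry shows only that the row is inconsistent. It does not say which of the five columns holds the wrong value.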