V. Examples
As an example of the Q score, the Khan data set of Fig. 13.8 has a Q
score of 0.619. Beyond the Khan SRBCT data, we further evaluated these
methods using several publicly available data sets: Spam [16], Cars [15],
Htong [15], and Transfusion [17]. These results are summarized in Table 1.
As with the Khan SRBCT data, we selected the top dimensions per
class using the mean ratio, and calculated Q using the Voronoi
partition of the RadViz image. For data sets with a small number
of classes, we increased the maximum number of dimensions
selected. The Cars data set (Fig. 13.9) has a Q score of -0.577 when using
the CDL with AAM placement. This is quite low on our scale; however,
it is a 33% improvement over the best RadViz image in which all
dimensions are used and placed uniformly on the circle (Fig. 13.10),
which has a Q of -0.862. Thus, selecting dimensions and arranging them
by class with the CDL+AAM has led to a sizeable
improvement. A Q of -0.577 is the highest score we have seen to date
for any RadViz image of this data set.
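The 33% figure can be checked from the two Q scores quoted above. The text does not state how the percentage was computed; the short sketch below assumes it is the change in Q relative to the magnitude of the baseline score, and the variable names are ours.

```python
# Q scores quoted in the text for the Cars data set.
q_uniform = -0.862   # all dimensions, uniform placement
q_selected = -0.577  # selected dimensions, CDL with AAM placement

# Assumption: improvement is measured relative to the magnitude
# of the baseline (uniform placement) score.
improvement = (q_selected - q_uniform) / abs(q_uniform)
print(f"{improvement:.1%}")
```

Under this assumption the ratio rounds to 33%, consistent with the figure reported above.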
The results for this data set match those of Bertini and Santucci [8].
Using parallel coordinates in their example, they note the “correlation that
exists between number of cylinders/weight and weight/horsepower.” We
obtained similar results. We see a cluster of data images being pulled into
the Voronoi regions corresponding to horsepower, cylinders, and weight.
The mean ratio further informs us, by its inherent class distinction, that
this relationship is strongest for cars manufactured in the USA. That
observation is repeatedly confirmed by many others who have examined
this popular data set. We also observe that, while the mean ratio tracks
common observations about which dimensions are most associated with
European and Japanese cars, the dearth of points being pulled into the
Voronoi regions associated with these classes confirms other comments
about this data set: the clusters associated with European and Japanese
cars are difficult to distinguish.
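The pull of a point toward a dimension's Voronoi region can be illustrated with a minimal RadViz sketch. It assumes the standard RadViz mapping, in which the image of a record is the average of the dimension anchors weighted by the record's values scaled to [0, 1], and it takes the Voronoi region of an image to be that of its nearest anchor. The function names, the uniform anchor placement, and the sample record are hypothetical; the CDL and AAM placement and the Q score itself are not reproduced here.

```python
import math

def anchor_positions(n):
    """Place n dimension anchors uniformly on the unit circle (a simplification)."""
    return [(math.cos(2 * math.pi * j / n), math.sin(2 * math.pi * j / n))
            for j in range(n)]

def radviz_point(values, anchors):
    """Map one record, with values already scaled to [0, 1], into the circle.

    The image is the average of the anchor positions weighted by the values.
    """
    total = sum(values)
    if total == 0:
        return (0.0, 0.0)  # a record with all-zero values sits at the center
    x = sum(v * ax for v, (ax, _) in zip(values, anchors)) / total
    y = sum(v * ay for v, (_, ay) in zip(values, anchors)) / total
    return (x, y)

def voronoi_region(point, anchors):
    """Index of the nearest anchor, i.e. the Voronoi cell that contains point."""
    return min(range(len(anchors)), key=lambda j: math.dist(point, anchors[j]))

# Hypothetical scaled record; the values are made up for illustration.
names = ["horsepower", "cylinders", "weight", "mpg"]
anchors = anchor_positions(len(names))
record = [0.9, 0.3, 0.2, 0.1]

image = radviz_point(record, anchors)
region = voronoi_region(image, anchors)
print(names[region])
```

In this toy record the largest value belongs to the first dimension, so the image is drawn toward that anchor and falls in its region.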
The Spam data set [16] is a more contemporary data set. The 57
dimensions here are various word and character frequencies. Each record
corresponds to an email message. This data is used as a basis for
classifying the message as “Spam” or “Not Spam.” As labels we chose, for
clarity of example, the dimension index. The Spam example (Fig. 13.11)
with Q = 0.380 is less ambiguous in demonstrating the strength of the pull
of the dimensions selected as most significant to each of the two classes.
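Selecting the dimensions most significant to each class can be sketched as follows. This section does not define the mean ratio, so the code assumes one plausible reading: for each dimension, the mean over the records of a class divided by the mean over all remaining records. The function names and the toy frequencies are hypothetical and are not taken from the Spam data set.

```python
from statistics import mean

def mean_ratio(records, labels, target):
    """Per-dimension ratio of the target-class mean to the mean of all other records.

    Assumed definition, for illustration only.
    """
    inside = [r for r, lab in zip(records, labels) if lab == target]
    outside = [r for r, lab in zip(records, labels) if lab != target]
    ratios = []
    for j in range(len(records[0])):
        m_in = mean(r[j] for r in inside)
        m_out = mean(r[j] for r in outside)
        # Guard against division by zero when a dimension is absent outside the class.
        ratios.append(m_in / m_out if m_out > 0 else float("inf"))
    return ratios

def top_dimensions(records, labels, target, k):
    """Indices of the k dimensions with the largest mean ratio for the target class."""
    ratios = mean_ratio(records, labels, target)
    return sorted(range(len(ratios)), key=lambda j: ratios[j], reverse=True)[:k]

# Toy frequency data with three dimensions, labeled by dimension index.
records = [
    [0.8, 0.1, 0.3],
    [0.6, 0.2, 0.3],
    [0.1, 0.7, 0.3],
    [0.1, 0.5, 0.3],
]
labels = ["Spam", "Spam", "Not Spam", "Not Spam"]

spam_dims = top_dimensions(records, labels, "Spam", 1)
not_spam_dims = top_dimensions(records, labels, "Not Spam", 1)
print(spam_dims, not_spam_dims)
```

With two classes, each class ranks a different dimension first here, which is the kind of per-class separation the selected dimensions are meant to expose.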