14.2.4 Hundreds of Variables
One frequently asked question is “how many variables can be handled with
∥-coords?”
The largest dataset that I have effectively worked with had about variables and
, data entries. Using various techniques developed over the years and the automatic
classifier discussed in the next section, it is possible to handle much larger
datasets. Still, the relevant admonition is:
be sceptical about the quality of datasets with large numbers of variables.
When hundreds or more variables are involved, it is unlikely that there are many
people around who have a good feel for what is happening (as confirmed by my own
experience). A case in point is the dataset shown in Fig. . , consisting of instrumentation
measurements of a complex process. An immediate observation was that
many of the instruments recorded throughout the period that the measurements
were taken, something which had not been noticed previously. Another curiosity
was the series of repetitive patterns on the right. It turns out that several variables were
measured in more than one location using different names. When the dataset was
cleaned up (removing superfluous information), it was initially reduced to about
variables, as shown in Fig. . , and eventually to about that contained the information
of real interest. By my tracking, the phenomenon of repetitive measurements
is widespread, with at least % of the variables in large datasets being duplicates or
near-duplicates, possibly due to instrumental nonuniformities, as suggested by the
two-variable scatterplot in Fig. . . Here, the repetitive observations were easily
detected due to the fortuitous variable permutation in the display. Since repetitive
Figure . . Manufacturing process measurements: variables
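The duplicate and near-duplicate variables discussed above were found visually. As a complementary illustration, the following minimal sketch screens for such pairs numerically by flagging variables whose pairwise Pearson correlation is close to ±1. This is a swapped-in technique, not the method used in the text. The function names, the 0.99 threshold and the toy data are all hypothetical, chosen only for illustration.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences.

    Returns 0.0 when either sequence is constant, since the
    correlation is undefined in that case.
    """
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    norm = sqrt(sum(d * d for d in dx) * sum(d * d for d in dy))
    return sum(a * b for a, b in zip(dx, dy)) / norm if norm else 0.0


def near_duplicate_pairs(columns, threshold=0.99):
    """Return (name_a, name_b, r) for every pair with |r| >= threshold.

    `columns` maps a variable name to its list of measurements.
    The threshold is an illustrative assumption, not a value from the text.
    """
    pairs = []
    for a, b in combinations(columns, 2):
        r = pearson(columns[a], columns[b])
        if abs(r) >= threshold:
            pairs.append((a, b, r))
    return pairs


# Hypothetical toy data: "temp_b" re-measures "temp_a" under a different
# name with a small constant offset; "flow" is an unrelated variable.
data = {
    "temp_a": [20.0, 21.5, 23.0, 22.0, 24.5],
    "temp_b": [20.1, 21.6, 23.1, 22.1, 24.6],
    "flow": [5.0, 3.0, 4.5, 6.0, 2.0],
}

for name_a, name_b, r in near_duplicate_pairs(data):
    print(f"{name_a} ~ {name_b} (r = {r:.4f})")
```

A correlation screen of this kind only catches linearly related duplicates. Pairs that differ through instrumental nonuniformities may fall below the threshold and would still need visual inspection.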