and even more obscure “moments” of the attribute, such as the skew
and kurtosis on continuous numerical data like age or income. Other
statistics, such as the most frequently occurring value, or mode, apply
to discrete or string data, such as marital status or satisfaction level.
Because each of these is computed on a single attribute, they are
termed univariate statistics. The first release of JDM put in place a
framework for addressing this type of statistic, as discussed in
Chapter 7.
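To make the univariate statistics named above concrete, the following is a minimal sketch in plain Java. It does not use the JDM API: the class `UnivariateStats` and its method names are hypothetical, chosen only for illustration. The moments use the population form (dividing by n), which is one of several common conventions, and ties for the mode are broken in favor of the value seen first.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper class for illustration; not part of the JDM standard.
final class UnivariateStats {

    // Arithmetic mean of a numeric attribute.
    static double mean(double[] x) {
        double sum = 0.0;
        for (double v : x) sum += v;
        return sum / x.length;
    }

    // k-th central moment: the average of (value - mean)^k, dividing by n.
    static double centralMoment(double[] x, int k) {
        double m = mean(x);
        double sum = 0.0;
        for (double v : x) sum += Math.pow(v - m, k);
        return sum / x.length;
    }

    // Skewness: third central moment divided by the variance raised to 3/2.
    // Returns NaN for a constant attribute, whose variance is zero.
    static double skewness(double[] x) {
        return centralMoment(x, 3) / Math.pow(centralMoment(x, 2), 1.5);
    }

    // Kurtosis: fourth central moment divided by the squared variance.
    // A normal distribution has a kurtosis of 3 under this definition.
    static double kurtosis(double[] x) {
        double variance = centralMoment(x, 2);
        return centralMoment(x, 4) / (variance * variance);
    }

    // Mode of a discrete or string attribute; on ties, the first value seen wins.
    static String mode(List<String> values) {
        Map<String, Integer> counts = new HashMap<>();
        for (String v : values) counts.merge(v, 1, Integer::sum);
        String best = null;
        int bestCount = 0;
        for (String v : values) {
            int c = counts.get(v);
            if (c > bestCount) {
                best = v;
                bestCount = c;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Made-up sample values, used only to exercise the methods.
        double[] income = {28_000, 35_000, 41_000, 52_000, 150_000};
        List<String> maritalStatus = List.of("married", "single", "married", "divorced");
        System.out.printf("skewness=%.3f kurtosis=%.3f%n", skewness(income), kurtosis(income));
        System.out.println("mode=" + mode(maritalStatus));
    }
}
```

The numeric methods apply only to continuous attributes such as age or income, while `mode` takes strings, mirroring the split between numerical and discrete data described above.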
In JDM 2.0, the expert group is expanding the framework to
include statistical calculations involving pairs of attributes, termed
bivariate or, more generally, multivariate statistics. As with
transformations, users should be able to specify many combinations of
tests succinctly, across many attributes. For example, users can specify
a set of predictor (independent) attributes and one or more target
(dependent) attributes, and the system computes the requested
statistical functions on the cross-product of independent and dependent
attributes.
As with other types of objects in JDM, the statistics framework is
intended to be extensible to include statistical functions not
specified in the standard. Some of the multivariate statistical
functions under consideration include F and T statistics,
Kolmogorov-Smirnov, Mann-Whitney, and one-way ANOVA. These types of
statistical functions can help users understand relationships in data
prior to model building or when evaluating model quality.
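The idea of evaluating a statistical function on the cross-product of predictor and target attributes can be sketched in plain Java as follows. This is not the JDM 2.0 interface: `PairwiseAnova`, `fStatistic`, and `crossProduct` are hypothetical names, and one-way ANOVA is used only because it is among the functions listed above. The sketch assumes numeric predictors and categorical targets of equal length, with at least two groups and some within-group variation.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper class for illustration; not part of the JDM standard.
final class PairwiseAnova {

    // One-way ANOVA F statistic of a numeric predictor grouped by a categorical target.
    // Yields NaN or Infinity if there is only one group or no within-group variation.
    static double fStatistic(double[] x, String[] groups) {
        Map<String, List<Double>> byGroup = new LinkedHashMap<>();
        double grandSum = 0.0;
        for (int i = 0; i < x.length; i++) {
            byGroup.computeIfAbsent(groups[i], g -> new ArrayList<>()).add(x[i]);
            grandSum += x[i];
        }
        int n = x.length;
        int k = byGroup.size();
        double grandMean = grandSum / n;

        double ssBetween = 0.0; // variation of group means around the grand mean
        double ssWithin = 0.0;  // variation of values around their own group mean
        for (List<Double> group : byGroup.values()) {
            double groupSum = 0.0;
            for (double v : group) groupSum += v;
            double groupMean = groupSum / group.size();
            ssBetween += group.size() * (groupMean - grandMean) * (groupMean - grandMean);
            for (double v : group) ssWithin += (v - groupMean) * (v - groupMean);
        }
        // Ratio of mean squares, with k - 1 and n - k degrees of freedom.
        return (ssBetween / (k - 1)) / (ssWithin / (n - k));
    }

    // Evaluates the statistic on every (predictor, target) pair.
    static Map<String, Double> crossProduct(Map<String, double[]> predictors,
                                            Map<String, String[]> targets) {
        Map<String, Double> result = new LinkedHashMap<>();
        for (Map.Entry<String, double[]> p : predictors.entrySet()) {
            for (Map.Entry<String, String[]> t : targets.entrySet()) {
                result.put(p.getKey() + " x " + t.getKey(),
                           fStatistic(p.getValue(), t.getValue()));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up attributes and values, used only to exercise the methods.
        Map<String, double[]> predictors = new LinkedHashMap<>();
        predictors.put("age", new double[] {23, 31, 38, 45, 52, 60});
        predictors.put("income", new double[] {30, 42, 39, 75, 58, 91});

        Map<String, String[]> targets = new LinkedHashMap<>();
        targets.put("churn", new String[] {"no", "no", "no", "yes", "yes", "yes"});
        targets.put("segment", new String[] {"a", "b", "a", "b", "a", "b"});

        crossProduct(predictors, targets)
            .forEach((pair, f) -> System.out.printf("%s: F=%.3f%n", pair, f));
    }
}
```

With two predictors and two targets the caller names four attributes once and receives four results, which is the succinctness the cross-product specification is meant to provide.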
In JDM, univariate and multivariate statistics can also be produced
as a by-product of the model building process. By expanding
the JDM AttributeStatisticsSet interface, vendors can immediately
associate multivariate statistics with models, as is currently possible
for univariate statistics.
If a data miner wants to predict the probability with which
customers will purchase each of 1,200 products, there are a few
approaches. The data miner could define a classification problem
with a target attribute containing the product purchased. However,
with 1,200 possible target values, building such a model may require
data on a very large number of customers. If insufficient data exists,
model quality could be poor. Moreover, this type of model is unlikely
to reflect cases where a given customer purchases multiple products.