Preview of Java Data Mining 2.0 - Java Data Mining: Strategy, Standard, and Practice


and even more obscure “moments” of the attribute, such as the skew and kurtosis on continuous numerical data like age or income. Other statistics, such as the most frequently occurring value, or mode, apply to discrete or string data, such as marital status or satisfaction level. Since these are computed on individual attributes, they are termed univariate statistics, that is, they operate on a single attribute. The first release of JDM put in place a framework for addressing this type of statistics, as discussed in Chapter 7.
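
The univariate statistics named above can be sketched in plain Java. This is only an illustration of the calculations, not the JDM API: the class UnivariateStats and its method names are hypothetical, and the moments use the simple population form (dividing by n), which is an assumption rather than anything the standard prescribes.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper for illustration only; not part of JDM.
final class UnivariateStats {

    // Arithmetic mean of a continuous attribute.
    static double mean(double[] x) {
        double sum = 0.0;
        for (double v : x) {
            sum += v;
        }
        return sum / x.length;
    }

    // k-th central moment, population form (divides by n).
    static double centralMoment(double[] x, int k) {
        double m = mean(x);
        double sum = 0.0;
        for (double v : x) {
            sum += Math.pow(v - m, k);
        }
        return sum / x.length;
    }

    // Skewness: third central moment scaled by variance^(3/2).
    // Returns NaN when all values are equal (zero variance).
    static double skewness(double[] x) {
        double m2 = centralMoment(x, 2);
        return centralMoment(x, 3) / Math.pow(m2, 1.5);
    }

    // Excess kurtosis: fourth central moment over variance^2, minus 3.
    static double excessKurtosis(double[] x) {
        double m2 = centralMoment(x, 2);
        return centralMoment(x, 4) / (m2 * m2) - 3.0;
    }

    // Mode of a discrete or string attribute; on ties, the value that
    // first reaches the highest count wins.
    static String mode(String[] values) {
        Map<String, Integer> counts = new HashMap<>();
        String best = null;
        int bestCount = 0;
        for (String v : values) {
            int c = counts.merge(v, 1, Integer::sum);
            if (c > bestCount) {
                bestCount = c;
                best = v;
            }
        }
        return best;
    }
}
```

For example, UnivariateStats.skewness applied to an age column and UnivariateStats.mode applied to a marital status column each look at one attribute in isolation, which is what makes them univariate.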

In JDM 2.0, the expert group is expanding the framework to include statistical calculations involving pairs of attributes, termed bivariate or more generally multivariate statistics. As with transformations, users should be able to specify many combinations of tests succinctly, across many attributes. For example, users can specify a set of predictor (independent) attributes and one or more target (dependent) attributes, and the system computes the requested statistical functions on the cross-product of independent and dependent attributes.
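
The cross-product idea can be illustrated with a short Java sketch. It is not the JDM 2.0 interface: the class CrossProductStats, the map-based representation of attributes, and the choice of Pearson correlation as the bivariate function are all assumptions made for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration only; not part of JDM.
final class CrossProductStats {

    // Pearson correlation between two equally long numeric attributes.
    static double pearson(double[] x, double[] y) {
        int n = x.length;
        double mx = 0.0;
        double my = 0.0;
        for (int i = 0; i < n; i++) {
            mx += x[i];
            my += y[i];
        }
        mx /= n;
        my /= n;
        double sxy = 0.0;
        double sxx = 0.0;
        double syy = 0.0;
        for (int i = 0; i < n; i++) {
            double dx = x[i] - mx;
            double dy = y[i] - my;
            sxy += dx * dy;
            sxx += dx * dx;
            syy += dy * dy;
        }
        return sxy / Math.sqrt(sxx * syy);
    }

    // Applies the statistic to every (predictor, target) pair, so the
    // caller names the two attribute sets once instead of listing pairs.
    static Map<String, Double> correlateAll(Map<String, double[]> predictors,
                                            Map<String, double[]> targets) {
        Map<String, Double> result = new LinkedHashMap<>();
        for (Map.Entry<String, double[]> p : predictors.entrySet()) {
            for (Map.Entry<String, double[]> t : targets.entrySet()) {
                String pair = p.getKey() + " x " + t.getKey();
                result.put(pair, pearson(p.getValue(), t.getValue()));
            }
        }
        return result;
    }
}
```

With two predictors and one target, correlateAll returns two results; with ten predictors and three targets it would return thirty, without the user enumerating any pair explicitly.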

As with other types of objects in JDM, the statistics framework is intended to be extensible to include statistical functions not specified in the standard. Some of the multivariate statistical functions under consideration include F and T statistics, Kolmogorov-Smirnov, Mann-Whitney, and one-way ANOVA. These types of statistical functions can help users understand relationships in data prior to model building or when evaluating model quality.
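
As one concrete instance of a T statistic, the sketch below computes Welch's two-sample t statistic, which compares the mean of a numeric attribute between two groups without assuming equal variances. Choosing Welch's form is an assumption for the example, since the text does not say which variant is intended, and the class TwoSampleT is hypothetical rather than part of JDM.

```java
// Hypothetical helper for illustration only; not part of JDM.
final class TwoSampleT {

    static double mean(double[] x) {
        double sum = 0.0;
        for (double v : x) {
            sum += v;
        }
        return sum / x.length;
    }

    // Unbiased sample variance (divides by n - 1); needs at least 2 values.
    static double sampleVariance(double[] x) {
        double m = mean(x);
        double sum = 0.0;
        for (double v : x) {
            sum += (v - m) * (v - m);
        }
        return sum / (x.length - 1);
    }

    // Welch's t: difference of means over the combined standard error.
    static double welchT(double[] a, double[] b) {
        double standardError = Math.sqrt(sampleVariance(a) / a.length
                + sampleVariance(b) / b.length);
        return (mean(a) - mean(b)) / standardError;
    }
}
```

A data miner might call welchT on, say, the income values of customers who purchased versus those who did not; a statistic far from zero suggests the attribute separates the two groups and may be a useful predictor.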

In JDM, univariate and multivariate statistics can also be produced as a by-product of the model building process. By expanding the JDM AttributeStatisticsSet interface, vendors can immediately associate multivariate statistics with models as is currently possible for univariate statistics.

18.6 Multi-target Models

If a data miner wants to predict the probability with which customers will purchase each of 1,200 products, there are a few approaches. The data miner could define a classification problem with a target attribute containing the product purchased. However, building such models may require a very large number of customers. If insufficient data exists, model quality could be poor. Moreover, this type of model likely will not reflect when a given customer purchases multiple products.
