The Impact of JDM on IT Infrastructure - Java Data Mining: Strategy, Standard, and Practice


We distinguish among these DME architectures because the first and second cases do not require extra data storage. The third case, involving staging data, does of course require additional disk space as data is replicated for the purposes of mining.[1] So the question is, how much?

The correct answer is, “It depends….” Consider an example involving customer relationship management (CRM). Today, it is not uncommon to find build datasets with between 50 and 5,000 attributes, the median value being around 200; the number of cases is often between 10,000 and 1,000,000, the median value being around 200,000. These figures describe only the build datasets; the volumes for apply datasets are much larger. As cited earlier, a large organization, such as those found in the telecommunications or banking industries, could have a customer database covering 100 million customers. Corporate architectures that stage data therefore need large disk capacity to store not only the data to be scored, but also the apply results.

Storage needs can also depend on specific DME requirements for dataset format. For example, a specific association rule algorithm could require the data in a multirecord case representation. In this situation, a single record representation for a million records with 100 attributes of 20 bytes each requires 2 gigabytes of space, but the corresponding multirecord case representation requires 7 gigabytes, assuming the case ID is 20 bytes and the attribute name is 30 bytes. The difference arises because each attribute value becomes its own row that repeats the case ID and the attribute name: 100 million rows of 70 bytes each.
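The arithmetic behind these two figures can be sketched in a few lines of Java. This is a rough illustration only: the class and method names are hypothetical and are not part of the JDM API, a gigabyte is taken as 10^9 bytes, and the sketch assumes fixed-width values, a value present for every attribute of every case, and no per-row overhead, indexes, or compression.

```java
// Rough storage estimate for the two case representations.
// All names here are illustrative assumptions, not part of the JDM API.
final class StorageEstimate {

    // Single-record case: one row per case, one fixed-width value per attribute.
    static long singleRecordBytes(long cases, long attributes, long valueBytes) {
        return cases * attributes * valueBytes;
    }

    // Multirecord case: one row per (case, attribute) pair; every row repeats
    // the case ID and the attribute name next to the value.
    static long multiRecordBytes(long cases, long attributes, long valueBytes,
                                 long caseIdBytes, long attributeNameBytes) {
        long rows = cases * attributes;
        return rows * (caseIdBytes + attributeNameBytes + valueBytes);
    }

    public static void main(String[] args) {
        long cases = 1_000_000L;   // a million records
        long attributes = 100L;    // 100 attributes per record
        long valueBytes = 20L;     // 20 bytes per attribute value

        long single = singleRecordBytes(cases, attributes, valueBytes);
        long multi = multiRecordBytes(cases, attributes, valueBytes, 20L, 30L);

        System.out.println("single-record bytes: " + single);
        System.out.println("multirecord bytes:   " + multi);
    }
}
```

With the values from the example, the two methods return 2,000,000,000 and 7,000,000,000 bytes, matching the 2 and 7 gigabytes quoted in the text. A real DME or database will add its own storage overhead, so these numbers are best read as lower bounds.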

Modeling in the large also requires saving the models, the build settings, and all the objects declared as persistent by the DME implementation. In most cases, a model should be (much) smaller than its corresponding build dataset, because a model should extract knowledge from and find relationships in data. This is especially true when looking at the model details (i.e., the minimum information needed to apply a model), such as the tree nodes for a decision tree or the coefficients of a linear regression. Other models, such as a support vector machine (SVM), can take relatively more storage because they maintain the support vectors; similarly, an association rules model can contain hundreds of thousands of rules. In addition, some implementations keep not only the minimum model

[1] Note that when considering a tool's performance, the time required to export data from the database and import results back to the database must be included in the overall model build and apply time.
