The Impact of JDM on IT Infrastructure - Java Data Mining: Strategy, Standard, and Practice

details, but also information such as the descriptive statistics of the input attributes for display purposes. This is the case for some clustering implementations, which are able to show the profiles for all input attributes for all clusters. Other models may maintain cross-statistics with the targets or all the input attributes.

As businesses try to optimize performance, the number of predictive and descriptive models resulting from data mining techniques is increasing dramatically. For example, large telecommunications operators are now building more than 1,500 models a year for their CRM activity alone. They will later extend this methodology to risk management, which will further increase the number of models built per year. This affects the disk storage needed to persist not only the models, but also their build settings, apply settings, and various result objects. When building credit risk scoring models, for example, it is not uncommon to have quality processes in place that require these settings to be kept: it may be mandatory that a specific model's information can be retrieved, such as when the model was produced, what dataset was used to produce it, and what the test results were. Often, a business wants or needs to keep track of the different model versions. Because JDM does not include model versioning, this is currently an application-level operation. As such, users will likely design naming conventions for their models, tasks, datasets, and all other objects saved in the mining object repository (MOR) in order to manage the versioning of persisted objects externally. Consult your DME vendor for specifics concerning storage requirements for different models.
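One way such a naming convention might look is sketched below. This is only an illustration under stated assumptions: the `VersionedNames` class, its method names, and the `<BASE>_V<nnn>` pattern are hypothetical and are not part of JDM or of any vendor's DME. The helper only builds and parses strings; saving the object under the resulting name is left to the application.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * Hypothetical helper (not part of JDM): encodes a version number in the
 * name under which a model, task, or dataset is saved in the MOR.
 * Assumed convention: <BASE>_V<nnn>, for example CHURN_MODEL_V003.
 */
final class VersionedNames {
    // Base name, then "_V", then at least three digits at the end of the name.
    private static final Pattern VERSIONED = Pattern.compile("^(.+)_V(\\d{3,})$");

    private VersionedNames() {}

    /** Builds a versioned name, zero-padding the version to three digits. */
    static String withVersion(String baseName, int version) {
        if (baseName == null || baseName.isEmpty()) {
            throw new IllegalArgumentException("baseName must not be empty");
        }
        if (version < 1) {
            throw new IllegalArgumentException("version must be >= 1");
        }
        return String.format(Locale.ROOT, "%s_V%03d", baseName, version);
    }

    /** Returns the version encoded in the name, or 0 if the name is unversioned. */
    static int versionOf(String objectName) {
        Matcher m = VERSIONED.matcher(objectName);
        return m.matches() ? Integer.parseInt(m.group(2)) : 0;
    }

    /** Returns the name to use for the next version of the same object. */
    static String nextVersion(String objectName) {
        Matcher m = VERSIONED.matcher(objectName);
        if (!m.matches()) {
            return withVersion(objectName, 1); // unversioned name starts at version 1
        }
        return withVersion(m.group(1), Integer.parseInt(m.group(2)) + 1);
    }
}
```

With this convention, an application would call `VersionedNames.nextVersion(...)` on the latest saved name before persisting a rebuilt model, so that earlier versions and their settings remain retrievable under their own names.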

15.4 Data Access

Data access is needed in three major phases of a model's life cycle: (1) the build phase, where the model is created on a build dataset; (2) the test phase, where metrics are computed to understand model quality; and (3) the apply phase, where scores or forecasts are written back to storage for later use. In these phases, data must be accessed from the main repositories: data warehouses, operational systems, and so on. When the DME is an independent server, there must be data transfer between the data repository and the DME, which increases network traffic. If IT management does not allow mining the data in the actual data warehouses or operational data stores, in-database and independent-server (direct data access) DME
