The Impact of JDM on IT Infrastructure - Java Data Mining: Strategy, Standard, and Practice


15.4.2 Data Access for Apply and Test

As noted earlier, a model can be applied to very large datasets, especially in terms of number of cases. For this reason, it can be useful to export the model from the DME in a language supported by the scoring engine (e.g., a database). For example, IBM offers PMML interpreters for several databases; Teradata also offers a PMML interpreter; and KXEN can export models either as SQL or as User Defined Functions for all major databases. In all these situations, the apply phase runs within the execution environment (e.g., the database) without external data transfer. This consumes computing power on the database servers, but it has no impact on network traffic.
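To make the idea of exporting a model as SQL concrete, the following is a minimal sketch, assuming a simple linear model whose score is an intercept plus a weighted sum of numeric columns. The class name SqlScoreExport, the method toScoringSql, and the table and column names are hypothetical and chosen for illustration only; this is not the export format of any vendor named above, nor part of the JDM API.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: turns linear-model coefficients into a SQL scoring query,
// so that the apply phase can run inside the database.
class SqlScoreExport {

    // Accept only plain SQL identifiers so the generated text cannot carry injected SQL.
    private static String checkIdentifier(String name) {
        if (name == null || !name.matches("[A-Za-z_][A-Za-z0-9_]*")) {
            throw new IllegalArgumentException("Not a plain SQL identifier: " + name);
        }
        return name;
    }

    // Builds "SELECT key, intercept + (w1 * col1) + ... AS score FROM table".
    static String toScoringSql(String table, String keyColumn, double intercept,
                               Map<String, Double> coefficients) {
        StringBuilder expr = new StringBuilder(Double.toString(intercept));
        for (Map.Entry<String, Double> e : coefficients.entrySet()) {
            expr.append(" + (")
                .append(e.getValue())
                .append(" * ")
                .append(checkIdentifier(e.getKey()))
                .append(")");
        }
        return "SELECT " + checkIdentifier(keyColumn) + ", " + expr
                + " AS score FROM " + checkIdentifier(table);
    }

    public static void main(String[] args) {
        // Assumed example coefficients; LinkedHashMap keeps the column order stable.
        Map<String, Double> coefficients = new LinkedHashMap<>();
        coefficients.put("age", 2.0);
        coefficients.put("income", -1.25);
        System.out.println(toScoringSql("customers", "customer_id", 0.5, coefficients));
        // prints: SELECT customer_id, 0.5 + (2.0 * age) + (-1.25 * income) AS score FROM customers
    }
}
```

Because the generated statement is executed by the database itself, only the query text crosses the network; the rows being scored never leave the server, which is the trade-off described above.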

In contrast, in-database DMEs, such as Oracle, perform apply and test at a layer below application-level SQL or User Defined Functions. This type of access eliminates the security, process, and data-read overheads typically incurred by application-level code. As a result, in-database scoring can achieve better performance than externally generated code.

15.5 Backup and Recovery

Backup and recovery plans are critical for any IT organization. IT staff are already quite familiar with performing database and file system backups on a regular basis. In-database DMEs or DMEs with database-hosted MORs make backup and recovery a part of normal database maintenance. Where models are maintained separately from the database (e.g., as flat files in either PMML or proprietary formats), users must rely on file system backups or ad hoc procedures.

15.6 Scheduling

In an IT environment where hardware is plentiful and software scales well with the addition of hardware, the decision of when to execute data mining tasks can depend more on business requirements than on technical ones. However, most IT departments have budget constraints, both in hardware purchases and in the personnel needed to manage and maintain their hardware and software. For this reason, scheduling data mining tasks becomes important for using existing hardware efficiently. This section assumes that the data
