The Impact of JDM on IT Infrastructure - Java Data Mining: Strategy, Standard, and Practice

scoring requirements. In some circumstances, there are also incremental model building requirements that need to be factored in.

Consider an application that supports cross-sell, the recommendation of products to individual customers. One possible technique is to build a predictive model for each product to be recommended. If a business has 100 products, this would require the building of 100 models. If customer preferences change frequently, these models may need to be rebuilt, or refreshed, weekly or even daily. Let's say each model is built based on a dataset with 200 attributes and 500,000 cases. If a single model build takes 15 minutes to complete on a particular machine, that means that for the 100 models, this process would take 25 hours to complete if executed serially. If the objective is to rebuild these each night based on the previous day's data for use the following day, multiple such machines can be employed to allow building in parallel. Let's say we have a window of 5 hours in which to build the models; that would require 5 such machines, each building 20 models.
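The build-window arithmetic above can be sketched in a few lines of Java. This is only an illustration under simplifying assumptions: all machines are identical, model builds are independent of one another, and there is no scheduling or data-transfer overhead. The class and method names are hypothetical and are not part of the JDM API.

```java
// Illustrative sizing sketch; names are hypothetical, not part of JDM.
final class BuildWindowSizing {

    /** Minutes needed to build every model one after another on a single machine. */
    static long serialBuildMinutes(int models, int minutesPerModel) {
        return (long) models * minutesPerModel;
    }

    /**
     * Machines needed to finish all builds within the window, assuming identical
     * machines, independent builds, and no scheduling or data-transfer overhead.
     */
    static int machinesNeeded(int models, int minutesPerModel, int windowMinutes) {
        // Whole model builds that one machine can complete inside the window.
        int modelsPerMachine = windowMinutes / minutesPerModel;
        if (modelsPerMachine == 0) {
            throw new IllegalArgumentException("A single build does not fit in the window");
        }
        // Ceiling division: a partially loaded machine still counts as a machine.
        return (models + modelsPerMachine - 1) / modelsPerMachine;
    }

    public static void main(String[] args) {
        int models = 100;
        int minutesPerModel = 15;
        int windowMinutes = 5 * 60;

        long serial = serialBuildMinutes(models, minutesPerModel);
        int machines = machinesNeeded(models, minutesPerModel, windowMinutes);

        System.out.println("Serial build time: " + serial / 60.0 + " hours");
        System.out.println("Machines for a 5-hour window: " + machines);
    }
}
```

With the figures from the example (100 models, 15 minutes each, a 5-hour window), the sketch gives 1,500 minutes of serial work and 5 machines. Counting whole builds per machine, rather than dividing total minutes by the window, reflects the fact that a single model build cannot be split across machines in this simple scheme.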

Model building is often performed on a much smaller number of cases in comparison to the number of cases used when applying models to data for making predictions, or scoring. Some businesses require scoring their entire customer base, where the number of customers may reach beyond 100 million. Moreover, such businesses often have a tight time window in which they can score these customers. The number of models applied to each customer may also increase the performance demands. Consider that for the 100 models built in the previous example, each model will have to be applied to each of the 100 million customers. Moreover, this may need to be completed overnight in an 8-hour time window. If a DME can score a million customers in 6 seconds on a given model, then it will take 600 seconds (10 minutes) to score those customers on all 100 models. With 100 million customers, this will take 1,000 minutes (16.7 hours). To ensure that scoring can be accomplished within the 8-hour time window, the data could be divided among three machines, which will allow the scoring to be completed in less than 6 hours.
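The scoring estimate can be checked the same way. The sketch below assumes that scoring time scales linearly with both the number of customers and the number of models, and that the customer data can be partitioned evenly across identical machines; real throughput will vary with the DME, the algorithm, and I/O. The class and method names are hypothetical.

```java
// Illustrative sizing sketch; names are hypothetical, not part of JDM.
final class ScoringWindowSizing {

    /** Total single-machine scoring time in seconds, assuming linear scaling. */
    static double totalScoringSeconds(double customersInMillions, int models,
                                      double secondsPerMillionPerModel) {
        return customersInMillions * models * secondsPerMillionPerModel;
    }

    /** Fewest identical machines that complete the work within the window. */
    static int machinesNeeded(double totalSeconds, double windowSeconds) {
        return (int) Math.ceil(totalSeconds / windowSeconds);
    }

    /** Elapsed hours when the data is divided evenly across the machines. */
    static double hoursPerMachine(double totalSeconds, int machines) {
        return totalSeconds / machines / 3600.0;
    }

    public static void main(String[] args) {
        // Figures from the example: 100 million customers, 100 models,
        // 6 seconds per million customers per model, 8-hour window.
        double total = totalScoringSeconds(100, 100, 6);
        int machines = machinesNeeded(total, 8 * 3600);

        System.out.println("Single-machine time (minutes): " + total / 60.0);
        System.out.println("Machines for an 8-hour window: " + machines);
        System.out.println("Hours per machine: " + hoursPerMachine(total, machines));
    }
}
```

For the example's figures this yields 60,000 seconds (1,000 minutes) of single-machine work and three machines, each finishing in under 6 hours, which agrees with the text.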

This scenario assumes that the data mining steps have already been defined and are coded for repeatability. Another hardware consideration is the impact on the model building process while data miners are still working out the appropriate data mining steps. More computing hardware can certainly speed up the data transformation, analysis, and model building that a data miner performs, which avoids long delays in seeing results. However, many times
