scoring requirements. In some circumstances, there are also incre-
mental model building requirements that need to be factored in.
Consider an application that supports cross-sell—the recommen-
dation of products to individual customers. One possible technique
is to build a predictive model for each product to be recommended. If
a business has 100 products, this would require the building of 100
models. If customer preferences change frequently, these models
may need to be rebuilt, or refreshed, weekly or even daily. Let's say
each model is built based on a dataset with 200 attributes and 500,000
cases. If a single model build takes 15 minutes to complete on a par-
ticular machine, that means that for the 100 models, this process
would take 25 hours to complete if executed serially. If the objective
is to rebuild these each night based on the previous day's data for use
the following day, multiple such machines can be employed to allow
building in parallel. Let's say we have a window of 5 hours in which
to build the models; that would require 5 such machines, each
building 20 models.
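
The arithmetic above can be sketched in a few lines of Java. This is only an illustrative estimate: the class and method names are hypothetical, and it assumes identical machines, a fixed build time per model, and no coordination overhead.

```java
// Illustrative build-capacity estimate; all names here are hypothetical.
final class BuildCapacity {

    /** Total time in minutes if every model is built one after another. */
    static long serialMinutes(int models, int minutesPerModel) {
        return (long) models * minutesPerModel;
    }

    /** Smallest number of identical machines that finishes within the window. */
    static long machinesNeeded(int models, int minutesPerModel, int windowHours) {
        long windowMinutes = windowHours * 60L;
        // Whole models a single machine can build inside the window.
        long modelsPerMachine = windowMinutes / minutesPerModel;
        if (modelsPerMachine == 0) {
            throw new IllegalArgumentException("window is shorter than one model build");
        }
        // Ceiling division: a partially loaded machine still counts as a machine.
        return (models + modelsPerMachine - 1) / modelsPerMachine;
    }

    public static void main(String[] args) {
        long serial = serialMinutes(100, 15);
        System.out.printf("Serial build: %d minutes (%.1f hours)%n", serial, serial / 60.0);
        System.out.printf("Machines for a 5-hour window: %d%n", machinesNeeded(100, 15, 5));
    }
}
```

With the figures from the example (100 models, 15 minutes each, a 5-hour window), this gives 1,500 serial minutes and 5 machines.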
Model building often uses far fewer cases than scoring, that is,
applying models to data to make predictions. Some businesses
require scoring their entire customer base, where the number of
customers may reach beyond 100 million. Moreover, such busi-
nesses often have a tight time window in which they can score these
customers. The number of models applied to each customer may
also increase the performance demands. Consider that for the 100
models built in the previous example, each model will have to be
applied to each of the 100 million customers. Moreover, this may
need to be completed overnight in an 8-hour time window. If a data
mining engine (DME) can score a million customers in 6 seconds on a given model,
then it will take 600 seconds (10 minutes) to score those customers
on all 100 models. With 100 million customers, this will take
1,000 minutes (16.7 hours). To ensure that scoring can be accomplished
within the 8-hour time window, the data could be divided
among three machines, which would allow the scoring to be
completed in about 5.6 hours.
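
The scoring estimate can be checked the same way. The sketch below is a rough calculation under stated assumptions, not a DME API: the names are hypothetical, scoring time is assumed to scale linearly with customers and models, and the data is assumed to split evenly across machines.

```java
// Illustrative scoring-capacity estimate; all names here are hypothetical.
final class ScoringCapacity {

    /** Total scoring time in seconds on one machine, assuming linear scaling. */
    static long totalSeconds(long customersInMillions, int models, long secondsPerMillionPerModel) {
        return customersInMillions * models * secondsPerMillionPerModel;
    }

    /** Smallest number of machines that fits the work into the window. */
    static long machinesNeeded(long totalSeconds, int windowHours) {
        long windowSeconds = windowHours * 3600L;
        return (totalSeconds + windowSeconds - 1) / windowSeconds; // ceiling division
    }

    /** Elapsed hours when the data is split evenly across the machines. */
    static double hoursPerMachine(long totalSeconds, long machines) {
        return totalSeconds / (double) machines / 3600.0;
    }

    public static void main(String[] args) {
        long total = totalSeconds(100, 100, 6); // 100 million customers, 100 models
        long machines = machinesNeeded(total, 8);
        System.out.printf("Single machine: %.1f hours%n", total / 3600.0);
        System.out.printf("Machines for an 8-hour window: %d%n", machines);
        System.out.printf("Elapsed time per machine: %.1f hours%n", hoursPerMachine(total, machines));
    }
}
```

For the figures in the example, the total is 60,000 seconds (1,000 minutes). Two machines would need about 8.3 hours each and miss the window, so three is the minimum under these assumptions.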
This scenario assumes that the data mining steps have already
been defined and are coded for repeatability. Another hardware con-
sideration is the impact on the model building process when data
miners are trying to come up with the appropriate data mining steps.
More computing hardware can certainly speed up a data miner's data
transformation, analysis, and model building, which avoids
long delays in seeing results. However, many times