Sampling
Companies are amassing huge volumes of data on everything from manufacturing processes and product defects to a 360-degree view of operations and customers. However, some algorithms are better at dealing with large volumes of data than others. Similarly, some implementations scale better than others. In general, the more data that needs to be processed, the longer it takes to build a model, and the more computer memory and disk space it is likely to require. One way to reduce the time and resources needed to build a model is to take a sample of the data. This is especially useful in the early phases of model building, when a user should get a feel for how a particular algorithm responds to the provided data. Building a model on 1 million customers may take minutes or hours depending on the technique. It is better to get a quick assessment from a small sample of whether the mining technique yields any results than to wait for dubious results on the full dataset. Once the user is convinced the technique and the data are appropriate, a model can be built on more or all of the data with greater expectation of success. In other cases, a sample of the data may be all that is required to produce a good model. For example, if you have a population of 10,000,000 customers, it may not be necessary to build a clustering model on all 10,000,000 in order to segment your customer base. A sample of even 50,000 may produce statistically sound results.
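To make the idea concrete, the following is a minimal sketch of drawing a simple random sample from records that have already been loaded into memory. The class and method names are illustrative assumptions, not part of any particular data mining API; in practice, sampling of very large tables is usually pushed down to the database or the mining engine rather than done in application memory.

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Illustrative sketch: draw a simple random sample of the requested size
// from an in-memory list of records.
public class SimpleSampler {
    public static <T> List<T> randomSample(List<T> records, int sampleSize, long seed) {
        List<T> copy = new ArrayList<>(records);
        Collections.shuffle(copy, new Random(seed));   // randomize record order
        return copy.subList(0, Math.min(sampleSize, copy.size()));
    }
}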
When randomly sampling records, there is no guarantee that the sample will contain all the possible values of a given attribute. This is particularly important when building classification models. For a model to be able to predict a given target value, it needs to have learned from data that contains examples of that value. In addition, a dataset skewed toward one category may not allow a given algorithm to learn the desired pattern (i.e., the negative signal drowns out the positive).
Recall the sampling technique called stratified sampling, introduced in Section 2.1.4, which allows you to specify how many records of each target value to include in the resulting data sample. Consider a dataset with the target attribute customer satisfaction. The goal is to predict a given customer's satisfaction level from other customer demographics and customer experience metrics. If the values are high, medium, and low, we should ensure we have a reasonable number of each category. In Figure 3-6(a), we see a histogram of the original data: high (4,654 cases), medium (130,954 cases), and low (50,348 cases). We can then sample the data to ensure that we have a sufficient number of cases for each target value.