Sampling
Companies are amassing huge volumes of data on everything from manufacturing processes and product defects to a 360-degree view of operations and customers. However, some algorithms are better at dealing with large volumes of data than others. Similarly, some implementations scale better than others. In general, the more data that needs to be processed, the longer it takes to build a model, and the more computer memory and disk space it is likely to require. One way to reduce the time and resources needed to build a model is to take a sample of the data. This is especially useful in the early phases of model building, when a user should get a feel for how a particular algorithm responds to the provided data. Building a model on 1 million customers may take minutes or hours depending on the technique. It is better to get a quick assessment from a small sample of whether the mining technique yields any results than to wait for dubious results on the full dataset. Once the user is convinced the technique and the data are appropriate, a model can be built on more or all of the data with greater expectation of success. In other cases, a sample of the data may be all that is required to produce a good model. For example, if you have a population of 10,000,000 customers, it may not be necessary to build a clustering model on all 10,000,000 in order to segment your customer base. A sample of even 50,000 may produce statistically sound results.
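To make the idea concrete, the following is a minimal sketch of drawing a simple random sample from records that have already been loaded into memory. The class and method names are illustrative assumptions, not part of any particular data mining API; in practice, sampling of very large tables is usually pushed down to the database or the mining engine rather than done in application memory.

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Illustrative sketch: draw a simple random sample of the requested size
// from an in-memory list of records.
public class SimpleSampler {
    public static <T> List<T> randomSample(List<T> records, int sampleSize, long seed) {
        List<T> copy = new ArrayList<>(records);
        Collections.shuffle(copy, new Random(seed));   // randomize record order
        return copy.subList(0, Math.min(sampleSize, copy.size()));
    }
}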
When randomly sampling records, there is no guarantee that the sample will contain all the possible values of a given attribute. This is particularly important when building classification models. For a model to be able to predict a given target value, it needs to have learned from data that contains examples of that value. In addition, a dataset skewed toward one category may not allow a given algorithm to learn the desired pattern (i.e., the negative signal drowns out the positive).
Recall the sampling technique called stratified sampling, introduced in Section 2.1.4, which allows you to specify how many records of each target value to include in the resulting data sample. Consider a dataset with the target attribute customer satisfaction. The goal is to predict a given customer's satisfaction level from other customer demographics and customer experience metrics. If the values are high, medium, and low, we should ensure we have a reasonable number of each category. In Figure 3-6(a), we see a histogram of the original data: high (4,654 cases), medium (130,954 cases), and low (50,348 cases). We can then sample the data to ensure that we have a sufficient number of cases for each target value.