Figure 3-6 Customer satisfaction data: histogram of target attribute. (a) Case Count; (b) Stratified Sample Case Count. Categories: High, Medium, Low.
relatively equal number of each category. Since we have few cases in
the high category, we will want to use all of those, and we may decide
to take 8,000 of the medium cases, and 6,000 of the low cases, as illus-
trated in Figure 3-6(b). The “correct” number of cases to specify is
more of an art than a science. Trying several variations can help iden-
tify an appropriate mix.
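The capped stratified sample described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the raw case counts, the attribute name `satisfaction`, and the helper `stratified_sample` are hypothetical; only the caps of 8,000 medium and 6,000 low cases, and the choice to keep every high case, come from the text.

```python
import random
from collections import Counter

def stratified_sample(cases, target_key, caps, seed=0):
    """Keep every case of an uncapped category; randomly draw `cap` cases otherwise."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    by_category = {}
    for case in cases:
        by_category.setdefault(case[target_key], []).append(case)

    sample = []
    for category, members in by_category.items():
        cap = caps.get(category)
        if cap is None or cap >= len(members):
            sample.extend(members)                   # rare category: use all cases
        else:
            sample.extend(rng.sample(members, cap))  # draw without replacement
    return sample

# Hypothetical raw counts; the text does not state them.
raw_counts = {"high": 5000, "medium": 30000, "low": 15000}
cases = [
    {"case_id": f"{category}-{i}", "satisfaction": category}
    for category, n in raw_counts.items()
    for i in range(n)
]

# Caps taken from the text: 8,000 medium and 6,000 low; high is left uncapped.
caps = {"medium": 8000, "low": 6000}
sample = stratified_sample(cases, "satisfaction", caps)
counts = Counter(case["satisfaction"] for case in sample)
```

Changing the values in `caps` and rebuilding the model is one way to try the several variations suggested above.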
In several of the transformations above, we discussed replacing one
value with another. In its most general form, this is called recoding.
The binning transformation on categorical data mentioned above
(the 50 United States into regions) is also a form of recoding that
enables manual roll-up of data. Typically, recoding is performed on
categorical data (e.g., replacing the attribute values H, high, hi, and
“***” with HIGH). This can be useful for cleaning data, as in the
previous example, or to improve the interpretability of a model. For
example, when looking at the rules produced from market basket
analysis, it is much easier to understand a rule like “BEER implies
PIZZA” than one like “Prod-3425 implies Prod-5593.” Recoding can
also be useful for numerical data when non-numerical or placeholder
values are mixed in with the numbers. For example, it is not uncommon
to see “999” or “ ” (a blank) used for a missing age value. These are
more appropriately replaced with null.
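The three uses of recoding mentioned here can be sketched as simple lookups. This is a minimal illustration, not a prescribed method: the function names and lookup tables are hypothetical, the product codes and values are those used as examples in the text, and Python's `None` stands in for null.

```python
# Cleaning: collapse inconsistent spellings of one category (values from the text).
HIGH_SYNONYMS = {"H", "high", "hi", "***"}

def recode_satisfaction(value):
    """Map any known spelling of the high category to HIGH; leave others unchanged."""
    return "HIGH" if value in HIGH_SYNONYMS else value

# Interpretability: replace product codes with readable names before showing rules.
PRODUCT_NAMES = {"Prod-3425": "BEER", "Prod-5593": "PIZZA"}

def describe_rule(antecedent, consequent):
    """Render a rule using readable names, falling back to the raw code."""
    left = PRODUCT_NAMES.get(antecedent, antecedent)
    right = PRODUCT_NAMES.get(consequent, consequent)
    return f"{left} implies {right}"

# Missing values: treat 999 and blank strings as null (None) in an age column.
MISSING_AGE_MARKERS = {"999", ""}

def recode_age(raw):
    """Return the age as an int, or None when the value is a missing-value marker."""
    if raw is None:
        return None
    text = str(raw).strip()
    if text in MISSING_AGE_MARKERS:
        return None
    return int(text)

satisfaction = [recode_satisfaction(v) for v in ["H", "high", "hi", "***", "LOW"]]
rule = describe_rule("Prod-3425", "Prod-5593")
ages = [recode_age(v) for v in ["34", "999", " ", 51]]
```

Note that treating 999 as missing assumes no case has a genuine value of 999, which is safe for age but would need checking for other attributes.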
An important step in data preparation often involves integrating two
or more datasets into one. If the data contain well-defined keys and
use the same data conventions, integrating the data can be as simple
as performing a database join on the two tables, producing a new
table or view. However, real data is seldom so clean and often requires
the data cleaning techniques noted in the previous section on errors
and outliers before the tables can be joined.
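For the clean case, where both datasets share a well-defined key, the join can be sketched as follows. The table contents, the key name `customer_id`, and the helper `inner_join` are hypothetical, and the sketch assumes both tables use the same conventions for the key; in a database the same result would come from an ordinary join on that key.

```python
def inner_join(left, right, key):
    """Combine rows from two tables (lists of dicts) that share the same key value."""
    # Index the right-hand table by key so each left row is matched in one lookup.
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)

    joined = []
    for left_row in left:
        for right_row in index.get(left_row[key], []):
            joined.append({**left_row, **right_row})  # merge the two rows' attributes
    return joined

# Hypothetical tables sharing the key customer_id.
customers = [
    {"customer_id": 1, "age": 34},
    {"customer_id": 2, "age": 51},
    {"customer_id": 3, "age": 28},
]
survey = [
    {"customer_id": 1, "satisfaction": "HIGH"},
    {"customer_id": 3, "satisfaction": "LOW"},
    {"customer_id": 4, "satisfaction": "MEDIUM"},
]

integrated = inner_join(customers, survey, "customer_id")
```

Rows without a match on either side (customers 2 and 4 here) are dropped by an inner join, so comparing row counts before and after the join is a quick check for key mismatches.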