if customers are identified by name and address, and addresses are
not entered correctly or consistently across data sources. Having
common unique identifiers greatly simplifies this effort.
JDM supports data understanding through its statistics interface.
Users can compute statistics such as mean, median, and standard
deviation, as well as frequency counts, on all attributes in a dataset.
Since these statistics are collected on individual attributes, they are
referred to as univariate statistics. These statistics can be inspected
directly as numerical values provided through the API, or through a
graphical interface provided by a vendor tool. JDM 2.0 [JSR-247], as
discussed in Chapter 18, further extends the statistics interface to
include multivariate statistics (i.e., those involving two or more
attributes). JDM specifies that data be presented as a single table;
however, vendors are free to extend this capability to support multiple
tables, online analytical processing (OLAP) cubes, or nested tables.
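
To make the idea of univariate statistics concrete, the following is a minimal sketch in plain Java. It does not use the JDM statistics interface; the class name UnivariateStats, its method names, and the sample data are hypothetical and chosen only for illustration. It also assumes the sample (n - 1) form of the standard deviation, which a given tool may or may not use.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: plain Java, NOT the JDM statistics interface.
final class UnivariateStats {

    // Arithmetic mean of a numeric attribute.
    static double mean(double[] values) {
        double sum = 0.0;
        for (double v : values) {
            sum += v;
        }
        return sum / values.length;
    }

    // Median: the middle value of the sorted data, or the average
    // of the two middle values when the count is even.
    static double median(double[] values) {
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        int n = sorted.length;
        return (n % 2 == 1)
                ? sorted[n / 2]
                : (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0;
    }

    // Sample standard deviation (assumption: divides by n - 1);
    // requires at least two values.
    static double standardDeviation(double[] values) {
        double m = mean(values);
        double sumSq = 0.0;
        for (double v : values) {
            sumSq += (v - m) * (v - m);
        }
        return Math.sqrt(sumSq / (values.length - 1));
    }

    // Frequency count of each distinct value of a categorical attribute.
    static Map<String, Integer> frequencies(List<String> values) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String v : values) {
            counts.merge(v, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical attribute values.
        double[] age = {23, 35, 35, 41, 58};
        List<String> status = Arrays.asList("married", "single", "married");

        System.out.println("mean   = " + mean(age));
        System.out.println("median = " + median(age));
        System.out.println("stddev = " + standardDeviation(age));
        System.out.println("counts = " + frequencies(status));
    }
}
```

Each method looks at one attribute at a time, which is what makes these statistics univariate. A multivariate statistic, such as a correlation, would take two or more attributes as input.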
Data Preparation Phase
Once the problem is defined and we believe there is reasonable data
to support solving that problem, we enter the data preparation phase.
In this phase, one goal is to produce one or more datasets suitable for
mining from the raw data identified in the data understanding phase.
Through an iterative approach, many such datasets may need to be
produced with various refinements to achieve the desired model
quality. Data transformations within data preparation can be as
simple as ensuring that similar values are coded the same way (e.g.,
“married,” “Married,” and “M” are all converted to “married” so
they are considered the same value). This type of data cleaning is
essential to avoid poor results such as inaccurate predictions or
meaningless clusters. As noted in Chapter 1, the adage “garbage in,
garbage out” is nowhere more fitting than in data mining.
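
The value-recoding step described above might be sketched as follows. This is plain Java rather than a JDM transformation; the class name ValueCleaner and the synonym table are hypothetical. Note the assumption that “M” means “married” in this particular attribute; in a gender attribute the same code would mean something else, so such tables should be defined per attribute.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Illustrative only: maps inconsistent codings of one attribute
// (marital status) to a single canonical value.
final class ValueCleaner {

    // Hypothetical synonym table: trimmed, lowercased raw value -> canonical value.
    private static final Map<String, String> CANONICAL = new HashMap<>();
    static {
        CANONICAL.put("married", "married");
        CANONICAL.put("m", "married"); // assumption: "M" means married for this attribute
    }

    // Returns the canonical value for a raw value. Values with no entry
    // in the table are returned trimmed and lowercased; null stays null.
    static String normalize(String raw) {
        if (raw == null) {
            return null;
        }
        String key = raw.trim().toLowerCase(Locale.ROOT);
        return CANONICAL.getOrDefault(key, key);
    }

    public static void main(String[] args) {
        for (String raw : new String[] {"married", "Married", "M"}) {
            System.out.println(raw + " -> " + normalize(raw));
        }
    }
}
```

Without this step, a mining algorithm would treat the three codings as three distinct categories and split the evidence for a single real-world value across them.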
At the other end of the spectrum, data preparation may involve
computing missing values or deriving new attributes from others. For
example, we might define a new target attribute called ATTRITER as
“yes” if 50 percent or more of a customer's accounts have been closed
within the past month, and “no” otherwise. The amount of data
preparation can vary from virtually none, where data mining tools
perform automated data preparation, to extremely elaborate preparation
involving complex or creative transformations. The effort required by
the data preparation phase can often dwarf the effort required by the
other phases, depending on how dirty or primitive the data is. We
expand our discussion of data preparation in Section 3.2.
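
As a small illustration of deriving a new attribute, the ATTRITER rule above could be sketched in plain Java as follows. The class name AttriterDerivation, the method signature, and the two input counts are hypothetical. The sketch also assumes that a customer with no accounts is invalid input, a case the rule itself does not address.

```java
// Illustrative only: derives the ATTRITER target attribute from two
// hypothetical per-customer counts.
final class AttriterDerivation {

    // Returns "yes" if 50 percent or more of the customer's accounts
    // were closed within the past month, and "no" otherwise.
    static String attriter(int totalAccounts, int closedLastMonth) {
        // Assumption: a customer must have at least one account, and
        // cannot have closed more accounts than he or she holds.
        if (totalAccounts <= 0 || closedLastMonth < 0
                || closedLastMonth > totalAccounts) {
            throw new IllegalArgumentException("invalid account counts");
        }
        // closed / total >= 0.5 is equivalent to 2 * closed >= total;
        // the integer form avoids floating-point rounding.
        return (2L * closedLastMonth >= totalAccounts) ? "yes" : "no";
    }

    public static void main(String[] args) {
        System.out.println(attriter(4, 2)); // exactly half closed
        System.out.println(attriter(4, 1)); // one quarter closed
    }
}
```

In practice the two counts would themselves be computed from raw account records, which is part of why deriving attributes can take considerable effort.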