Data Mining Process - Java Data Mining: Strategy, Standard, and Practice

if customers are identified with name and address and addresses are not entered correctly or consistently across data sources. Having common unique identifiers greatly simplifies this effort.

JDM supports data understanding through its statistics interface. Users can compute statistics such as mean, median, and standard deviation, as well as frequency counts, on all attributes in a dataset. Since these statistics are collected on individual attributes, they are referred to as univariate statistics. These statistics can be inspected directly as numerical values provided through the API, or through a graphical interface provided by a vendor tool. JDM 2.0 [JSR-247], as discussed in Chapter 18, further extends the statistics interface to include multivariate statistics (i.e., those involving two or more attributes). JDM specifies data to be presented as a single table; however, vendors are free to extend this capability to support multiple tables, online analytical processing (OLAP) cubes, or nested tables.
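To make the term univariate statistics concrete, the following is a minimal sketch in plain Java of the kinds of values described above, computed one attribute at a time. It deliberately does not use the JDM statistics interface; the class name UnivariateStats and its method names are hypothetical, and the standard deviation is assumed to be the sample form (dividing by n - 1).

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of JDM: univariate statistics on one attribute.
class UnivariateStats {

    // Arithmetic mean of a numeric attribute; NaN for an empty attribute.
    static double mean(double[] values) {
        return Arrays.stream(values).average().orElse(Double.NaN);
    }

    // Median: the middle sorted value, or the average of the two middle values.
    static double median(double[] values) {
        if (values.length == 0) return Double.NaN;
        double[] sorted = values.clone();   // leave the caller's array untouched
        Arrays.sort(sorted);
        int mid = sorted.length / 2;
        return sorted.length % 2 == 1
                ? sorted[mid]
                : (sorted[mid - 1] + sorted[mid]) / 2.0;
    }

    // Sample standard deviation (assumption: n - 1 in the denominator).
    static double standardDeviation(double[] values) {
        if (values.length < 2) return Double.NaN;
        double m = mean(values);
        double sumSq = 0.0;
        for (double v : values) sumSq += (v - m) * (v - m);
        return Math.sqrt(sumSq / (values.length - 1));
    }

    // Frequency counts for a categorical attribute, sorted by value.
    static Map<String, Integer> frequencies(List<String> values) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String v : values) counts.merge(v, 1, Integer::sum);
        return counts;
    }

    public static void main(String[] args) {
        double[] age = {23, 35, 35, 41, 58};   // illustrative data only
        System.out.println("mean   = " + mean(age));
        System.out.println("median = " + median(age));
        System.out.println("stddev = " + standardDeviation(age));
        System.out.println(frequencies(List.of("married", "single", "married")));
    }
}
```

Each method looks at a single column in isolation, which is what distinguishes these statistics from the multivariate ones added in JDM 2.0.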

3.1.3 Data Preparation Phase

Once the problem is defined and we believe there is reasonable data to support solving that problem, we enter the data preparation phase. In this phase, one goal is to produce one or more datasets suitable for mining from the raw data identified in the data understanding phase. Through an iterative approach, many such datasets may need to be produced with various refinements to achieve the desired model quality. Data transformations within data preparation can be as simple as ensuring that similar values are coded the same way (e.g., “married,” “Married,” and “M” are all converted to “married” so they are considered the same value). This type of data cleaning is essential to avoid poor results such as inaccurate predictions or meaningless clusters. As noted in Chapter 1, the adage “garbage in, garbage out” is nowhere more fitting than in data mining.
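The recoding described above can be sketched as a small lookup applied to each raw value. This is only an illustration under stated assumptions: the class MaritalStatusCleaner and its mapping table are hypothetical, the mapping applies to a single attribute (in another column “M” might mean something else), and unmapped values are simply trimmed and lowercased.

```java
import java.util.Locale;
import java.util.Map;

// Hypothetical cleaner for one attribute (marital status); not a JDM class.
class MaritalStatusCleaner {

    // Assumed mapping from trimmed, lowercased raw spellings to the canonical code.
    private static final Map<String, String> CANONICAL = Map.of(
            "married", "married",
            "m", "married");

    // Returns the canonical code, or the trimmed lowercase value if no mapping exists.
    static String normalize(String raw) {
        if (raw == null) return null;   // missing values are left for a later step
        String key = raw.trim().toLowerCase(Locale.ROOT);
        return CANONICAL.getOrDefault(key, key);
    }

    public static void main(String[] args) {
        for (String raw : new String[] {"married", "Married", "M"}) {
            System.out.println(raw + " -> " + normalize(raw));
        }
    }
}
```

Keeping the mapping in a table rather than in conditional logic makes it easy to review and extend as new spellings turn up in the data.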

At the other end of the spectrum, data preparation may involve computing missing values or deriving new attributes from others. For example, we might define a new target attribute called ATTRITER whose value is “yes” if 50 percent or more of a customer's accounts have been closed within the past month, and “no” otherwise. The amount of data preparation can vary from virtually none, where data mining tools perform automated data preparation, to extremely elaborate preparation involving complex or creative transformations. The effort required by the data preparation phase can often dwarf the effort required by the other phases, depending on how dirty or primitive the data is. We expand our discussion of data preparation in Section 3.2.
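As an illustration of deriving a new attribute, the ATTRITER rule can be sketched in plain Java as follows. The Account class, the attriter method, and the reading of “within the past month” as the one-month window ending on a given as-of date are all assumptions made for this example; a customer with no accounts is assumed to be “no”.

```java
import java.time.LocalDate;
import java.util.List;

// Hypothetical derivation of the ATTRITER target attribute; not a JDM class.
class AttriterDeriver {

    // Assumed account record: closedOn is null while the account is still open.
    static final class Account {
        final String id;
        final LocalDate closedOn;

        Account(String id, LocalDate closedOn) {
            this.id = id;
            this.closedOn = closedOn;
        }
    }

    // "yes" if at least half of the customer's accounts closed in the month before asOf.
    static String attriter(List<Account> accounts, LocalDate asOf) {
        if (accounts.isEmpty()) return "no";   // assumption: no accounts means no attrition
        LocalDate cutoff = asOf.minusMonths(1);
        long recentlyClosed = accounts.stream()
                .filter(a -> a.closedOn != null)
                .filter(a -> a.closedOn.isAfter(cutoff) && !a.closedOn.isAfter(asOf))
                .count();
        // Compare 2 * closed with the total to avoid floating-point division.
        return 2 * recentlyClosed >= accounts.size() ? "yes" : "no";
    }

    public static void main(String[] args) {
        LocalDate asOf = LocalDate.of(2024, 6, 15);   // illustrative date only
        List<Account> accounts = List.of(
                new Account("A1", LocalDate.of(2024, 6, 1)),
                new Account("A2", null));
        System.out.println("ATTRITER = " + attriter(accounts, asOf));
    }
}
```

In practice a derivation like this is usually computed once per customer and stored as a column in the single table that is presented for mining.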
