Data Mining Process - Java Data Mining: Strategy, Standard, and Practice

JDM addresses aspects of this first phase by providing a framework
for thinking about data mining problems in terms of mining functions:
the inputs they require and the outputs they produce. The business
problem itself requires domain-specific knowledge and creativity to
decide what should be done. JDM, being an application programming
interface (API), enables the specification of how the solution will be
implemented, and through the use of settings and data objects JDM can
assist in the capture of some outputs from the business understanding
phase.

3.1.2 Data Understanding Phase

Once we understand the problem and the expected results, we need to
determine what data is available, its quality, and its appropriateness
for solving the stated business problem. This is covered in the data
understanding phase. Once the data is better understood, the problem
often needs to be refined, or even redefined. Important data may be
missing or corrupt; such data is referred to as dirty. This may result
in new requirements to clean the data or to obtain new data, or
different types of data, with careful attention paid to accuracy and
completeness.
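As a rough illustration of checking for dirty data, the sketch below counts
how many records lack a value for each attribute. It is plain Java and not
part of JDM; the class name MissingValueProfile, the method countMissing, and
the representation of a record as a map from attribute name to string value
are all assumptions made for this example.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only; not a JDM class.
class MissingValueProfile {

    /** Counts, per attribute, how many records have no value (absent, null, or blank). */
    static Map<String, Integer> countMissing(List<String> attributes,
                                             List<Map<String, String>> records) {
        Map<String, Integer> missing = new LinkedHashMap<>();
        for (String attribute : attributes) {
            missing.put(attribute, 0);
        }
        for (Map<String, String> record : records) {
            for (String attribute : attributes) {
                String value = record.get(attribute);
                if (value == null || value.trim().isEmpty()) {
                    missing.merge(attribute, 1, Integer::sum);
                }
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Made-up sample records: one blank age and one record with no age at all.
        List<Map<String, String>> records = List.of(
                Map.of("age", "34", "income", "52000"),
                Map.of("age", "", "income", "61000"),
                Map.of("income", "48000"));
        System.out.println(countMissing(List.of("age", "income"), records));
        // prints {age=2, income=0}
    }
}
```

A high missing count for an attribute is one signal that the data may need
to be cleaned, or that new data may need to be obtained, before mining.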

With data understanding, we strive to gain insights into the data
through basic and possibly advanced statistical methods. For example,
we need to understand the range of values in each attribute as well as
frequency counts of values, often referred to as the data
distribution. Continuous attributes, like age and income, may be
bucketized (or binned) to provide a better sense of the overall
distribution. Frequency counts provide insight into the existence of
extreme values, called outliers, that can adversely affect data mining
results. We also need to assess how the data should be interpreted;
for example, should a numeric attribute be treated as a continuous
value, like age or income, or as a discrete value, such as a movie
rating or a response to a multiple-choice survey question? Some data
mining tools will automatically address issues such as outliers and
missing values, as well as provide heuristics for guessing how the
data should be interpreted.
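The binning and interpretation ideas above can be sketched in plain Java.
This is a minimal example under stated assumptions, not a JDM API: the class
AttributeProfile, the methods binCounts and looksDiscrete, the choice of
equal-width bins, and the distinct-value threshold are all hypothetical, and
real tools may use different binning strategies and heuristics.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper for illustration only; not a JDM class.
class AttributeProfile {

    /** Equal-width binning: returns how many values fall into each of binCount bins. */
    static int[] binCounts(double[] values, int binCount) {
        if (values.length == 0 || binCount <= 0) {
            throw new IllegalArgumentException("need at least one value and one bin");
        }
        double min = Arrays.stream(values).min().getAsDouble();
        double max = Arrays.stream(values).max().getAsDouble();
        double width = (max - min) / binCount;
        int[] counts = new int[binCount];
        for (double v : values) {
            // When all values are equal, width is 0 and everything goes in bin 0.
            int bin = (width == 0) ? 0 : (int) ((v - min) / width);
            // The maximum value lands on the upper edge, so clamp it into the last bin.
            counts[Math.min(bin, binCount - 1)]++;
        }
        return counts;
    }

    /**
     * Assumed heuristic: guess "discrete" when every value is a whole number
     * and there are at most maxDistinct different values.
     */
    static boolean looksDiscrete(double[] values, int maxDistinct) {
        Set<Double> distinct = new HashSet<>();
        for (double v : values) {
            if (v != Math.rint(v)) {
                return false; // a fractional value suggests a continuous attribute
            }
            distinct.add(v);
        }
        return distinct.size() <= maxDistinct;
    }

    public static void main(String[] args) {
        // Made-up sample data for the example.
        double[] ages = {18, 22, 25, 31, 38, 44, 47, 52, 67, 95};
        double[] ratings = {1, 2, 3, 3, 4, 5, 5, 2};

        System.out.println(Arrays.toString(binCounts(ages, 4))); // prints [4, 4, 1, 1]
        System.out.println(looksDiscrete(ratings, 5));           // prints true
        System.out.println(looksDiscrete(ages, 5));              // prints false
    }
}
```

In the sample output, the sparsely populated upper bins hint at extreme
values such as the age of 95. The discrete-versus-continuous guess depends
entirely on the chosen threshold, which is why such heuristics still
benefit from a human check.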

In many situations, data may be coming from multiple sources
that need to be integrated before further analysis is possible. It is at
this point that data inconsistencies may be most pronounced because
the joining of data tables may be hindered if keys are not properly
maintained. For example, joining two tables based on customer name
may prove impossible if names such as “John Smith” are common, or
