JDM addresses aspects of this first phase by providing a framework
for thinking about data mining problems in terms of mining functions:
the inputs they require and the outputs they produce. The business
problem itself requires domain-specific knowledge and creativity to
decide what should be done. JDM, being an application programming
interface (API), enables the specification of how the solution will be
implemented, and through the use of settings and data objects JDM can
assist in the capture of some outputs from the business understanding
phase.
3.1.2 Data Understanding Phase
Once we understand the problem and expected results, we need to
determine what data is available, its quality, and appropriateness for
solving the stated business problem. This is covered in the data
understanding phase. Often, once the data is better understood, the
problem may need to be refined, or even redefined. Important data
may be missing or corrupt; such data is referred to as dirty. This may
result in new requirements to clean the data or to obtain new data,
or different types of data, with careful attention paid to accuracy or
completeness.
With data understanding, we strive to gain insights into the data
through basic and possibly advanced statistical methods. For example,
we need to understand the range of values in each attribute as
well as frequency counts of values, often referred to as the data
distribution. Continuous attributes, like age and income, may be
bucketized (or binned) to provide a better sense of the overall
distribution. Frequency counts provide insight into the existence of
extreme values, called outliers, that can adversely affect data mining
results. We also need to assess how the data should be interpreted; for
example, should a numeric attribute be treated as a continuous value,
like age or income, or a discrete value, perhaps a movie rating or a
multiple-choice survey question response? Some data mining tools will
automatically address issues such as outliers and missing values, as
well as provide heuristics for guessing how the data should be
interpreted.
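
The frequency counts and binning described above can be illustrated with a minimal sketch in plain Java. It does not use the JDM API; the class name, method names, and sample values are hypothetical and chosen only for illustration, and equal-width binning is just one of several possible binning strategies.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper class for illustration; not part of JDM.
final class DataUnderstandingSketch {

    // Counts how often each distinct value occurs (suits discrete attributes).
    static Map<String, Integer> frequencyCounts(String[] values) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String v : values) {
            counts.merge(v, 1, Integer::sum);
        }
        return counts;
    }

    // Equal-width binning: splits [min, max] into binCount intervals of the
    // same width and counts the values that fall into each interval.
    static int[] equalWidthBins(double[] values, int binCount) {
        double min = Arrays.stream(values).min().getAsDouble();
        double max = Arrays.stream(values).max().getAsDouble();
        double width = (max - min) / binCount;
        int[] bins = new int[binCount];
        for (double v : values) {
            int idx = (width == 0) ? 0 : (int) ((v - min) / width);
            if (idx >= binCount) {
                idx = binCount - 1; // the maximum value belongs to the last bin
            }
            bins[idx]++;
        }
        return bins;
    }

    public static void main(String[] args) {
        // Made-up sample data: a discrete rating and a continuous age attribute.
        String[] ratings = {"3", "5", "4", "5", "5", "3"};
        double[] ages = {22, 25, 31, 38, 41, 47, 52, 95};

        System.out.println(frequencyCounts(ratings));
        System.out.println(Arrays.toString(equalWidthBins(ages, 4)));
    }
}
```

With this sample data, the third of the four age bins is empty and the last bin holds a single value (95). A gap like this in a binned distribution is one simple hint that an outlier may be present.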
In many situations, data may be coming from multiple sources
that need to be integrated before further analysis is possible. It is at
this point that data inconsistencies may be most pronounced because
the joining of data tables may be hindered if keys are not properly
maintained. For example, joining two tables based on customer name
may prove impossible if names such as “John Smith” are common, or