Fig. 1.4. Relative effort spent on each of the DMKD steps.
between data preparation and data mining, which he called data audit. The six-step
model has the advantage of being similar to the CRISP-DM model that was
validated on large business applications. The model has also been used in several
projects, such as building a system for diagnosing SPECT bull's-eye images [18],
creating and mining a database of cardiac SPECT images [61], creating an automated
diagnostic system for cardiac SPECT images [44], and mining clinical information
concerning cystic fibrosis patients [47].
An important characteristic of the DMKD process is the relative time spent on
completing each of the steps. Reference [16] estimates that about 20% of the effort
is spent on business objective determination, about 60% on data preparation, and
about 10% each on data mining and on the analysis of knowledge and knowledge
assimilation steps. On the other hand, [10] shows that about 15 to 25% of the
project time is spent on the DM step. It is usually assumed that about 50% of the
project effort is spent on data preparation. There are several reasons why this step
requires so much time: data collected by enterprises contain about 1 to 5% errors,
the data are often redundant (especially across databases) and inconsistent, and
companies may not collect all the necessary data [57]. These serious data quality
problems contribute to the extensive requirements of the data preprocessing step.
In contrast, a study at a Canadian fast-food company [39] showed that the DM step
took about 45% of the total project effort, while data preparation took only about
30%. Because the reported figures vary this much, it is better to use ranges rather
than fixed values when estimating the effort required by each step; see Fig. 1.4.
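The idea of working with ranges rather than fixed values can be illustrated with a
short sketch. It collects the percentages quoted in the paragraph above, groups them
by step, and reports the lowest and highest value per step. The names REPORTED,
effort_ranges, and to_days are hypothetical and introduced only for this
illustration, and the 200-day project total in the usage note is an assumed figure,
not one taken from the cited studies.

```python
from collections import defaultdict

# Effort shares (percent of total project effort) quoted in the text.
# The grouping into (step, percent) pairs is an assumption of this sketch.
REPORTED = [
    ("business objective determination", 20),  # [16]
    ("data preparation", 60),                  # [16]
    ("data mining", 10),                       # [16]
    ("data mining", 15),                       # [10], lower bound
    ("data mining", 25),                       # [10], upper bound
    ("data preparation", 50),                  # commonly assumed value
    ("data mining", 45),                       # [39]
    ("data preparation", 30),                  # [39]
]


def effort_ranges(reported):
    """Return {step: (lowest percent, highest percent)} over all reports."""
    by_step = defaultdict(list)
    for step, percent in reported:
        by_step[step].append(percent)
    return {step: (min(values), max(values)) for step, values in by_step.items()}


def to_days(ranges, total_days):
    """Convert percentage ranges into day ranges for a given project length."""
    return {
        step: (total_days * low / 100, total_days * high / 100)
        for step, (low, high) in ranges.items()
    }


if __name__ == "__main__":
    ranges = effort_ranges(REPORTED)
    for step, (low, high) in to_days(ranges, total_days=200).items():
        print(f"{step}: {low:.0f} to {high:.0f} days")
```

For an assumed 200-day project, the sketch gives 60 to 120 days for data preparation
and 20 to 90 days for data mining, which shows how wide the planning range becomes
once all of the reported figures are taken into account.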
A very important issue is how to carry out the DMKD process without
extensive background knowledge, without manual data manipulation, and without
manual procedures to exchange data between different DM applications. The next
two sections describe technologies that may help in automating the DMKD
process, thus making its implementation easier.
1.3 New Technologies
Automating or, more realistically, semiautomating the DMKD process is a very
complex task. User input is always necessary to perform the entire DMKD task
because only domain experts have the necessary knowledge about the domain and
data. In addition, evaluation of the results, at each DMKD step, is needed. To