Fig. 1.1 Steps in the KDD process
Step 1: Data Collection
The first step in the KDD process is the collection of data. In the case of
information about individuals, this may be done explicitly, for instance, by asking
people for their personal data, or non-explicitly, for instance, by using databases
that already exist, albeit sometimes for other purposes. The information requested
usually consists of name, address and e-mail address. Depending on the purpose
for which the information will be used, additional information may be required,
such as credit card number, occupation, hobbies, date of birth, fields of interest,
medical data, etc.
Inquiries are very commonly used to obtain information, and completing them is
often mandatory in order to obtain a product, service, or price reduction. In this
way, a take-it-or-leave-it situation is created, in which the consumer often has no
choice but to fill in his or her personal data.[10] In most cases, the user is notified of the
fact that privacy regulations are applied to the data. However, research shows that
data collectors do not always keep this promise, especially in relation to
information obtained on the Internet.[11] The same research also shows that
customers are often not informed about the use that is made of the information,
and that, in general, much more information is requested than is needed, mainly
because it is thought that such data may be useful in the future.
Step 2: Data Preparation
In the second step of the KDD process, the data is prepared by rearranging and
ordering it. Sometimes, it is desirable that the data be aggregated. For instance, zip
codes may be aggregated into regions or provinces, ages may be aggregated into
five-year categories, or different forms of cancer may be aggregated into one
disease group. In this stage, a selection is often made of the data that may be
useful for answering the questions posed. In some cases, however, it may be more
efficient to make such a selection even earlier, in the data collection phase. The
type of data and the structure and dimension of the database determine the range
of data-mining tools that may be applied. This may be taken into account in
selecting which of the available data will be used for data mining.
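The aggregation and selection described in this step can be illustrated with a minimal sketch in Python. It is not taken from any particular KDD tool. The lookup tables, field names, region labels, and sample records are hypothetical and serve only to show zip codes being mapped to regions, ages to five-year categories, and specific diagnoses to one disease group, while fields not needed for the analysis (here, the name) are dropped.

```python
# Hypothetical lookup tables; in practice these would come from the data owner.
ZIP_PREFIX_TO_REGION = {"10": "North", "20": "South"}
DISEASE_GROUPS = {
    "lung cancer": "cancer",
    "melanoma": "cancer",
    "leukemia": "cancer",
}


def age_band(age, width=5):
    """Map an exact age to a category such as '30-34'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def aggregate_record(record):
    """Return a coarser copy of one record, keeping only the selected fields."""
    return {
        # The first two zip code digits stand in for a region (an assumption).
        "region": ZIP_PREFIX_TO_REGION.get(record["zip"][:2], "unknown"),
        "age_band": age_band(record["age"]),
        # Diagnoses without a known group are kept unchanged.
        "disease_group": DISEASE_GROUPS.get(record["diagnosis"], record["diagnosis"]),
    }


# Hypothetical collected records (Step 1 output).
records = [
    {"name": "A", "zip": "1012", "age": 33, "diagnosis": "melanoma"},
    {"name": "B", "zip": "2045", "age": 67, "diagnosis": "asthma"},
]

# Prepared data (Step 2 output): aggregated values, name field not selected.
prepared = [aggregate_record(r) for r in records]
```

How coarse the categories should be is a design choice that depends on the questions to be answered and on the data-mining tools that will be applied afterwards.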
[10] These take-it-or-leave-it options are sometimes referred to as conditional offers.
[11] Artz, M.J.T. and Eijk, M.M.M. van (2000).