Step 3: Data Mining
The third step is the actual data-mining stage, in which the data are analyzed in
order to find patterns or relations. This is done using mathematical algorithms.
Data mining is different from traditional database techniques or statistical methods
because what is being looked for does not necessarily have to be known. Thus,
data mining may be used to discover new patterns or to confirm suspected
relationships. The former is called a 'bottom-up' or 'data-driven' approach,
because it starts with the data and then builds theories on the discovered
patterns. The latter is called a 'top-down' or 'theory-driven' approach, because it
starts with a hypothesis and then checks whether the data are consistent with
that hypothesis. 12
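The difference between the two approaches can be illustrated with a minimal sketch. The attribute names, the values, and the threshold of 0.8 are hypothetical and chosen only for illustration; correlation is used here as just one possible kind of relation, and the sketch assumes that every attribute varies (a constant attribute would cause a division by zero).

```python
from itertools import combinations
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long lists."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical data set: one list of values per attribute.
data = {
    "age": [25, 35, 45, 55, 65],
    "income": [20, 30, 40, 50, 60],
    "shoe_size": [42, 38, 44, 39, 41],
}

def discover(data, threshold=0.8):
    """Bottom-up (data-driven): scan every pair of attributes and
    report the pairs that turn out to be strongly correlated."""
    return [(a, b) for a, b in combinations(data, 2)
            if abs(pearson(data[a], data[b])) >= threshold]

def confirm(data, a, b, threshold=0.8):
    """Top-down (theory-driven): check one hypothesised relationship."""
    return abs(pearson(data[a], data[b])) >= threshold

found = discover(data)                          # no hypothesis needed
supported = confirm(data, "age", "shoe_size")   # hypothesis stated first
```

In the bottom-up call, nothing has to be known in advance about which attributes are related; in the top-down call, the suspected relationship is named first and the data only confirm or reject it.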
There are many different data-mining techniques. The most common types of
discovery algorithms with regard to group profiling are clustering, classification,
and, to some extent, regression. Clustering is used to describe data by forming
groups with similar properties; classification is used to map data into several
predefined classes; and regression is used to describe data with a mathematical
function. Chapter 2 will elaborate on the data-mining techniques.
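A minimal sketch may help to distinguish the three techniques. It is not a description of any particular data-mining product: the function names, the one-dimensional data, and the class labels are hypothetical, and each technique is reduced to its simplest form (a small k-means procedure for clustering, a nearest-example rule for classification, and an ordinary least-squares line for regression).

```python
from statistics import mean

def cluster(values, centres, rounds=10):
    """Clustering: form groups of similar values (a tiny 1-D k-means)."""
    def assign(current):
        groups = [[] for _ in current]
        for v in values:
            nearest = min(range(len(current)), key=lambda i: abs(v - current[i]))
            groups[nearest].append(v)
        return groups

    for _ in range(rounds):
        groups = assign(centres)
        # Move each centre to the mean of its group; keep it if the group is empty.
        centres = [mean(g) if g else c for g, c in zip(groups, centres)]
    return assign(centres)

def classify(value, labelled):
    """Classification: map a value to one of the predefined classes,
    here by copying the label of the nearest labelled example."""
    _, label = min(labelled, key=lambda pair: abs(value - pair[0]))
    return label

def regress(xs, ys):
    """Regression: describe the data with the line y = intercept + slope * x,
    fitted by ordinary least squares."""
    mx, my = mean(xs), mean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return my - slope * mx, slope

# Hypothetical data, for illustration only.
ages = [22, 25, 27, 61, 64, 68]
groups = cluster(ages, centres=[min(ages), max(ages)])
label = classify(26, [(23, "low"), (30, "low"), (60, "high"), (70, "high")])
intercept, slope = regress([1, 2, 3, 4], [3, 5, 7, 9])
```

Note that the clustering function is given no classes in advance and forms the groups itself, whereas the classification function can only return one of the labels that were defined beforehand.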
In data mining, a pattern is a statement that describes relationships in a (sub)set
of data such that the statement is simpler than the enumeration of all the facts in
the (sub)set of data. When a pattern in data is interesting and certain enough for a
user, according to the user's criteria, it is referred to as knowledge. 13 Patterns are
interesting when they are novel (which depends on the user's knowledge), useful
(which depends on the user's goal), and nontrivial to compute (which depends on
the user's means of discovering patterns, such as the available data and the
available people and/or technologies to process the data). For a pattern to be
considered knowledge, a particular degree of certainty is also required. A pattern
is not likely to hold true across all the data. This makes it necessary to express the
certainty of the pattern. Certainty may involve several factors, such as the integrity
of the data and the size of the sample.
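One possible way of expressing the certainty of a pattern is sketched below. It assumes a pattern of the form 'if condition, then outcome' and reports the number of records to which the pattern applies together with the fraction of those records for which it holds. The records, the attribute names, and the function name are hypothetical; other measures of certainty are equally conceivable.

```python
def rule_certainty(records, condition, outcome):
    """Return (sample size, fraction of the sample for which the outcome holds)
    for the pattern 'if condition, then outcome'."""
    matching = [r for r in records if condition(r)]
    if not matching:
        return 0, None  # the pattern applies to no record at all
    hits = sum(1 for r in matching if outcome(r))
    return len(matching), hits / len(matching)

# Hypothetical records, for illustration only.
records = [
    {"area": "north", "late_payment": True},
    {"area": "north", "late_payment": True},
    {"area": "north", "late_payment": False},
    {"area": "south", "late_payment": False},
    {"area": "south", "late_payment": True},
]

sample_size, certainty = rule_certainty(
    records,
    condition=lambda r: r["area"] == "north",
    outcome=lambda r: r["late_payment"],
)
```

Here the pattern holds for two of the three records to which it applies. With so small a sample the stated certainty says very little, which is why the sample size is reported alongside it.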
Step 4: Interpretation
Step 4 in the KDD process is the interpretation of the results of the data-mining
step. The results, mostly statistical, must be transformed into understandable
information, such as graphs, tables, or causal relations. The resulting information
may not be considered knowledge by the user: many relations and patterns that are
found may not be useful in a specific context. The useful information may then be
selected. Which information is selected depends on the questions set forth by
those performing the KDD process.
An important phenomenon that may be mentioned in this context is masking.
When particular characteristics are found to be correlated, it may be possible to
use trivial characteristics as indicators of sensitive characteristics. An example of
this is indirect discrimination using redlining. Originally redlining is the practice
12 SPSS Inc. (1999), p. 6.
13 Adriaans, P. and Zantinge, D. (1996), p. 135.