Data Dilemmas in the Information Society: Introduction and Overview - Discrimination and Privacy in the Information Society

Step 3: Data Mining

The third step is the actual data-mining stage, in which the data are analyzed in

order to find patterns or relations. This is done using mathematical algorithms.

Data mining is different from traditional database techniques or statistical methods

because what is being looked for does not necessarily have to be known in advance. Thus,

data mining may be used to discover new patterns or to confirm suspected

relationships. The former is called a 'bottom-up' or 'data-driven' approach,

because it starts with the data and then builds theories on the discovered patterns.

The latter is called a 'top-down' or 'theory-driven' approach, because it starts

with a hypothesis and then checks whether the data are consistent with that

hypothesis. 12
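
The contrast between the two approaches can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy records, the column names, the use of Pearson correlation as the measure of a "relation", and the 0.8 threshold are all hypothetical choices, not something prescribed by the text.

```python
from itertools import combinations
from math import sqrt

# Hypothetical toy data set: each key is an attribute, each list a column.
RECORDS = {
    "age": [23, 35, 47, 52, 61, 30],
    "income": [21, 34, 48, 50, 63, 29],
    "shoe_size": [42, 39, 44, 40, 41, 43],
}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long columns."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def strongest_relation(columns):
    """Bottom-up (data-driven): scan every attribute pair, keep the strongest."""
    pairs = combinations(sorted(columns), 2)
    return max(pairs, key=lambda p: abs(pearson(columns[p[0]], columns[p[1]])))


def hypothesis_holds(columns, a, b, threshold=0.8):
    """Top-down (theory-driven): check one relation that was stated in advance."""
    return abs(pearson(columns[a], columns[b])) >= threshold


discovered = strongest_relation(RECORDS)                         # nothing assumed beforehand
confirmed = hypothesis_holds(RECORDS, "age", "income")           # suspected relation
rejected = hypothesis_holds(RECORDS, "age", "shoe_size")         # another suspected relation
```

In the bottom-up call nothing is specified beforehand except the data; in the top-down calls the analyst names the relation first and the data only confirm or refute it.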

There are many different data-mining techniques. The most common types of

discovery algorithms with regard to group profiling are clustering, classification,

and, to some extent, regression. Clustering is used to describe data by forming

groups with similar properties; classification is used to map data into several

predefined classes; and regression is used to describe data with a mathematical

function. Chapter 2 will elaborate on these data-mining techniques.
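
To make the three techniques concrete, the following minimal sketch shows one deliberately simple instance of each. The function names, the gap-based grouping rule, the nearest-centre classifier, and the sample values are all assumptions chosen for illustration; real data-mining tools use far more elaborate algorithms.

```python
def cluster_1d(values, gap):
    """Clustering: form groups of similar values.

    Simplified rule (an assumption): sorted neighbours that differ by more
    than `gap` start a new group. No classes are defined beforehand.
    """
    groups = []
    for value in sorted(values):
        if groups and value - groups[-1][-1] <= gap:
            groups[-1].append(value)
        else:
            groups.append([value])
    return groups


def classify(value, class_centres):
    """Classification: map a value into one of several predefined classes.

    Here the classes are given as {label: typical value}, and the value is
    assigned to the class whose typical value is nearest.
    """
    return min(class_centres, key=lambda label: abs(value - class_centres[label]))


def fit_line(xs, ys):
    """Regression: describe the data with a function y = slope * x + intercept,
    fitted by ordinary least squares."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    return slope, mean_y - slope * mean_x


groups = cluster_1d([1, 2, 3, 10, 11, 25], gap=3)
label = classify(9, {"low": 2, "mid": 10.5, "high": 25})
slope, intercept = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
```

The key difference is visible in the inputs: clustering receives only the data, classification additionally receives the predefined classes, and regression returns the parameters of a function rather than groups.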

In data mining, a pattern is a statement that describes relationships in a (sub)set

of data such that the statement is simpler than the enumeration of all the facts in

the (sub)set of data. When a pattern in data is interesting and certain enough for a

user, according to the user's criteria, it is referred to as knowledge. 13 Patterns are

interesting when they are novel (which depends on the user's knowledge), useful

(which depends on the user's goal), and nontrivial to compute (which depends on

the user's means of discovering patterns, such as the available data and the

available people and/or technologies to process the data). For a pattern to be

considered knowledge, a certain level of certainty is also required. A pattern is not

likely to be true across all the data. This makes it necessary to express the

certainty of the pattern. Certainty may involve several factors, such as the integrity

of the data and the size of the sample.
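
One way to express such certainty can be sketched as follows. The two measures used here, the share of records a pattern applies to and the share of those for which it actually holds, are common choices but are an assumption of this sketch, as are the function name and the toy customer records.

```python
def pattern_certainty(records, condition, outcome):
    """Return (coverage, confidence) for the pattern 'if condition then outcome'.

    coverage   = fraction of all records the pattern applies to (sample size)
    confidence = fraction of those records for which the outcome holds
    """
    matching = [r for r in records if condition(r)]
    if not matching:
        return 0.0, 0.0
    hits = sum(1 for r in matching if outcome(r))
    return len(matching) / len(records), hits / len(matching)


# Hypothetical records: (age, bought_product)
CUSTOMERS = [
    (25, True), (28, True), (22, True), (29, False),
    (41, False), (55, False), (38, True), (60, False),
]

coverage, confidence = pattern_certainty(
    CUSTOMERS,
    condition=lambda r: r[0] < 30,   # "customer is under 30"
    outcome=lambda r: r[1],          # "customer bought the product"
)
```

In this toy data the pattern "customers under 30 buy the product" applies to half of the records and holds for three quarters of them, so it is not true across all the data and should be reported together with both figures.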

Step 4: Interpretation

Step 4 in the KDD process is the interpretation of the results of the data-mining

step. The results, mostly statistical, must be transformed into understandable

information, such as graphs, tables, or causal relations. The resulting information

may not be considered knowledge by the user: many relations and patterns that are

found may not be useful in a specific context. A selection may be made of useful

information. What information is selected depends on the questions set forth by

those performing the KDD process.

An important phenomenon that may be mentioned in this context is masking.

When particular characteristics are found to be correlated, it may be possible to

use trivial characteristics as indicators of sensitive characteristics. An example of

this is indirect discrimination using redlining. Originally redlining is the practice

12 SPSS Inc. (1999), p. 6.

13 Adriaans, P. and Zantinge, D. (1996), p. 135.
