One major problem when coupling different databases is that not all databases
use the same primary keys to identify objects. For example, suppose that
we want to combine the student database of the example above with a database of
the financial department that registers which students have paid their tuition
fees. It could be the case that the dataset from the financial department
identifies students by their Social Security Number and does not record the
student number. In such a situation, when we want to link both databases, we can
only rely on attributes common to both datasets, such as first name and last
name, and maybe the date of birth. Add to the mix some misspellings or different
conventions on how to treat composite names, such as “Van Hee” versus “Hee, Van”,
and linking the two databases may become a far from trivial problem.
Resolving such linking problems is often called entity resolution, and it often
requires disambiguation. 5
Therefore, the combination of different datasets, which is often a first step in
data analysis, is far from trivial and may itself require the application of data
mining or learning techniques.
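The linking problem described above can be illustrated with a minimal sketch. It assumes two small, made-up record lists, one keyed by student number and one by Social Security Number, and links them on date of birth plus a fuzzy name comparison. The field names, the helper functions `normalize_name`, `similarity` and `link_records`, and the threshold of 0.85 are all hypothetical choices for illustration. Real entity resolution typically needs far more careful normalization and matching.

```python
from difflib import SequenceMatcher

def normalize_name(name: str) -> str:
    """Lower-case a name and undo the 'Hee, Van' ordering convention."""
    name = name.strip().lower()
    if "," in name:
        # "hee, van" -> "van hee"
        main, prefix = [part.strip() for part in name.split(",", 1)]
        name = f"{prefix} {main}"
    return " ".join(name.split())

def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between two names after normalization."""
    return SequenceMatcher(None, normalize_name(a), normalize_name(b)).ratio()

def link_records(students, payments, threshold=0.85):
    """Pair each student with the most similar payment record.

    Two records are candidates only if their dates of birth agree and
    their names are at least `threshold` similar (an assumed cut-off).
    """
    links = []
    for s in students:
        best, best_score = None, threshold
        for p in payments:
            if s["birth_date"] != p["birth_date"]:
                continue
            score = similarity(s["name"], p["name"])
            if score >= best_score:
                best, best_score = p, score
        if best is not None:
            links.append((s["student_number"], best["ssn"], best_score))
    return links

# Made-up illustrative data; no real identifiers.
students = [
    {"student_number": "s001", "name": "Van Hee", "birth_date": "1990-04-12"},
    {"student_number": "s002", "name": "Jansen", "birth_date": "1991-09-30"},
]
payments = [
    {"ssn": "000-00-0001", "name": "Hee, Van", "birth_date": "1990-04-12"},
    {"ssn": "000-00-0002", "name": "Janssen", "birth_date": "1991-09-30"},
]

links = link_records(students, payments)
for student_number, ssn, score in links:
    print(student_number, ssn, round(score, 2))
```

In this sketch the composite name is matched exactly once the ordering convention is undone, while the misspelled name is matched only because its similarity exceeds the assumed threshold. Choosing that threshold is itself a trade-off between missed links and false links.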
2.4 Basic Techniques
In this section, several basic discovery algorithms are explained and the kinds of
group profiles that may result from them are discussed. We do not present a
detailed description, nor do we give an exhaustive enumeration of all methods.
Only the data-mining techniques that may be relevant to group profiling, namely
classification, clustering and pattern mining, are discussed. 6,7 Figure 2.2 illustrates
these types of discovery algorithms.
The purpose of pattern mining is to find patterns, for instance regression
patterns that describe data using a function. In Figure 2.2A, the data is
represented by a linear function. A typical example of a linear relation is the
relation between shoe size and height: in general, the taller a person is, the
larger his or her feet are. Clustering is used to describe data by forming
groups with similar properties. In Figure 2.2B, three different groups are
identified, marked by stars (*), open dots (o) and crosses (x). After
identification, descriptions of these groups may be found, indicated by the
ellipses drawn. Note that the groups may overlap. Classification is used to map
data into several predefined classes. In Figure 2.2C, a predefined class
boundary is drawn (a non-linear curve), creating two classes (one to the left of
the curve and one to the right of the curve). After the class boundary is
defined, each data subject is classified into one of the two classes. Once it is
clear to which class each data subject belongs, it is possible to attach labels,
shown here as crosses (x) and open dots (o). 8
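The three kinds of discovery algorithms described above can be sketched in a few lines each. This is an illustration under stated assumptions, not the method behind Figure 2.2. The data are made up, and the function names `fit_line`, `k_means_1d` and `classify` are hypothetical. Least squares and k-means are just one possible choice for regression and clustering, and the curve y = x² stands in for an arbitrary non-linear class boundary. Note that k-means assigns every value to exactly one group, so unlike the clusters in the text it cannot produce overlapping groups.

```python
from statistics import mean

def fit_line(xs, ys):
    """Regression: least-squares fit of y = slope * x + intercept."""
    mx, my = mean(xs), mean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

def k_means_1d(values, centers, iterations=10):
    """Clustering: assign each value to its nearest center, then move
    every center to the mean of its members, and repeat."""
    centers = list(centers)
    labels = [0] * len(values)
    for _ in range(iterations):
        labels = [min(range(len(centers)), key=lambda i: abs(v - centers[i]))
                  for v in values]
        for i in range(len(centers)):
            members = [v for v, lab in zip(values, labels) if lab == i]
            if members:
                centers[i] = mean(members)
    return centers, labels

def classify(x, y):
    """Classification: label a point by its side of a predefined boundary.
    The curve y = x ** 2 is a hypothetical non-linear boundary."""
    return "x" if y > x ** 2 else "o"

# Made-up data: heights in cm and shoe sizes (taller persons, larger feet).
heights_cm = [160, 170, 180, 190]
shoe_sizes = [38, 40, 42, 44]
slope, intercept = fit_line(heights_cm, shoe_sizes)
print("regression:", slope, intercept)

# Made-up data: nine values that fall into three groups.
values = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8, 9.0, 9.2, 8.8]
centers, labels = k_means_1d(values, centers=[0.0, 5.0, 10.0])
print("clustering:", centers, labels)

# Each point receives exactly one label, so the classes cannot overlap.
print("classification:", classify(1, 2), classify(2, 1))
```

The difference in inputs mirrors the text: regression and clustering derive their description (a line, a set of groups) from the data, whereas classification starts from a boundary that is fixed in advance and only assigns labels.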
5 For more on this problem, see also Subsection 2.5.2 and Chapter 10.
6 Fayyad, U.M., Piatetsky-Shapiro, G. and Smyth, P. (1996a).
7 Fayyad, U.M., Piatetsky-Shapiro, G. and Smyth, P. (1996b).
8 Note that overlap is not possible in the case of classification.