that not all data mining methods can deal with all types of data, and
missing values are a notorious example of a reality that many
algorithms have difficulty dealing with. Missing value imputation
techniques circumvent this problem by filling in each missing field with
an appropriate substitute value. Ideally, the imputed values
should not significantly disturb the overall distribution of the data,
so that the final outcome of the data mining
process is not affected by them.
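As an illustration of the imputation idea described above, the following minimal sketch fills missing entries of a single numeric attribute with the mean or median of the observed values. The function name `impute_missing`, the use of `None` to mark a missing value, and the sample ages are assumptions made for this example only; the text does not prescribe a particular imputation method.

```python
from statistics import mean, median

def impute_missing(values, strategy="mean"):
    """Replace None entries with the mean or median of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    if strategy == "mean":
        fill = mean(observed)
    elif strategy == "median":
        fill = median(observed)
    else:
        raise ValueError(f"unknown strategy: {strategy!r}")
    # Observed values are kept as they are; only the gaps are filled.
    return [fill if v is None else v for v in values]

# Hypothetical attribute "age" with two missing entries.
ages = [23, None, 31, 40, None, 26]
imputed = impute_missing(ages, strategy="mean")
```

Mean imputation leaves the mean of the attribute unchanged but reduces its variance, so it only partly meets the goal of not disturbing the distribution of the data.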
Dimensionality reduction : Often the attributes in a dataset are highly
inter-correlated and redundant. Consider, for example, a dataset used to learn to
distinguish spam email from regular mail. Suppose that the dataset records,
for every mail and for every word that appears in any of the mails,
whether the word appears in that mail and, if so, how many times. Such a
dataset would have a tremendous dimensionality, leading to very long
running times and very complex models that are difficult for a specialist
to interpret. Dimensionality reduction techniques deal with this problem
by transforming the data into a lower-dimensional space.
Ideally, objects that are close to each other in the lower-dimensional space
are also close in the high-dimensional space, and vice versa. In the spam
example, one dimension in the reduced space could indicate whether the mail
contains many “medicine-related” words, such as “Viagra”, “aspirin”, and “pain”.
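The spam example can be sketched in a few lines. The text does not name a specific technique, so this sketch assumes principal component analysis, with the first component approximated by power iteration on the covariance matrix. The word list, the counts, and all function names are hypothetical and serve only as illustration.

```python
import math

# Hypothetical word counts: one row per mail, one column per word.
words = ["viagra", "aspirin", "pain", "meeting", "report"]
counts = [
    [3, 2, 4, 0, 0],  # mails 0 and 1 use mostly medicine-related words
    [4, 3, 2, 0, 1],
    [0, 0, 1, 3, 4],  # mails 2 and 3 use mostly work-related words
    [0, 1, 0, 4, 3],
]

def center(rows):
    """Subtract the column mean from every entry."""
    means = [sum(col) / len(rows) for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]

def covariance(centered):
    """Sample covariance matrix of the (already centered) columns."""
    n, d = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]

def first_component(cov, iterations=200):
    """Approximate the dominant eigenvector of cov by power iteration."""
    d = len(cov)
    v = [1.0 / (i + 1) for i in range(d)]  # fixed, arbitrary start vector
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break  # degenerate case: cov maps the current vector to zero
        v = [x / norm for x in w]
    return v

centered = center(counts)
component = first_component(covariance(centered))
# Project every mail onto the component: five counts become one score.
scores = [sum(x * c for x, c in zip(row, component)) for row in centered]
```

In this toy data the single score separates the two groups of mails: the medicine-heavy mails receive scores of one sign and the work-related mails scores of the other. The overall sign of a principal component is arbitrary, so only the relative positions of the scores carry meaning.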
Feature extraction and construction : A last type of preprocessing
technique is feature extraction: the process of constructing new features or
attributes from combinations of attributes already present in the
dataset. Examples are transforming an attribute date-of-birth into an
attribute age, which could be much more informative for the learning
algorithm, or combining two attributes, height and weight, into a new
one, the body-mass index.
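The two examples above can be written down directly. This is a minimal sketch; the record layout, the field names, the units (metres and kilograms), and the reference day are assumptions chosen for illustration.

```python
from datetime import date

def age_in_years(date_of_birth, today):
    """Number of completed years between date_of_birth and today."""
    had_birthday = (today.month, today.day) >= (date_of_birth.month, date_of_birth.day)
    return today.year - date_of_birth.year - (0 if had_birthday else 1)

def body_mass_index(height_m, weight_kg):
    """Weight in kilograms divided by the square of the height in metres."""
    return weight_kg / height_m ** 2

# Hypothetical record with the original attributes.
record = {"date_of_birth": date(1990, 6, 15), "height_m": 1.80, "weight_kg": 81.0}
reference_day = date(2024, 6, 14)  # one day before the 34th birthday

# Constructed features derived from the original attributes.
features = {
    "age": age_in_years(record["date_of_birth"], reference_day),
    "bmi": body_mass_index(record["height_m"], record["weight_kg"]),
}
```

The reference day is passed in explicitly rather than read from the system clock, so that the constructed age is reproducible when the dataset is processed again later.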
2.5.2 Database Coupling
Database coupling may enhance the possibilities of data mining. When the
underlying database is larger, more relations may be found than in separate
databases. Figure 2.5 illustrates this, showing two very small databases. For large
databases, the coupling of two databases may result in twice as many (dual)
relationships as when the databases are not coupled. 25 This form of database
25 For the mathematicians, two separate databases of size $n$ can together make up $2\binom{n}{2} = n(n-1)$
dual relations, whereas the coupled database of size $2n$ can make up
$\binom{2n}{2} = n(2n-1)$ dual relations. For
$n \to \infty$, the quotient in the number of dual relations, $\frac{n(2n-1)}{n(n-1)} = \frac{2n-1}{n-1}$, can be calculated using basic
mathematics and results in a factor of 2.