What Is Data Mining and How Does It Work? - Discrimination and Privacy in the Information Society


that not all data mining methods can deal with all types of data, and missing values are a notorious example of a reality that many algorithms have difficulty handling. Missing value imputation techniques circumvent this problem by filling in each missing field with an appropriate substitute value. Ideally, the imputed values do not significantly disturb the overall distribution of the data, so that the final outcome of the data mining process is not affected by them.

•

Dimensionality reduction: Often the attributes in a dataset are highly inter-correlated and redundant. Consider, for example, a dataset for learning to distinguish spam email from regular mail. Suppose that the dataset records, for every mail and for every word that appears in any of the mails, whether the word appears in that mail and, if so, how many times. Such a dataset would have a tremendous dimensionality, leading to very long running times and very complex models that are difficult for a specialist to interpret. Dimensionality reduction techniques deal with this problem by transforming the data into a lower-dimensional space. Ideally, objects that are close to each other in the lower-dimensional space are also close in the high-dimensional space, and vice versa. In the spam example, one dimension in the reduced space could indicate whether the mail contains many “medicine-related” words, such as “Viagra”, “aspirin”, “pain”, etc.

•

Feature extraction and construction: A last type of preprocessing technique is feature extraction, the process of constructing new features or attributes from combinations of attributes already present in the dataset. An example would be to transform an attribute date-of-birth into an attribute age, which could be much more informative for the learning algorithm, or to combine two attributes, height and weight, into a new one, the body-mass index.
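Two of the preprocessing steps above, missing value imputation and feature construction, can be sketched in a few lines of Python. This is only a minimal illustration, not a method prescribed by the text: the record layout, the field names (`date_of_birth`, `height_m`, `weight_kg`), the reference date, and the choice of mean imputation are all assumptions made for the example.

```python
from datetime import date
from statistics import mean


def impute_mean(values):
    """Replace None entries with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute: no observed values")
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def age_in_years(date_of_birth, today):
    """Derive the attribute age (completed years) from date-of-birth."""
    had_birthday = (today.month, today.day) >= (date_of_birth.month, date_of_birth.day)
    return today.year - date_of_birth.year - (0 if had_birthday else 1)


def body_mass_index(weight_kg, height_m):
    """Combine weight and height into one attribute: weight / height^2."""
    return weight_kg / height_m ** 2


# Hypothetical records; the weight of the second person is missing.
records = [
    {"date_of_birth": date(1980, 5, 17), "height_m": 1.80, "weight_kg": 81.0},
    {"date_of_birth": date(1992, 11, 3), "height_m": 1.65, "weight_kg": None},
    {"date_of_birth": date(1975, 1, 30), "height_m": 1.70, "weight_kg": 69.0},
]

reference_day = date(2020, 6, 1)  # fixed (assumed) date so the result is reproducible

# Step 1: fill in the missing weight.
weights = impute_mean([r["weight_kg"] for r in records])

# Step 2: construct the new attributes age and BMI.
features = [
    {
        "age": age_in_years(r["date_of_birth"], reference_day),
        "bmi": round(body_mass_index(w, r["height_m"]), 1),
    }
    for r, w in zip(records, weights)
]
```

Mean imputation leaves the mean of the attribute unchanged but reduces its variance, so it only approximates the ideal of not disturbing the overall distribution; other substitute values may be more appropriate depending on the data.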

2.5.2 Database Coupling

Database coupling may enhance the possibilities of data mining. When the underlying database is larger, more relations may be found than in the separate databases. Figure 2.5 illustrates this with two very small databases. For large databases, coupling two databases may yield twice as many (dual) relations as when the databases are not coupled. 25 This form of database





25 For the mathematicians, two separate databases of size $n$ can make up $2\binom{n}{2}$ dual relations, whereas the coupled database of size $2n$ can make up $\binom{2n}{2}$ dual relations. For $n \to \infty$, the quotient of the numbers of dual relations can be calculated using basic mathematics and results in a factor of 2.
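The limit in this footnote can be checked numerically. The sketch below assumes that a dual relation is an unordered pair of items within one database, so that a database of size n contains "n choose 2" of them; under that assumption the quotient simplifies to (2n - 1)/(n - 1). The function names are hypothetical and chosen only for this illustration.

```python
from math import comb


def dual_relations_separate(n):
    """Pairs available in two separate databases of size n each."""
    return 2 * comb(n, 2)


def dual_relations_coupled(n):
    """Pairs available in one coupled database of size 2n."""
    return comb(2 * n, 2)


def coupling_ratio(n):
    """Quotient of coupled over separate; equals (2n - 1) / (n - 1) for n >= 2."""
    return dual_relations_coupled(n) / dual_relations_separate(n)


# The ratio starts above 2 for small n and approaches 2 as n grows.
ratios = {n: coupling_ratio(n) for n in (2, 10, 100, 10_000)}
```

For very small databases the gain from coupling is larger than a factor of 2, which is why the factor of 2 is stated only for large databases.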
