The ultimate method of protection against attribute disclosure is based on the
idea that the original data is replaced, in its entirety, by a synthetic dataset with
the same statistical properties (e.g., mean, variance, etc.) as those of the original
dataset. (Muralidhar and Sarathy 2008) present a method which, besides
preserving the mean vector and the covariance matrix, also guarantees
similarity of the synthetic confidential values to the original confidential values.
This somewhat radical approach may encounter resistance in applications in
which veracity of the data is important, e.g., in medical research. On the other
hand, it may be acceptable in areas where use of aggregated data is already the
norm, e.g., in large-scale social science research.
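The core idea can be sketched with a simple moment-matching generator: estimate the mean vector and covariance matrix of the original data and sample an entirely new table from a Gaussian with those parameters. This is only a minimal illustration of the idea, not the method of Muralidhar and Sarathy; the attribute values below are invented for the example.

```python
import numpy as np

# Toy "original" confidential table: 500 records, 3 numeric attributes
# (the values are synthetic stand-ins for illustration only).
rng = np.random.default_rng(0)
original = rng.normal(loc=[50.0, 10.0, 0.5], scale=1.0, size=(500, 3))

# First- and second-order statistics to be preserved in the release.
mean = original.mean(axis=0)
cov = np.cov(original, rowvar=False)

# Fully synthetic release: draw new records from a Gaussian with the
# same mean vector and covariance matrix; no original record is published.
synthetic = rng.multivariate_normal(mean, cov, size=500)

# The released statistics match the originals up to sampling error.
print(synthetic.mean(axis=0).round(1))  # close to [50. 10. 0.5]
```

Any analysis that depends only on first- and second-order statistics (means, variances, correlations, linear regression coefficients) gives approximately the same answer on the synthetic table as on the original one, which is exactly the utility this family of methods aims to preserve.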
A number of attribute disclosure attacks, and methods to protect against them,
have been described in the literature. We can mention here (Loukides, Gkoulalas-
Divanis et al. 2011), (Martin, Kifer et al. 2007) and (Chen, LeFevre et al. 2007).
The generality of these attacks is questionable, and defending against each of
them leads to high-granularity privacy protection approaches in which multiple
transformations are applied to the data, resulting in a potentially significant
decrease in data quality while still leaving the resulting data vulnerable to novel
privacy attacks not yet known or described in the literature. This is analogous to
multi-layered anti-virus patches, which may themselves open vulnerabilities to
novel, as yet unknown viruses to come in the future.
11.4 Privacy of Decentralized Data
As described in Sec. 1, we address here an important scenario in which the
ownership of the data in T is shared among multiple parties who want to obtain a
meaningful data mining result of interest to all of them. This is a frequent
phenomenon: groups of users may be interested in performing data mining on the
union of their data, but cannot share the data for legal or commercial
(competitive) reasons. We then say that the data is partitioned. As shown in Fig. 3,
the partitioning may be either vertical or horizontal. In vertical partitioning, all
the parties have data referring to the same instances, but each party holds a
different subset of the attributes describing those instances. An example of such a
situation is a scenario in which one wants to perform extensive association rule
mining on a dataset describing vehicles involved in certain types of accidents.
The data (attributes) pertaining to the performance of different subcomponents
(tires, engine, brakes) belong to different manufacturers, who do not want to
share them with the others but are interested in the results. In the horizontal
scenario, different parties have different subsets of the instances, but they all
have the same attributes. An example of such a situation is a medical study
performed jointly by a number of hospitals. Each hospital may have its own
limited set of patients participating in the study, but results drawn from the much
larger union of all the hospitals' data will achieve a much higher level of
credibility. Finally, mixed horizontal-vertical scenarios are also possible.
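The two partitioning schemes can be made concrete with a small sketch. The table, party names, and attribute names below are invented for illustration; the only assumptions are that vertically partitioned parties share a common join key (here, column 0) and that horizontally partitioned parties hold disjoint row subsets.

```python
# Hypothetical joint table: each row is one instance (a vehicle),
# each column an attribute. Names and values are made up.
columns = ["vehicle_id", "tire_wear", "engine_temp", "brake_response"]
table = [
    ["v1", 0.7, 92.0, 0.31],
    ["v2", 0.4, 88.5, 0.45],
    ["v3", 0.9, 95.2, 0.28],
]

def vertical_partition(rows, col_names, assignment):
    """Each party holds ALL instances but only its own attribute
    subset, plus the shared identifier in column 0."""
    parts = {}
    for party, cols in assignment.items():
        idx = [0] + [col_names.index(c) for c in cols]
        parts[party] = [[row[i] for i in idx] for row in rows]
    return parts

def horizontal_partition(rows, assignment):
    """Each party holds a disjoint subset of instances with ALL attributes."""
    return {party: [rows[i] for i in idxs] for party, idxs in assignment.items()}

# Vertical: one attribute per manufacturer, as in the vehicle example.
vert = vertical_partition(table, columns, {
    "tire_maker": ["tire_wear"],
    "engine_maker": ["engine_temp"],
    "brake_maker": ["brake_response"],
})

# Horizontal: disjoint patient/row subsets, as in the hospital example.
horiz = horizontal_partition(table, {"site_A": [0, 1], "site_B": [2]})

print(vert["tire_maker"])  # [['v1', 0.7], ['v2', 0.4], ['v3', 0.9]]
print(horiz["site_B"])     # [['v3', 0.9, 95.2, 0.28]]
```

Note the symmetry: in the vertical case every party sees every instance but only some columns, while in the horizontal case every party sees every column but only some rows. Privacy-preserving protocols for the two cases therefore differ in what must be hidden: attribute values of shared instances versus entire records.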