the rule itself is possible, thus it is assigned an RF value of 1; otherwise RF = 0.
However, this measure is not exact since, for instance, an adversary may not
actually learn an itemset despite knowing its subsets.
Bertino et al. [12] propose a set of measures that directly relate the performance
of a hiding algorithm to external parameters. These process performance measures
are clustered into four categories, as follows:
(a) Efficiency. This category consists of measures that quantify the ability of a
privacy preserving algorithm to efficiently use the available resources and execute
with good performance. Efficiency is measured in terms of CPU-time, space
requirements (related to the memory usage and the required storage capacity)
and communication requirements.
(b) Scalability. This category consists of measures that evaluate how effectively the
privacy preserving technique handles increasing sizes of the data from which
information needs to be mined and privacy needs to be ensured. Scalability
is measured based on the decrease in the performance of the algorithm or the
increase of the storage requirements along with the communications cost (if in
a distributed setting), when the algorithm is provided with larger datasets.
(c) Data Quality. The data quality of a privacy preservation algorithm depends on
two factors: the quality of the dataset after the sanitization process, and
the quality of the data mining results obtained from this dataset, compared
to those attained when using the original dataset. Among the various possible
measures for the quantification of data quality, the most preferable are: (i)
accuracy , which measures the proximity of a sanitized value to the original one
and is closely related to the information loss resulting from the hiding strategy,
(ii) completeness , which is used to evaluate the degree of missed data in the
sanitized database and (iii) consistency , which is related to the relationships
that must continue to hold among the different fields of a data item or among
data items in a sanitized database. Examples of data quality measures are Diss
(presented earlier) and Kullback-Leibler (KL) divergence.
(d) Privacy Level. This category consists of measures that estimate the degree of
uncertainty according to which, the protected information can still be predicted.
Measures such as the information entropy, the level of privacy, and the J-measure
[12] are some of the possible metrics that one can apply to quantify the
privacy level attained by a hiding scheme.
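To make two of the measures named above concrete, the following sketch computes the Kullback-Leibler divergence (a data quality measure) between hypothetical itemset-support distributions before and after sanitization, and the Shannon entropy (a privacy-level measure) of a hypothetical attribute column. All data values here are illustrative assumptions, not taken from the text.

```python
import math
from collections import Counter

def kl_divergence(p, q):
    """Kullback-Leibler divergence D(P || Q), in bits, between two
    discrete probability distributions; 0 when they coincide."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of the
    given values; higher entropy means greater adversarial uncertainty."""
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# Hypothetical itemset-support distributions before and after hiding:
# the closer the sanitized distribution stays to the original, the
# smaller the KL divergence, i.e., the lower the information loss.
original_supports = [0.4, 0.3, 0.2, 0.1]
sanitized_supports = [0.35, 0.35, 0.2, 0.1]
print(kl_divergence(original_supports, sanitized_supports))

# Hypothetical sensitive attribute column: the sanitized version is
# more uniform, so its entropy (privacy level) is higher.
original_column = ["a", "a", "a", "a", "b"]
sanitized_column = ["a", "a", "b", "b", "c"]
print(entropy(original_column), entropy(sanitized_column))
```

A hiding algorithm with good data quality keeps the KL divergence near zero, while a high privacy level corresponds to high residual entropy over the protected values.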
4 Cryptographic Methods
Over the years, many data mining protocols have been designed to mine distributed
data that reside in different data warehouses. In those protocols, data are generally
assumed to be either vertically or horizontally partitioned. Table 15.1 shows a trivial
example of two different data partitioning schemes for a simple transaction (binary)
dataset U , consisting of four attributes.
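The two partitioning schemes can be sketched as follows; the binary dataset and the partition boundaries below are illustrative assumptions, not the actual contents of Table 15.1.

```python
# Hypothetical binary transaction dataset U with four attributes
# (call them A, B, C, D); each row is one transaction.
U = [
    # A, B, C, D
    [1, 0, 1, 0],
    [0, 1, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 1],
]

# Horizontal partitioning: each site holds a subset of the
# transactions over the full attribute set.
site1_horizontal = U[:2]
site2_horizontal = U[2:]

# Vertical partitioning: each site holds all transactions but only a
# subset of the attributes (a shared transaction id would be needed
# in practice to join the two halves).
site1_vertical = [row[:2] for row in U]  # attributes A, B
site2_vertical = [row[2:] for row in U]  # attributes C, D

# The partitions jointly reconstruct U.
assert site1_horizontal + site2_horizontal == U
assert [a + b for a, b in zip(site1_vertical, site2_vertical)] == U
```

Distributed mining protocols differ accordingly: with horizontal partitioning each party can compute local itemset counts that are then combined, whereas with vertical partitioning no single party can even evaluate an itemset spanning both attribute subsets on its own.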