any technique that requires knowledge of other user records becomes infeasible, and
therefore the B2B approaches cannot be applied here.
The bulk of the work in privacy-preserving data mining of association rules has
addressed the B2C environment (e.g., [7, 6, 17, 45]), where the user's true data have
to be anonymized at the source itself. Note that the anonymization process has to be
implemented by a program which could be supplied either by the service provider
or, more likely, by an independent trusted third-party vendor. Further, this program
has to be verifiably secure—therefore, it must be simple in construction, eliminating
the possibility of the true data being surreptitiously supplied to the service provider.
In a nutshell, the goal of these techniques is to ensure the privacy of the raw local
data at the source, but, at the same time, to support accurate reconstruction of the
global data mining models at the destination.
Within the above framework, the general approach has been to adopt a data
perturbation strategy, wherein each individual user's true data are altered in some
manner before being forwarded to the service provider. Here, there are two possibilities: statistical distortion, which has been the predominant technique, and algebraic distortion, proposed in [57]. In the statistical approach, a common randomizing algorithm is employed at all user sites, and this algorithm is subsequently disclosed to
the eventual data miner. For example, in the MASK technique [45], which is targeted at "market-basket"-style sparse boolean databases, each bit in the true user transaction vector is independently flipped with a parametrized probability.
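The bit-flipping step can be sketched in a few lines (the retention-probability parameter `p` and the function name are illustrative; [45] analyzes how to tune this parameter to balance privacy against mining accuracy):

```python
import random

def mask_perturb(transaction, p):
    """MASK-style distortion: keep each bit of a boolean transaction
    vector with probability p, flip it with probability 1 - p."""
    return [bit if random.random() < p else 1 - bit for bit in transaction]

# A sparse "market-basket" row: 1 means the item appears in the basket.
true_row = [1, 0, 0, 1, 0, 0, 0, 1]
noisy_row = mask_perturb(true_row, p=0.9)
```

Because the same parameter is used at every user site and disclosed to the miner, the miner can statistically invert the distortion to estimate itemset supports over the aggregate without learning any individual's true row.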
While there is only one-way communication from users to the service provider in
the statistical approach, the algebraic scheme, in marked contrast, requires two-way
communication between the data miner and the user. Here, the data miner supplies
a user-specific perturbation vector, and the user then returns the perturbed data after
applying this vector on the true data, discretizing the output and adding some noise.
The vector depends on the current contents of the perturbed database held by the miner; consequently, for large enterprises, the data collection process itself could become a bottleneck in the efficient running of the system.
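The round trip can be schematized as follows; this is only a structural sketch of "apply vector, discretize, add noise" (the multiplicative form, the rounding step, and the noise model are our illustrative assumptions, not the actual transformation of [57]):

```python
import random

def algebraic_perturb(true_vec, perturb_vec):
    """Structural sketch of the two-way scheme: the miner sends a
    user-specific perturbation vector; the user applies it to the true
    data, discretizes the result, and adds noise before replying."""
    out = []
    for x, a in zip(true_vec, perturb_vec):
        v = round(x * a)                # apply the vector, then discretize
        v += random.choice([-1, 0, 1])  # add a small amount of noise
        out.append(v)
    return out

# Miner -> user: a vector tailored to this user; user -> miner: perturbed data.
reply = algebraic_perturb([1, 0, 1, 1], [0.8, 1.2, 0.5, 1.0])
```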
Within the statistical approach, there are two further possibilities: (a) a simple independent attribute perturbation, wherein the value of each attribute in the user record is perturbed independently of the rest; or (b) a more generalized dependent attribute perturbation, where the perturbation of each attribute may be affected by the perturbations of the other attributes in the record. Most of the statistical perturbation
techniques in the literature, including [18, 17, 45], fall into the independent attribute perturbation category. Notice, however, that this is in a sense antithetical to the original goal of association rule mining, which is to identify correlations across attributes. This limitation is addressed in [4], which employs a dependent attribute
perturbation model, with each attribute in the user's data vector being perturbed
based on its own value, as well as the perturbed values of the earlier attributes.
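A minimal sketch of such dependent perturbation over boolean attributes follows (the specific bias rule and all parameter names are hypothetical, not the model of [4]):

```python
import random

def dependent_perturb(record, p_keep, p_boost):
    """Perturb each bit based on its own value and the *perturbed*
    value of the preceding attribute: the retention probability is
    p_keep, raised by p_boost when the previous perturbed bit was 1."""
    out, prev = [], 0
    for bit in record:
        p = min(1.0, p_keep + (p_boost if prev == 1 else 0.0))
        out.append(bit if random.random() < p else 1 - bit)
        prev = out[-1]
    return out
```

The essential point, mirrored in [4], is that the coin flipped for attribute i is conditioned on the perturbed outputs of the earlier attributes, so cross-attribute correlations are not destroyed wholesale.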
Another model of privacy-preserving data mining is the k-anonymity model [7, 46], where each record value is replaced with a corresponding generalized value. Specifically, each perturbed record cannot be distinguished from at least k − 1 other records in the data. However, this falls into the B2C model since the intermediate database-forming server can learn, or recover, precise records.
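The generalization idea can be illustrated with a toy example (the age-bucketing scheme and helper names are ours): exact values are coarsened until every record matches at least k − 1 others.

```python
from collections import Counter

def generalize_age(age, width=10):
    """Replace an exact age with a width-year interval, e.g. 34 -> '30-39'."""
    lo = (age // width) * width
    return f"{lo}-{lo + width - 1}"

def is_k_anonymous(records, k):
    """True if every (generalized) record occurs at least k times."""
    counts = Counter(tuple(r) for r in records)
    return all(c >= k for c in counts.values())

rows = [(generalize_age(a), sex) for a, sex in
        [(34, "F"), (38, "F"), (21, "M"), (27, "M")]]
# rows: [('30-39', 'F'), ('30-39', 'F'), ('20-29', 'M'), ('20-29', 'M')]
```

Unlike the randomization schemes above, the server performing this generalization sees the precise records first, which is exactly the B2C caveat noted in the text.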