being chosen from [1, K_m], rather than from [1, m]. If it turns out that j > m, then j is
set to m (which means that the entire original transaction is copied to the randomized
transaction). Apart from the cutoff threshold, another difference between C&P and
SaS is that the subsequent ρ_m randomized insertion (Step 3 above) is carried out
on (a) the items that are not present in the true transaction (as in SaS), and (b)
additionally, on the remaining items in the true transaction that were not selected
for inclusion in Step 2.
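To make the operator concrete, the following Python sketch assembles the cut and paste steps for a single transaction. The function name, the item-universe argument, and the uniform draw of j are illustrative assumptions based on the description above, not the exact formulation of [17].

```python
import random

def cut_and_paste(true_txn, item_universe, K_m, rho_m):
    """Hypothetical sketch of C&P randomization of one transaction."""
    m = len(true_txn)
    # "Cut" (Step 2): draw j uniformly from [1, K_m]; if j > m, set j = m,
    # so the entire original transaction is copied.
    j = random.randint(1, K_m)
    if j > m:
        j = m
    kept = set(random.sample(sorted(true_txn), j))
    randomized = set(kept)
    # "Paste" (Step 3): every item not selected above -- both items absent
    # from the true transaction and the remaining true items -- is inserted
    # independently with probability rho_m.
    for item in item_universe:
        if item not in kept and random.random() < rho_m:
            randomized.add(item)
    return randomized
```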
An issue in the C&P operator is the optimal selection of the ρ_m and K_m parameters,
and combinatorial formulae for determining their values are given in [17]. Through
a detailed set of experiments on real-life datasets, it was shown that even with a
challenging privacy requirement of not permitting any breaches with ρ > 50%,
mining a C&P-randomized database was able to correctly identify around 80 to 90%
of the “short” frequent itemsets, that is, frequent itemsets of length up to 3. The issue
of how to safely randomize and mine long transactions was left as an open problem,
since directly using C&P in such environments could result in unacceptably poor
accuracy.
The above work was significantly extended in [18] through, as discussed
in Sect. 2.1.0, the formulation of strict amplification-based privacy metrics, and the
delineation of a methodology for limiting the associated privacy breaches.
Distributed Databases Maintaining input data privacy was also considered in
[26, 52] in the context of databases that are distributed across a number of sites,
with each site only willing to share data mining results, but not the source data.
While [52] considered data that is vertically partitioned (i.e., each site hosts a disjoint
subset of the matrix columns), the complementary situation where the data is
horizontally partitioned (i.e., each site hosts a disjoint subset of the matrix rows) is
addressed in [26]. The solution technique in [52] requires generating and computing
a large set of independent linear equations; in fact, the number of equations and the
number of terms in each equation are proportional to the cardinality of the database.
It may therefore prove to be expensive for market-basket databases, which typically
contain millions of customer transactions. In [26], on the other hand, the problem is
modeled as a secure multi-party computation [24], and an algorithm that minimizes
the information shared without incurring much overhead on the mining process is
presented. Note that in these formulations, a pre-existing true database at each site
is assumed, i.e., a B2B model.
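As a small illustration of the two data layouts (the toy matrix below is an assumption for exposition, not an example from the cited papers):

```python
import numpy as np

# Toy transaction matrix: rows are customer transactions, columns are items.
D = np.array([[1, 0, 1, 1],
              [0, 1, 1, 0],
              [1, 1, 0, 0],
              [0, 0, 1, 1]])

# Vertical partitioning (setting of [52]): each site holds a disjoint
# subset of the item columns for all transactions.
site1_vert, site2_vert = D[:, :2], D[:, 2:]

# Horizontal partitioning (setting of [26]): each site holds a disjoint
# subset of the transaction rows over the full set of items.
site1_horiz, site2_horiz = D[:2, :], D[2:, :]
```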
Algebraic Distortion Zhang et al., in [57], presented an algebraic-distortion mechanism
that, unlike the statistical approach of the prior literature, requires two-way
communication between the miner and the users. If V_c is the current perturbed
database, then the miner computes E_k, the matrix whose columns are the eigenvectors
corresponding to the k largest eigenvalues of V_c^T V_c, where V_c^T is the transpose
of V_c. The choice of k trades off privacy against accuracy: larger values
of k give more accuracy and less privacy, while smaller values provide higher privacy
and less accuracy. E_k is supplied to the user, who then applies it to her true transaction
vector, discretizes the output, and then adds a noise component.
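A minimal numpy sketch of this exchange is given below. The exact way the user applies E_k (shown here as a projection onto the top-k subspace), the rounding used for discretization, and the Gaussian noise model are assumptions for illustration and may differ from the precise scheme of [57].

```python
import numpy as np

def miner_compute_Ek(V_c, k):
    """Miner side: eigenvectors of V_c^T V_c for the k largest eigenvalues."""
    eigvals, eigvecs = np.linalg.eigh(V_c.T @ V_c)  # eigenvalues in ascending order
    return eigvecs[:, -k:]                          # d x k matrix E_k

def user_perturb(true_vec, E_k, noise_std=0.1):
    """User side: apply E_k to the true transaction vector, discretize,
    then add a noise component before sending the result to the miner."""
    projected = E_k @ (E_k.T @ true_vec)   # assumed: projection onto span(E_k)
    discretized = np.rint(projected)       # assumed: rounding as discretization
    rng = np.random.default_rng()
    return discretized + rng.normal(0.0, noise_std, size=discretized.shape)
```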