being chosen from [1, K_m], rather than from [1, m]. If it turns out that j > m, then j is
set to m (which means that the entire original transaction is copied to the randomized
transaction). Apart from the cutoff threshold, another difference between C&P and
SaS is that the subsequent ρ_m randomized insertion (Step 3 above) is carried out
on (a) the items that are not present in the true transaction (as in SaS), and (b)
additionally, on the remaining items in the true transaction that were not selected
for inclusion in Step 2.
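To make the operator concrete, the following Python sketch assembles the cut and paste steps for a single transaction. The function name, the item-universe argument, and the uniform draw of j are illustrative assumptions based on the description above, not the exact formulation of [17].

```python
import random

def cut_and_paste(true_txn, item_universe, K_m, rho_m):
    """Hypothetical sketch of C&P randomization of one transaction."""
    m = len(true_txn)
    # "Cut" (Step 2): draw j uniformly from [1, K_m]; if j > m, set j = m,
    # so the entire original transaction is copied.
    j = random.randint(1, K_m)
    if j > m:
        j = m
    kept = set(random.sample(sorted(true_txn), j))
    randomized = set(kept)
    # "Paste" (Step 3): every item not selected above -- both items absent
    # from the true transaction and the remaining true items -- is inserted
    # independently with probability rho_m.
    for item in item_universe:
        if item not in kept and random.random() < rho_m:
            randomized.add(item)
    return randomized
```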
An issue in the C&P operator is the optimal selection of the ρ_m and K_m parameters,
and combinatorial formulae for determining their values are given in [17]. Through
a detailed set of experiments on real-life datasets, it was shown that even with a
challenging privacy requirement of not permitting any breaches with ρ > 50%,
mining a C&P-randomized database was able to correctly identify around 80 to 90%
of the “short” frequent itemsets, that is, frequent itemsets of length up to 3. The issue
of how to safely randomize and mine long transactions was left as an open problem,
since directly using C&P in such environments could result in unacceptably poor
accuracy.
The above work was significantly extended in [18] through, as discussed
in Sect. 2.1.0, the formulation of strict amplification-based privacy metrics, and the
delineation of a methodology for limiting the associated privacy breaches.
Distributed Databases Maintaining input data privacy was also considered in
[26, 52] in the context of databases that are distributed across a number of sites,
with each site only willing to share data mining results, but not the source data.
While [52] considered data that is vertically partitioned (i.e., each site hosts a disjoint
subset of the matrix columns), the complementary situation where the data is
horizontally partitioned (i.e., each site hosts a disjoint subset of the matrix rows) is
addressed in [26]. The solution technique in [52] requires generating and computing
a large set of independent linear equations; in fact, the number of equations and the
number of terms in each equation are proportional to the cardinality of the database.
It may therefore prove to be expensive for market-basket databases, which typically
contain millions of customer transactions. In [26], on the other hand, the problem is
modeled as a secure multi-party computation [24], and an algorithm that minimizes
the information shared without incurring much overhead on the mining process is
presented. Note that in these formulations, a pre-existing true database at each site
is assumed, i.e., a B2B model.
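As a small illustration of the two data layouts (the toy matrix below is an assumption for exposition, not an example from the cited papers):

```python
import numpy as np

# Toy transaction matrix: rows are customer transactions, columns are items.
D = np.array([[1, 0, 1, 1],
              [0, 1, 1, 0],
              [1, 1, 0, 0],
              [0, 0, 1, 1]])

# Vertical partitioning (setting of [52]): each site holds a disjoint
# subset of the item columns for all transactions.
site1_vert, site2_vert = D[:, :2], D[:, 2:]

# Horizontal partitioning (setting of [26]): each site holds a disjoint
# subset of the transaction rows over the full set of items.
site1_horiz, site2_horiz = D[:2, :], D[2:, :]
```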
Algebraic Distortion Zhang et al., in [57], presented an algebraic-distortion mechanism
that, unlike the statistical approach of the prior literature, requires two-way
communication between the miner and the users. If V_c is the current perturbed
database, then the miner computes E_k, the matrix whose columns are the eigenvectors
corresponding to the k largest eigenvalues of V_c^T V_c, where V_c^T is the transpose
of V_c. The choice of k trades off privacy against accuracy: larger values
of k give more accuracy and less privacy, while smaller values provide higher privacy
and less accuracy. E_k is supplied to the user, who then applies it to her true transaction
vector, discretizes the output, and then adds a noise component.
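A minimal numpy sketch of this exchange is given below. The exact way the user applies E_k (shown here as a projection onto the top-k subspace), the rounding used for discretization, and the Gaussian noise model are assumptions for illustration and may differ from the precise scheme of [57].

```python
import numpy as np

def miner_compute_Ek(V_c, k):
    """Miner side: eigenvectors of V_c^T V_c for the k largest eigenvalues."""
    eigvals, eigvecs = np.linalg.eigh(V_c.T @ V_c)  # eigenvalues in ascending order
    return eigvecs[:, -k:]                          # d x k matrix E_k

def user_perturb(true_vec, E_k, noise_std=0.1):
    """User side: apply E_k to the true transaction vector, discretize,
    then add a noise component before sending the result to the miner."""
    projected = E_k @ (E_k.T @ true_vec)   # assumed: projection onto span(E_k)
    discretized = np.rint(projected)       # assumed: rounding as discretization
    rng = np.random.default_rng()
    return discretized + rng.normal(0.0, noise_std, size=discretized.shape)
```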