any technique that requires knowledge of other user records becomes infeasible, and
therefore the B2B approaches cannot be applied here.
The bulk of the work in privacy-preserving data mining of association rules has
addressed the B2C environment (e.g., [7, 6, 17, 45]), where the user's true data have
to be anonymized at the source itself. Note that the anonymization process has to be
implemented by a program which could be supplied either by the service provider
or, more likely, by an independent trusted third-party vendor. Further, this program
has to be verifiably secure—therefore, it must be simple in construction, eliminating
the possibility of the true data being surreptitiously supplied to the service provider.
In a nutshell, the goal of these techniques is to ensure the privacy of the raw local
data at the source, but, at the same time, to support accurate reconstruction of the
global data mining models at the destination.
Within the above framework, the general approach has been to adopt a data
perturbation strategy, wherein each individual user's true data are altered in some
manner before being forwarded to the service provider. Here, there are two possibilities: statistical distortion, which has been the predominant technique, and algebraic distortion, proposed in [57]. In the statistical approach, a common randomizing algorithm is employed at all user sites, and this algorithm is subsequently disclosed to
the eventual data miner. For example, in the MASK technique [45], which is targeted at "market-basket"-style sparse boolean databases, each bit in the true user transaction vector is independently flipped with a parametrized probability.
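The bit-flipping step can be sketched in a few lines (the retention-probability parameter `p` and the function name are illustrative; [45] analyzes how to tune this parameter to balance privacy against mining accuracy):

```python
import random

def mask_perturb(transaction, p):
    """MASK-style distortion: keep each bit of a boolean transaction
    vector with probability p, flip it with probability 1 - p."""
    return [bit if random.random() < p else 1 - bit for bit in transaction]

# A sparse "market-basket" row: 1 means the item appears in the basket.
true_row = [1, 0, 0, 1, 0, 0, 0, 1]
noisy_row = mask_perturb(true_row, p=0.9)
```

Because the same parameter is used at every user site and disclosed to the miner, the miner can statistically invert the distortion to estimate itemset supports over the aggregate without learning any individual's true row.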
While there is only one-way communication from users to the service provider in
the statistical approach, the algebraic scheme, in marked contrast, requires two-way
communication between the data miner and the user. Here, the data miner supplies
a user-specific perturbation vector, and the user then returns the perturbed data after
applying this vector on the true data, discretizing the output and adding some noise.
The vector depends on the current contents of the perturbed database held by the miner; consequently, for large enterprises, the data collection process itself could become a bottleneck in the efficient running of the system.
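The round trip can be schematized as follows; this is only a structural sketch of "apply vector, discretize, add noise" (the multiplicative form, the rounding step, and the noise model are our illustrative assumptions, not the actual transformation of [57]):

```python
import random

def algebraic_perturb(true_vec, perturb_vec):
    """Structural sketch of the two-way scheme: the miner sends a
    user-specific perturbation vector; the user applies it to the true
    data, discretizes the result, and adds noise before replying."""
    out = []
    for x, a in zip(true_vec, perturb_vec):
        v = round(x * a)                # apply the vector, then discretize
        v += random.choice([-1, 0, 1])  # add a small amount of noise
        out.append(v)
    return out

# Miner -> user: a vector tailored to this user; user -> miner: perturbed data.
reply = algebraic_perturb([1, 0, 1, 1], [0.8, 1.2, 0.5, 1.0])
```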
Within the statistical approach, there are two further possibilities: (a) a simple independent attribute perturbation, wherein the value of each attribute in the user record is perturbed independently of the rest; or (b) a more generalized dependent attribute perturbation, where the perturbation of each attribute may be affected by the perturbations of the other attributes in the record. Most of the statistical perturbation
techniques in the literature, including [18, 17, 45], fall into the independent attribute perturbation category. Notice, however, that this is in a sense antithetical to the original goal of association rule mining, which is to identify correlations across attributes. This limitation is addressed in [4], which employs a dependent attribute
perturbation model, with each attribute in the user's data vector being perturbed
based on its own value, as well as the perturbed values of the earlier attributes.
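A minimal sketch of such dependent perturbation over boolean attributes follows (the specific bias rule and all parameter names are hypothetical, not the model of [4]):

```python
import random

def dependent_perturb(record, p_keep, p_boost):
    """Perturb each bit based on its own value and the *perturbed*
    value of the preceding attribute: the retention probability is
    p_keep, raised by p_boost when the previous perturbed bit was 1."""
    out, prev = [], 0
    for bit in record:
        p = min(1.0, p_keep + (p_boost if prev == 1 else 0.0))
        out.append(bit if random.random() < p else 1 - bit)
        prev = out[-1]
    return out
```

The essential point, mirrored in [4], is that the coin flipped for attribute i is conditioned on the perturbed outputs of the earlier attributes, so cross-attribute correlations are not destroyed wholesale.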
Another model of privacy-preserving data mining is the k-anonymity model [7, 46], where each record value is replaced with a corresponding generalized value. Specifically, each perturbed record cannot be distinguished from at least k − 1 other records in the data. However, this falls into the B2C model since the intermediate database-forming server can learn, or recover, precise records.
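The generalization idea can be illustrated with a toy example (the age-bucketing scheme and helper names are ours): exact values are coarsened until every record matches at least k − 1 others.

```python
from collections import Counter

def generalize_age(age, width=10):
    """Replace an exact age with a width-year interval, e.g. 34 -> '30-39'."""
    lo = (age // width) * width
    return f"{lo}-{lo + width - 1}"

def is_k_anonymous(records, k):
    """True if every (generalized) record occurs at least k times."""
    counts = Counter(tuple(r) for r in records)
    return all(c >= k for c in counts.values())

rows = [(generalize_age(a), sex) for a, sex in
        [(34, "F"), (38, "F"), (21, "M"), (27, "M")]]
# rows: [('30-39', 'F'), ('30-39', 'F'), ('20-29', 'M'), ('20-29', 'M')]
```

Unlike the randomization schemes above, the server performing this generalization sees the precise records first, which is exactly the B2C caveat noted in the text.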