Database Reference
In-Depth Information
16.2.2 Optimal Solutions in the Hybrid Methodology
Having identified the size Q of database D X , the next step is to properly construct
these transactions to facilitate knowledge hiding. Since the actual values of all the
items in the database extension are unknown at this point, the hiding algorithm
represents them with binary variables that will be instantiated later on in the process.
In what follows, let u qm be the binary variable corresponding to the m-th item of
transaction T q 2D X (q2[1; Q]; m2[1; M]), when D O is in the sanitization process.
Under this formulation, the goal of the hybrid hiding algorithm [26] becomes to
optimally adjust all the binary variables involved in all the transactions of D X in
order to hide the sensitive itemsets, while minimally affecting the nonsensitive ones
in a way that they remain frequent in the sanitized outcome. As already presented in
Chapter 2, this is the notion of an exact hiding solution.
In a typical hiding scenario, distinct feasible solutions are of different quality.
Thus, an optimization criterion needs to be incorporated into the hiding strategy
to guide the algorithm to the best possible among all the feasible solutions. The
metric of distance, introduced in [23], is also applied by the hybrid methodology
to quantify the notion of “harm” caused to the original dataset by the sanitization
process. In the context of the hybrid hiding algorithm, the distance between database
D O and its sanitized version D is measured based on the extension D X as follows:
å
q2[1;Q];m2[1;M]
dist(D O ;D) =
u qm
(16.2)
As one can observe, the minimum impact of D can be quantified as the minimum
distance between D O and D. Thus, the objective of the hiding algorithm becomes to
appropriately set the u qm variables such that all the sensitive knowledge is hidden,
while the distance is minimized. An interesting property of the distance measure is
that it allows the hiding algorithm to ensure high quality in the sanitized database D
and to identify the optimal solution, if one exists. The notion of an optimal solution,
in the context of the hybrid algorithm of [26], is presented in Definition 16.2. Based
on the notion of distance and the size of the produced extension D X of database
D O , the database quality can be defined as follows:
Definition 16.1. (Database quality) Given the sanitized database D, its original
version D O and the produced extension D X , the quality of database D is measured
both in the size of D X and in the number of binary variables set to '1' in the trans-
actions of D X (i.e., the distance metric). In both cases, lower values correspond to
better solutions in terms of quality.
Through (16.1) the hybrid hiding algorithm is capable of identifying the lower
bound in the size of D X that is necessary to accommodate the hiding of the sensi-
tive knowledge in D O . Furthermore, through the use of (16.2) as an optimization
criterion, the hybrid hiding approach is guaranteed to identify a feasible solution
having minimum impact on database D. Given the previous definitions, we proceed
to define the notion of an optimal solution in the context of the hybrid algorithm.
Search WWH ::




Custom Search