Database Reference
In-Depth Information
14.3.1 Datasets
All real datasets that were used to evaluate the inline algorithm are publicly available
through the FIMI repository (http://fimi.cs.helsinki.fi/). Datasets BMS-WebView-1
and BMS-WebView-2 contain click stream data from the Blue Martini Software,
Inc. and were used for the KDD Cup 2000 [41]. The mushroom dataset was prepared
by Roberto Bayardo (University of California, Irvine) [11]. These datasets demon-
strate varying characteristics in terms of the number of transactions and items and
the average transaction lengths. Table 14.4 summarizes them.
The primary bottleneck of the inline approach was the time taken to run the fre-
quent itemset mining algorithm as well as the time needed to solve the produced
CSPs. The thresholds of minimum support were properly selected to ensure a large
amount of frequent itemsets. Among these itemsets, some itemsets were randomly
selected and characterized as sensitive. Several experiments were conducted to hide
1, 2, 4 and 5-sensitive itemsets. Since the sanitization methodology of the inline al-
gorithm is item-based, the higher the number of items in a sensitive itemset, the more
the u nm variables involved and therefore the more constraints need to be solved.
To avoid the explosion in the number of constraints two pruning techniques were
enforced. Firstly, all tautologies, i.e. inequalities that always hold, were removed.
Secondly, the remaining inequalities were partitioned into sets, the overlapping sets
were identified and only the most specific inequality was kept from each overlap-
ping set. Both pruning techniques can be easily applied and relieve the solver from
a good portion of unnecessary work. All experiments were conducted on a PC run-
ning Linux on an Intel Pentium D, 3.2 Ghz processor. The integer programs were
solved by using ILOG CPLEX 9.0 [36].
14.3.2 Evaluation Methodology
The inline algorithm was evaluated using the metric of distance between databases
D O and D. The lower the distance, the better the quality of the hiding solution.
Since even in non-exact solutions the sensitive itemsets are successfully hidden, it
is important to notice that in any case the sanitized databaseD can be safely released
to untrusted third parties as it effectively protects all the sensitive knowledge.
As a final comment on the evaluation procedure that was employed in [23], we
should mention that it is not sensible to compare the attained experimental results
against the widely known metric of accuracy (as used in other state-of-the-art algo-
rithms). The reason is that in the inline algorithm it is the distance (i.e., the number
of item modifications) and not the accuracy (i.e., the number of transactions from
D O ) that is minimized. These two goals are highly distinct since multiple item mod-
ification may occur to the same transaction and yet produce a penalty of one when
using the accuracy measure, although the actual damage induced to the database is
much higher. Thus, the metric of accuracy would be inappropriate to evaluate the
outcome of the presented methodology.
Search WWH ::




Custom Search