who are asked to provide personal information on Web forms to e-commerce service
providers. The motivation for doing so may be the (perhaps well-founded) worry
that the requested information may be misused by the service provider to harass the
customer. As a case in point, consider a pharmaceutical company that asks clients to
disclose the diseases they have suffered from in order to investigate correlations
in their occurrence, for example, “Adult females with malarial infections are also
prone to contract tuberculosis”. The company may be acquiring the data solely for
genuine data mining purposes that would eventually be reflected in better service to
the client. At the same time, however, the client might worry that if her medical
records are either inadvertently or deliberately disclosed, her future employment
opportunities may be adversely affected.
In this section, we study whether customers can be encouraged to provide correct
information by ensuring that the mining process cannot, with any reasonable degree
of certainty, violate their privacy, while at the same time producing sufficiently
accurate mining results. The difficulty in achieving these goals is that privacy and
accuracy are typically contradictory in nature, with the consequence that improving
one usually incurs a cost in the other [3]. A related issue is the degree of trust that
users need to place in third-party intermediaries. Finally, from the perspective of
practical viability, there are the time and resource overheads imposed on the data
mining process by supporting the privacy requirements.
Our study is carried out in the context of extracting association rules from large
historical databases [8], an extremely popular mining process that identifies
interesting correlations between database attributes, such as the one described in the
pharmaceutical example. By the end of Sect. 2, we will show that the state of the art
in input privacy is such that it is indeed possible to simultaneously achieve all the
desirable objectives (i.e., privacy, accuracy, and efficiency) for association rule
mining (ARM).
2.1 Problem Framework
In what follows, we describe the framework of the privacy mining problem in the
context of association rules.
Database Model. We assume that the original (true) database U consists of N
records, with each record having M categorical attributes. Note that boolean data
is a special case of this class and, further, that continuous-valued attributes can be
converted into categorical attributes by partitioning the domain of the attribute into
fixed-length intervals.
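The conversion of a continuous-valued attribute into a categorical one can be sketched as follows. This is only an illustrative sketch: the function name `discretize`, the closed range [low, high], and the choice to fold the upper endpoint into the last interval are assumptions made here, not details given in the text.

```python
import math


def discretize(value, low, high, width):
    """Return the 0-based index of the fixed-length interval containing value.

    The attribute domain [low, high] is split into intervals of length
    `width`; the upper endpoint `high` is assigned to the last interval
    (an assumption of this sketch).
    """
    if width <= 0:
        raise ValueError("width must be positive")
    if not (low <= value <= high):
        raise ValueError("value lies outside the attribute domain")
    n_intervals = math.ceil((high - low) / width)
    index = int((value - low) // width)
    # Clamp so that value == high falls into the final interval.
    return min(index, n_intervals - 1)


# Hypothetical attribute with domain [0, 100] and intervals of length 10.
categories = [discretize(v, 0, 100, 10) for v in (0, 37.5, 99.9, 100)]
```

Each returned index then serves as one category value of the resulting categorical attribute.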
The domain of attribute j is denoted by $S_U^j$, resulting in the domain $S_U$ of a
record in U being given by $S_U = \prod_{j=1}^{M} S_U^j$. We map the domain $S_U$ to
the index set $I_U = \{1, \ldots, |S_U|\}$, thereby modeling the database as a set of
N values from $I_U$. If we denote the i-th record of U as $U_i$, then
$U = \{U_i\}_{i=1}^{N}$, with $U_i \in I_U$.
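One possible way to realize the mapping from the record domain to the index set I_U is a mixed-radix encoding, sketched below. The text does not prescribe a particular ordering of the records, so the ordering used here (last attribute varying fastest) is an assumption, as are the function names and the placeholder attribute domains.

```python
from itertools import product
from math import prod


def domain_size(domains):
    """Size of the record domain: the product of the attribute domain sizes."""
    return prod(len(d) for d in domains)


def record_to_index(record, domains):
    """Map a record to a 1-based index in {1, ..., |S_U|} (mixed radix)."""
    index = 0
    for value, domain in zip(record, domains):
        index = index * len(domain) + domain.index(value)
    return index + 1


def index_to_record(index, domains):
    """Inverse of record_to_index."""
    index -= 1
    values = []
    for domain in reversed(domains):
        index, pos = divmod(index, len(domain))
        values.append(domain[pos])
    return tuple(reversed(values))


# Hypothetical placeholder domains for M = 3 categorical attributes.
domains = [("a1", "a2", "a3"), ("b1", "b2"), ("c1", "c2", "c3", "c4")]
all_indices = sorted(record_to_index(r, domains) for r in product(*domains))
```

Under this encoding every possible record receives a distinct index, so a database of N records can be stored as N integers drawn from the index set.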
To make this concrete, consider a database U with three categorical attributes, Age,
Sex, and Education, having the following category values: