Databases Reference
In-Depth Information
copy of the sensitive data that is the target of leakage, referred to as table S,
to compute a ranking of queries in the query log over the database suspected
of being the source of leakage. We refer to the output of each query in the
query log as query table Qi. The ranking system ranks each Qi with respect
to its similarity to S.
To perform the ranking, we use one of three measures of proximity be-
tween the sensitive table and the query table. The first, based on informa-
tion retrieval's document frequency similarity measure, computes proximity
by considering partial tuple matches between two tables, while factoring in
the likelihood of the match. Partial tuples are formed by matching a tuple
in Qi with a tuple in S and forming a partial tuple based upon matching at-
tribute pairs. Document frequency is computed as the number of query tables
having the partial tuple. The ranking method favors Qi's having partial tuples
with low document frequency. This method has high sensitivity to even single
partial tuple matches occurring in few query results.
The second, based on record linkage and expectation maximization, mea-
sures proximity by aggregating scores of individual tuple matches. This
method was originally devised to match individuals appearing in different
census records in the absence of surrogate keys uniquely identifying these in-
dividuals in each list such as social security number. The method is adaptive
and estimates the number of record pairs that are true matches and non-
matches based upon attributes that are common to both lists and the likeli-
hood of a random match. It is applied to our information leakage problem by
matching each query table Qi with S. This method has lower sensitivity than
the previous one to unique combinations of matching attributes appearing in
few Qi's.
The third is based on the minimum description length principle and mea-
sures proximity by computing the minimum length (maximum probability)
derivation of the sensitive table from the query result. In contrast with the pre-
vious two methods, this method also examines the similarity of tuples within
S when measuring the similarity of tuples in Qi with tuples in S. This method
is therefore better at differentiating Qi's having multiple matches with near
duplicates in S with Qi's having multiple matches with distinct information
in S.
The following scenario describes a practical application of query ranking.
Suppose that a major credit card company sells copies of its customer database
to its business aliates under a very strict confidentiality agreement. The af-
filiates are allowed to analyze the database for their own business intelligence
purposes, but may not disclose any of the customer information to a third
party. Several months later, a copy of this database is released into the public
domain. The credit card company would like to determine the source of the
disclosure so that it may terminate its business relationship with this aliate.
Using the HDB query ranking system, the company could evaluate the prox-
imity of the leaked database to the queries that accessed the database by using
the appropriate proximity measure algorithm, and rank the relative likelihood
Search WWH ::




Custom Search