Inline Algorithm - Association Rule Hiding for Data Mining

Database Reference

In-Depth Information

novel global measure of quality, called distance, is proposed in [23], which aims to

capture the proximity between the original and the sanitized database. Distance is

formally defined as follows:

å

n2[1;N];m2[1;M]

å

n2[1;N];m2[1;M]

T nm

=jfT nm 6= T nm gj (14.1)

dist(D O ;D) =

T nm

where T nm refers to the original database D O and T nm to its sanitized version D, N

corresponds to the number of transactions in D O ;D, and M to the number of items.

Due to the fact that the inline algorithm supports only item deletions from selected

transactions, minimizing the outcome of the above formula is exactly the same as

maximizing the number of ones left in D. Thus, the above formula can be restated

as follows:

!

å

n2[1;N];m2[1;M]

T nm

minfdist(D O ;D)g= max

(14.2)

Formula (14.2) captures the optimization criterion that is used by the inline al-

gorithm for the formulation of the CSP, discussed in detail in the following section.

The distance metric allows the selection of the optimal hiding solution, among the

exact ones (in the case that many exact hiding solutions exist for the problem at

hand), or the approximation of the optimal solution to the best possible extent (in

the case that an exact hiding solution cannot be found). Since only the support of the

minimal sensitive itemsets, i.e. the itemsets in S min , has to be reduced for the hiding

of the sensitive knowledge, the T nm values in D O that will be zeroed out in D will

have to correspond to items appearing in transactions that support itemsets in S min .

Having defined the optimization criterion that has to be minimized, the next target

is to identify the set of itemsets in which this criterion will be applied. Solving the

entire problem involving alljÃ(I)j1 = 2 M 1 itemsets in the lattice ofD O is well

known to be NP-hard (c.f. [10, 47]). Therefore, the inline algorithm has to apply the

optimization criterion to only a subset of the transactions of the original database.

Prior to discussing the selection process that is adopted by the inline algorithm, we

present the two possible constraints that can hold for a frequent itemset I in D.

An itemset I from D O will continue to be frequent in D if and only if

N

n=1 i m 2I T nm msup

sup(I;D) msup )

(14.3)

and will be infrequent otherwise, i.e. when

N

n=1 i m 2I T nm < msup

sup(I;D) < msup )

(14.4)

Based on the above inequalities, one can observe that any exact solution will

necessarily satisfy them for the entire problem (i.e., for all the itemsets in the lattice

of D). Specifically, for all frequent itemsets appearing in F 0 D , these itemsets should

Association Rule Hiding for Data Mining

Search WWH ::

Custom Search

Home