The Discovery of Discrimination - Discrimination and Privacy in the Information Society


= FEMALE, AGE = GT 52 but not PERSONAL STATUS = MALE SINGLE, AGE = GT 52. A dataset $\mathcal{D}$ is a set of transactions. Intuitively, it corresponds to the transactions built from a table.

The support of an itemset $X$ w.r.t. $\mathcal{D}$ is the proportion of transactions in $\mathcal{D}$ supporting $X$: $supp(X) = |\{T \in \mathcal{D} \mid X \subseteq T\}| / |\mathcal{D}|$, where $|\cdot|$ is the cardinality operator.
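As a minimal sketch of the support definition above, the following Python snippet represents each transaction as a `frozenset` of item strings and counts the transactions that contain the itemset. The function name `supp`, the item spellings (including the item `AGE=LE_52`), and the four-row toy dataset are assumptions made for illustration; they are not taken from the German credit dataset.

```python
from typing import FrozenSet, Iterable, List

Transaction = FrozenSet[str]


def supp(itemset: Iterable[str], dataset: List[Transaction]) -> float:
    """Proportion of transactions in `dataset` containing every item of `itemset`."""
    if not dataset:
        raise ValueError("support is undefined for an empty dataset")
    x = frozenset(itemset)
    # A transaction T supports X when X is a subset of T.
    return sum(1 for t in dataset if x <= t) / len(dataset)


# Toy dataset, invented for illustration only.
toy = [
    frozenset({"PERSONAL_STATUS=FEMALE", "AGE=GT_52", "CLASS=BAD"}),
    frozenset({"PERSONAL_STATUS=FEMALE", "AGE=GT_52", "CLASS=GOOD"}),
    frozenset({"PERSONAL_STATUS=MALE_SINGLE", "AGE=GT_52", "CLASS=GOOD"}),
    frozenset({"PERSONAL_STATUS=FEMALE", "AGE=LE_52", "CLASS=BAD"}),
]

# Two of the four transactions contain both items.
print(supp({"PERSONAL_STATUS=FEMALE", "AGE=GT_52"}, toy))
```

Using sets makes the subset test $X \subseteq T$ a single comparison, at the cost of assuming each item appears at most once per transaction.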

An association rule is an expression $X \to Y$, where $X$ and $Y$ are disjoint itemsets. $X$ is called the premise and $Y$ is called the consequence of the association rule. We say that $X \to Y$ is a classification rule if $Y$ is a class item. As an example, PERSONAL STATUS = FEMALE, AGE = GT 52 → CLASS = BAD is a classification rule for the German credit dataset.

The support of $X \to Y$ is the support of the itemset obtained by the union of $X$ and $Y$, in symbols $supp(X \to Y) = supp(X, Y)$, where $X, Y$ is the union of $X$ and $Y$. Intuitively, the support of a rule states how often the rule is satisfied in the dataset. A support of 0.1 for the rule PERSONAL STATUS = FEMALE, AGE = GT 52 → CLASS = BAD means that 10% of the transactions support both the premise and the consequence of the rule, i.e., support PERSONAL STATUS = FEMALE, AGE = GT 52, CLASS = BAD. The confidence of $X \to Y$, defined when $supp(X) > 0$, is:

$$conf(X \to Y) = supp(X, Y) / supp(X).$$

Confidence states the proportion of transactions supporting $Y$ among those supporting $X$. A confidence of 0.7 for the rule above means that 70% of the transactions supporting PERSONAL STATUS = FEMALE, AGE = GT 52 also support CLASS = BAD. Support and confidence range over $[0, 1]$. Since the seminal paper by Agrawal & Srikant (1994), many well-explored algorithms have been designed for extracting the set of frequent itemsets, i.e., itemsets whose support is at least a specified minimum. A survey on frequent pattern mining is due to Han et al. (2007); a survey on interestingness measures for association rules is reported by Geng & Hamilton (2006); a repository of implementations is maintained by Goethals (2010).
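The definitions of rule support and confidence can be sketched in a few lines of Python. This is an illustrative sketch rather than a reference implementation: the function names `supp`, `rule_supp` and `conf`, the item spellings, and the ten-row toy dataset are all assumptions, and the data is invented rather than drawn from the German credit dataset.

```python
from typing import FrozenSet, Iterable, List

Transaction = FrozenSet[str]


def supp(itemset: Iterable[str], dataset: List[Transaction]) -> float:
    """Proportion of transactions containing every item of `itemset`."""
    if not dataset:
        raise ValueError("support is undefined for an empty dataset")
    x = frozenset(itemset)
    return sum(1 for t in dataset if x <= t) / len(dataset)


def rule_supp(premise, consequence, dataset) -> float:
    """Support of the rule premise -> consequence: support of their union."""
    x, y = frozenset(premise), frozenset(consequence)
    if x & y:
        raise ValueError("premise and consequence must be disjoint")
    return supp(x | y, dataset)


def conf(premise, consequence, dataset) -> float:
    """Confidence of premise -> consequence; requires supp(premise) > 0."""
    s = supp(premise, dataset)
    if s == 0:
        raise ValueError("confidence is undefined when supp(premise) = 0")
    return rule_supp(premise, consequence, dataset) / s


# Hypothetical item names and toy data (10 transactions).
F, M = "PERSONAL_STATUS=FEMALE", "PERSONAL_STATUS=MALE_SINGLE"
OLD, YOUNG = "AGE=GT_52", "AGE=LE_52"
BAD, GOOD = "CLASS=BAD", "CLASS=GOOD"

toy = (
    [frozenset({F, OLD, BAD})] * 1
    + [frozenset({F, OLD, GOOD})] * 1
    + [frozenset({M, OLD, GOOD})] * 5
    + [frozenset({F, YOUNG, BAD})] * 3
)

print(rule_supp({F, OLD}, {BAD}, toy))  # 1 of 10 transactions
print(conf({F, OLD}, {BAD}, toy))       # 1 of the 2 transactions supporting the premise
```

Raising an error when the premise has zero support mirrors the condition $supp(X) > 0$ in the definition instead of silently returning a default value.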

5.2.2 Measures of Discrimination

A critical problem in the analysis of discrimination is precisely to quantify the degree of discrimination suffered by a given group (say, an ethnic group) in a given context (say, a geographic area and/or an income range) with respect to a decision (say, credit denial). We rephrase this problem in a rule-based setting: if $A$ is the condition (i.e., the itemset) that characterizes the group which is suspected of being discriminated against, $B$ is the itemset that characterizes the context, and $C$ is the decision (class) item, then the analysis of discrimination is pursued by studying the rule $A, B \to C$, together with its confidence with respect to the underlying decision dataset, namely, how often such a rule is true in the dataset itself.
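To make the role of confidence in this setting concrete, consider a worked computation with purely hypothetical counts (they are assumptions for illustration and do not come from any dataset discussed here): a decision dataset of 1000 transactions, of which 200 support the group and context $A, B$, and 140 of those also support the decision $C$.

```latex
% Hypothetical counts, for illustration only:
% |D| = 1000, 200 transactions support A, B; 140 support A, B, C.
\[
  supp(A, B) = \frac{200}{1000} = 0.2,
  \qquad
  supp(A, B, C) = \frac{140}{1000} = 0.14,
\]
\[
  conf(A, B \to C) = \frac{supp(A, B, C)}{supp(A, B)}
                   = \frac{0.14}{0.2} = 0.7 .
\]
```

Under these assumed counts, the decision $C$ applies to 70% of the transactions describing the group $A$ in the context $B$.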

Civil rights laws explicitly identify the groups to be protected against discrimination, e.g., women or black people. With our syntax, those groups can be represented as items, e.g., SEX = FEMALE or RACE = BLACK. Therefore, we can assume that the laws provide us with a set of items, which we call potentially discriminatory (PD)
