= FEMALE, AGE = GT 52 but not PERSONAL STATUS = MALE SINGLE, AGE = GT 52. A dataset D
is a set of transactions. Intuitively, it corresponds to the transactions built from a table.
The support of an itemset X w.r.t. D is the proportion of transactions in D supporting X:
    supp(X) = |{T ∈ D | X ⊆ T}| / |D|,
where | | is the cardinality operator.
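The definition of support can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes each transaction is a frozenset of item strings, the underscore spelling of the items is our own encoding, and the four transactions are made up rather than taken from the German credit dataset.

```python
from typing import FrozenSet, Iterable, List

Transaction = FrozenSet[str]


def supp(itemset: Iterable[str], dataset: List[Transaction]) -> float:
    """Proportion of transactions T in the dataset such that itemset is a subset of T."""
    if not dataset:
        raise ValueError("support is undefined for an empty dataset")
    itemset = frozenset(itemset)
    supporting = sum(1 for t in dataset if itemset <= t)
    return supporting / len(dataset)


# Toy dataset, invented for illustration only.
dataset = [
    frozenset({"PERSONAL_STATUS=FEMALE", "AGE=GT_52", "CLASS=BAD"}),
    frozenset({"PERSONAL_STATUS=FEMALE", "AGE=GT_52", "CLASS=GOOD"}),
    frozenset({"PERSONAL_STATUS=MALE_SINGLE", "AGE=GT_52", "CLASS=GOOD"}),
    frozenset({"PERSONAL_STATUS=FEMALE", "AGE=LE_52", "CLASS=BAD"}),
]

x = {"PERSONAL_STATUS=FEMALE", "AGE=GT_52"}
print(supp(x, dataset))  # 2 of the 4 transactions contain x, so this prints 0.5
```

Note that the empty itemset is supported by every transaction, so its support is 1.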
An association rule is an expression X → Y, where X and Y are disjoint itemsets. X is called the
premise and Y is called the consequence of the association rule. We say that X → Y is a
classification rule if Y is a class item. As an example,
PERSONAL STATUS = FEMALE, AGE = GT 52 → CLASS = BAD is a classification
rule for the German credit dataset.
The support of X → Y is the support of the itemset obtained by the union of X and Y, in symbols
supp(X, Y), where X, Y is the union of X and Y. Intuitively, the
support of a rule states how often the rule is satisfied in the dataset. A support of
0.1 for the rule PERSONAL STATUS = FEMALE, AGE = GT 52 → CLASS = BAD
means that 10% of the transactions support both the premise and the consequence
of the rule, i.e., support PERSONAL STATUS = FEMALE, AGE = GT 52, CLASS = BAD.
The confidence of X → Y, defined when supp(X) > 0, is:
    conf(X → Y) = supp(X, Y) / supp(X).
Confidence states the proportion of transactions supporting Y among those supporting X.
A confidence of 0.7 for the rule above means that 70% of the transactions supporting
PERSONAL STATUS = FEMALE, AGE = GT 52 also support CLASS = BAD.
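The two rule measures can be sketched directly from the formulas supp(X, Y) and conf(X → Y) = supp(X, Y) / supp(X). The function names, the item encoding, and the five transactions below are assumptions made for illustration; they are not the German credit data.

```python
from typing import FrozenSet, Iterable, List

Transaction = FrozenSet[str]


def supp(itemset: Iterable[str], dataset: List[Transaction]) -> float:
    """Proportion of transactions that contain every item of the itemset."""
    itemset = frozenset(itemset)
    return sum(1 for t in dataset if itemset <= t) / len(dataset)


def rule_supp(x: Iterable[str], y: Iterable[str], dataset: List[Transaction]) -> float:
    """supp(X, Y): support of the union of premise X and consequence Y."""
    return supp(frozenset(x) | frozenset(y), dataset)


def conf(x: Iterable[str], y: Iterable[str], dataset: List[Transaction]) -> float:
    """conf(X -> Y) = supp(X, Y) / supp(X), defined only when supp(X) > 0."""
    if frozenset(x) & frozenset(y):
        raise ValueError("premise and consequence must be disjoint")
    premise_supp = supp(x, dataset)
    if premise_supp == 0:
        raise ValueError("confidence is undefined when supp(X) = 0")
    return rule_supp(x, y, dataset) / premise_supp


# Toy dataset, invented for illustration only.
dataset = [
    frozenset({"PERSONAL_STATUS=FEMALE", "AGE=GT_52", "CLASS=BAD"}),
    frozenset({"PERSONAL_STATUS=FEMALE", "AGE=GT_52", "CLASS=BAD"}),
    frozenset({"PERSONAL_STATUS=FEMALE", "AGE=GT_52", "CLASS=GOOD"}),
    frozenset({"PERSONAL_STATUS=MALE_SINGLE", "AGE=GT_52", "CLASS=GOOD"}),
    frozenset({"PERSONAL_STATUS=FEMALE", "AGE=LE_52", "CLASS=GOOD"}),
]

premise = {"PERSONAL_STATUS=FEMALE", "AGE=GT_52"}
consequence = {"CLASS=BAD"}

# 2 of 5 transactions support premise and consequence together;
# 2 of the 3 transactions supporting the premise also support the consequence.
print(rule_supp(premise, consequence, dataset))
print(conf(premise, consequence, dataset))
```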
Support and confidence range over [0, 1]. Since the seminal paper by Agrawal &
Srikant (1994), many well-explored algorithms have been designed for extracting the
set of frequent itemsets, i.e., itemsets whose support is at least a specified minimum. A survey
on frequent pattern mining is due to Han et al. (2007); a survey on interestingness
measures for association rules is reported by Geng & Hamilton (2006); a repository
of implementations is maintained by Goethals (2010).
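To make the notion of frequent itemsets concrete, the sketch below enumerates them by brute force for a given minimum support. It is deliberately naive and is not one of the algorithms from the cited literature: it tests every candidate of each size and is exponential in the number of items. The function name, the item encoding, and the toy transactions are assumptions for illustration.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List

Transaction = FrozenSet[str]


def frequent_itemsets(dataset: List[Transaction], min_supp: float) -> Dict[FrozenSet[str], float]:
    """Return every non-empty itemset whose support is at least min_supp."""
    if not dataset:
        return {}
    items = sorted(set().union(*dataset))
    n = len(dataset)
    result: Dict[FrozenSet[str], float] = {}
    for k in range(1, len(items) + 1):
        found_any = False
        for candidate in combinations(items, k):
            itemset = frozenset(candidate)
            support = sum(1 for t in dataset if itemset <= t) / n
            if support >= min_supp:
                result[itemset] = support
                found_any = True
        # A superset can never have higher support than its subsets, so if no
        # itemset of size k is frequent, no larger itemset can be frequent.
        if not found_any:
            break
    return result


# Toy dataset, invented for illustration only.
dataset = [
    frozenset({"AGE=GT_52", "PERSONAL_STATUS=FEMALE", "CLASS=BAD"}),
    frozenset({"AGE=GT_52", "PERSONAL_STATUS=FEMALE"}),
    frozenset({"AGE=GT_52", "CLASS=BAD"}),
    frozenset({"PERSONAL_STATUS=FEMALE"}),
]

for itemset, support in frequent_itemsets(dataset, min_supp=0.5).items():
    print(sorted(itemset), support)
```

With a minimum support of 0.5, the three single items and two of the three pairs are frequent in this toy dataset; the pair PERSONAL_STATUS=FEMALE, CLASS=BAD and the full triple each occur in only one of four transactions.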
5.2.2 Measures of Discrimination
A critical problem in the analysis of discrimination is precisely to quantify the degree
of discrimination suffered by a given group (say, an ethnic group) in a given
context (say, a geographic area and/or an income range) with respect to a decision
(say, credit denial). We rephrase this problem in a rule-based setting: if A is the
condition (i.e., the itemset) that characterizes the group which is suspected of being
discriminated against, B is the itemset that characterizes the context, and C is the
decision (class) item, then the analysis of discrimination is pursued by studying the
rule A, B → C, together with its confidence with respect to the underlying decision
dataset, namely, how often such a rule is true in the dataset itself.
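As a rough illustration of this setting, the sketch below builds transactions from table rows and computes the confidence of a rule A, B → C. The attribute names (SEX, AREA, CREDIT), their values, and the five rows are hypothetical and chosen only to mirror the group, context, and decision roles described above. Since conf(X → Y) = supp(X, Y) / supp(X) and both supports share the denominator |D|, the code divides the two transaction counts directly.

```python
from typing import Dict, FrozenSet, Iterable, List

Transaction = FrozenSet[str]


def to_transaction(row: Dict[str, str]) -> Transaction:
    """Turn a table row into a transaction of 'ATTRIBUTE=value' items."""
    return frozenset(f"{attribute}={value}" for attribute, value in row.items())


def conf(premise: Iterable[str], consequence: Iterable[str], dataset: List[Transaction]) -> float:
    """Confidence of premise -> consequence, computed from transaction counts."""
    premise = frozenset(premise)
    both = premise | frozenset(consequence)
    n_premise = sum(1 for t in dataset if premise <= t)
    if n_premise == 0:
        raise ValueError("confidence is undefined when no transaction supports the premise")
    n_both = sum(1 for t in dataset if both <= t)
    return n_both / n_premise


# Hypothetical decision table, invented for illustration only.
table = [
    {"SEX": "FEMALE", "AREA": "NORTH", "CREDIT": "DENIED"},
    {"SEX": "FEMALE", "AREA": "NORTH", "CREDIT": "DENIED"},
    {"SEX": "FEMALE", "AREA": "NORTH", "CREDIT": "GRANTED"},
    {"SEX": "MALE", "AREA": "NORTH", "CREDIT": "GRANTED"},
    {"SEX": "FEMALE", "AREA": "SOUTH", "CREDIT": "GRANTED"},
]
dataset = [to_transaction(row) for row in table]

a = {"SEX=FEMALE"}      # group suspected of being discriminated against
b = {"AREA=NORTH"}      # context
c = {"CREDIT=DENIED"}   # decision (class) item

# Three transactions support A, B; two of them also support C.
print(conf(a | b, c, dataset))
```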
Civil rights laws explicitly identify the groups to be protected against discrimination,
e.g., women or black people. With our syntax, those groups can be represented
as items, e.g., SEX = FEMALE or RACE = BLACK. Therefore, we can assume that the
laws provide us with a set of items, which we call potentially discriminatory (PD)