Table 3.1 Example 2.1's 2 × 2 Contingency Table Data

                male        female        Total
fiction         250 (90)    200 (360)       450
non fiction     50 (210)    1000 (840)     1050
Total           300         1200           1500
Note: Are gender and preferred reading correlated?
Using Eq. (3.1) for $\chi^2$ computation, we get

\[
\chi^2 = \frac{(250 - 90)^2}{90} + \frac{(50 - 210)^2}{210} + \frac{(200 - 360)^2}{360} + \frac{(1000 - 840)^2}{840}
       = 284.44 + 121.90 + 71.11 + 30.48 = 507.93.
\]
For this 2 × 2 table, the degrees of freedom are $(2 - 1)(2 - 1) = 1$. For 1 degree of freedom,
the $\chi^2$ value needed to reject the hypothesis at the 0.001 significance level is 10.828
(taken from the table of upper percentage points of the $\chi^2$ distribution, typically
available from any textbook on statistics). Since our computed value is above this, we can
reject the hypothesis that gender and preferred reading are independent and conclude
that the two attributes are (strongly) correlated for the given group of people.
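The arithmetic above can be checked with a short script. This is a minimal sketch, not the book's own code: it assumes the usual expected-frequency formula (row total × column total / n), which reproduces the parenthesized values in Table 3.1, and the function name `chi_square` and the dictionary layout are illustrative choices.

```python
# Observed counts from Table 3.1 (rows: preferred reading, columns: gender).
observed = {
    "fiction": {"male": 250, "female": 200},
    "non_fiction": {"male": 50, "female": 1000},
}


def chi_square(table):
    """Return (chi2, expected, dof) for a contingency table given as nested dicts.

    Assumption: expected count = row total * column total / grand total.
    """
    rows = list(table)
    cols = list(next(iter(table.values())))
    row_totals = {r: sum(table[r][c] for c in cols) for r in rows}
    col_totals = {c: sum(table[r][c] for r in rows) for c in cols}
    n = sum(row_totals.values())

    expected = {
        r: {c: row_totals[r] * col_totals[c] / n for c in cols} for r in rows
    }
    # Sum of (observed - expected)^2 / expected over every cell.
    chi2 = sum(
        (table[r][c] - expected[r][c]) ** 2 / expected[r][c]
        for r in rows
        for c in cols
    )
    dof = (len(rows) - 1) * (len(cols) - 1)
    return chi2, expected, dof


chi2, expected, dof = chi_square(observed)

# Critical value quoted in the text for 1 degree of freedom at the 0.001 level.
CRITICAL_VALUE = 10.828
reject_independence = chi2 > CRITICAL_VALUE

print(f"chi2 = {chi2:.4f}, dof = {dof}, reject independence: {reject_independence}")
```

Summing the unrounded terms gives a total of about 507.94. The text's 507.93 comes from adding the four terms after each has been rounded to two decimals. The conclusion is unaffected, since either figure is far above 10.828.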
Correlation Coefficient for Numeric Data
For numeric attributes, we can evaluate the correlation between two attributes, A and B,
by computing the correlation coefficient (also known as Pearson's product moment
coefficient, named after its inventor, Karl Pearson). This is
\[
r_{A,B} = \frac{\sum_{i=1}^{n} (a_i - \bar{A})(b_i - \bar{B})}{n \sigma_A \sigma_B}
        = \frac{\sum_{i=1}^{n} (a_i b_i) - n \bar{A} \bar{B}}{n \sigma_A \sigma_B}, \tag{3.3}
\]
where $n$ is the number of tuples, $a_i$ and $b_i$ are the respective values of A and B in tuple $i$,
$\bar{A}$ and $\bar{B}$ are the respective mean values of A and B, $\sigma_A$ and $\sigma_B$ are the
respective standard deviations of A and B (as defined in Section 2.2.2), and $\sum (a_i b_i)$ is the
sum of the AB cross-product (i.e., for each tuple, the value for A is multiplied by the value for B
in that tuple). Note that $-1 \le r_{A,B} \le +1$. If $r_{A,B}$ is greater than 0, then A and B are
positively correlated, meaning that the values of A increase as the values of B increase. The higher
the value, the stronger the correlation (i.e., the more each attribute implies the other).
Hence, a higher value may indicate that A (or B) may be removed as a redundancy.
If the resulting value is equal to 0, then A and B are uncorrelated: there is no linear
relationship between them. If the resulting value is less than 0, then A and B are negatively
correlated, where the values of one attribute increase as the values of the other attribute
decrease. This means that each attribute discourages the other. Scatter plots can also be
used to view correlations between attributes (Section 2.2.3). For example, Figure 2.8's
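Eq. (3.3) can be sketched in a few lines of Python. This is an illustrative implementation under stated assumptions rather than a definitive one: the standard deviations are taken to be population standard deviations (dividing by n), the function name `pearson_r` is hypothetical, and the sample lists are made-up data chosen only to show the sign of the result.

```python
import math


def pearson_r(a, b):
    """Correlation coefficient via the cross-product form of Eq. (3.3).

    Assumption: sigma_A and sigma_B are population standard deviations
    (dividing by n, not n - 1).
    """
    if len(a) != len(b) or not a:
        raise ValueError("a and b must be non-empty and of equal length")
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    sigma_a = math.sqrt(sum((x - mean_a) ** 2 for x in a) / n)
    sigma_b = math.sqrt(sum((y - mean_b) ** 2 for y in b) / n)
    if sigma_a == 0 or sigma_b == 0:
        raise ValueError("correlation is undefined for a constant attribute")
    # Sum of the AB cross-product: a_i * b_i for each tuple i.
    cross = sum(x * y for x, y in zip(a, b))
    return (cross - n * mean_a * mean_b) / (n * sigma_a * sigma_b)


# Hypothetical data: B rises with A in the first pair, falls in the second.
a_values = [1, 2, 3, 4, 5]
b_rising = [2, 4, 6, 8, 10]
b_falling = [10, 8, 6, 4, 2]

r_positive = pearson_r(a_values, b_rising)
r_negative = pearson_r(a_values, b_falling)
print(f"rising: {r_positive:.3f}, falling: {r_negative:.3f}")
```

The cross-product form needs only running sums of a, b, and a·b, which is convenient when the data is scanned once. On data with large means and small variance, the mean-centered form on the left of Eq. (3.3) is generally the numerically safer choice.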