In real applications, there may be several distinct values in the domain of an attribute A. For each attribute value v of A, let N_{Ti} be the number of tuples that take value v of A and belong to class C_i; the conditional entropy can then be defined as

$$E_A(v) = \sum_{i=1}^{n} \frac{N_{Ti}}{N} \times E(T_i) \qquad (7)$$
The information gain of attribute A can then be computed as

$$g(A) = E(T) - E_A(v) \qquad (8)$$
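Formulas (7) and (8) can be sketched in code as follows. This is a minimal illustration, not the chapter's implementation: the function names are ours, and base-10 logarithms are assumed because that base reproduces the chapter's worked value of E(T).

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a class distribution, -sum p_i * log10(p_i).
    Base-10 logarithms are assumed, matching the chapter's numbers."""
    n = len(labels)
    return -sum((c / n) * math.log10(c / n) for c in Counter(labels).values())

def conditional_entropy(values, labels):
    """Formula (7): for each value v of attribute A, weight the entropy of
    the tuples holding v by their share of all tuples, then sum."""
    n = len(values)
    result = 0.0
    for v in set(values):
        subset = [lab for val, lab in zip(values, labels) if val == v]
        result += (len(subset) / n) * entropy(subset)
    return result

def gain(values, labels):
    """Formula (8): g(A) = E(T) - E_A."""
    return entropy(labels) - conditional_entropy(values, labels)
```

For a perfectly separating attribute, the conditional entropy is zero and the gain equals the full entropy E(T), e.g. `gain(["a", "a", "b", "b"], ["C1", "C1", "C2", "C2"])` returns log10(2).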
For example, consider a fraction of the results (shown in Table 1) returned by the MSN house&home Web database for a query with the condition "Price between 250000 and 350000 and City = Seattle". We use it to describe how to obtain the best partition attribute using the formulas defined above.
Here, we assume the decision attributes are View, Schooldistrict, Livingarea, and SqFt. We first compute the entropy of the tree T (logarithms are base 10):

$$E(T) = E(C_1, C_2, C_3) = -\left(\frac{5}{15}\log\frac{5}{15} + \frac{6}{15}\log\frac{6}{15} + \frac{4}{15}\log\frac{4}{15}\right) = 0.471293$$
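This value can be checked directly; a minimal sketch, where the cluster sizes 5, 6, and 4 are read off Table 1:

```python
import math

# Cluster sizes from Table 1: |C1| = 5, |C2| = 6, |C3| = 4 out of N = 15.
counts = [5, 6, 4]
n = sum(counts)

# E(T) = -sum(p_i * log10(p_i)); base-10 logs reproduce the chapter's
# value of roughly 0.4713.
e_t = -sum((c / n) * math.log10(c / n) for c in counts)
print(round(e_t, 4))  # 0.4713
```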
We then compute the entropy of each decision attribute. Attribute "View" contains four distinct values, 'Water', 'Mountain', 'GreenBelt', and 'Street'; the entropy of each value is
Table 1. The fraction of query results

ID  Price   Bedrooms  Livingarea  Schooldistrict  View       SqFt  Cluster
01  329000  2         Burien      Highline        Water      712   C1
02  335000  2         Burien      Tukwila         Water      712   C1
03  325000  1         Richmond    Shoreline       Water      530   C1
04  325000  3         Richmond    Shoreline       Water      620   C1
05  328000  3         Richmond    Shoreline       Water      987   C1
06  264950  1         Burien      Seattle         Mountain   530   C2
07  264950  1         C-seattle   Seattle         Mountain   530   C2
08  328000  3         Burien      Seattle         GreenBelt  987   C2
09  349000  2         Burien      Seattle         Water      955   C2
10  339950  2         C-seattle   Seattle         GreenBelt  665   C2
11  339950  3         Burien      Seattle         Street     852   C2
12  264950  4         Richmond    Highline        Street     1394  C3
13  264950  5         C-seattle   Seattle         Mountain   1400  C3
14  338000  5         Burien      Tukwila         Street     1254  C3
15  340000  3         Burien      Tukwila         GreenBelt  1014  C3
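The per-value entropies for "View" can be reproduced from Table 1. This is an illustrative sketch under the same base-10 assumption; the `rows` list simply transcribes the View and Cluster columns of the table.

```python
import math
from collections import Counter

# (View, Cluster) pairs transcribed from Table 1.
rows = [
    ("Water", "C1"), ("Water", "C1"), ("Water", "C1"), ("Water", "C1"),
    ("Water", "C1"), ("Mountain", "C2"), ("Mountain", "C2"),
    ("GreenBelt", "C2"), ("Water", "C2"), ("GreenBelt", "C2"),
    ("Street", "C2"), ("Street", "C3"), ("Mountain", "C3"),
    ("Street", "C3"), ("GreenBelt", "C3"),
]

def entropy(labels):
    """Base-10 entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log10(c / n) for c in Counter(labels).values())

# Entropy of the class distribution among the tuples holding each View value,
# e.g. 'Water' covers five C1 tuples and one C2 tuple.
view_entropy = {}
for value in ("Water", "Mountain", "GreenBelt", "Street"):
    labels = [cluster for view, cluster in rows if view == value]
    view_entropy[value] = entropy(labels)

for value, e in view_entropy.items():
    print(f"{value}: {e:.6f}")
```

Note that 'Mountain', 'GreenBelt', and 'Street' each cover three tuples split 2-to-1 across two clusters, so their entropies coincide.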