where $p_i$ is the relative frequency of class $i$ in the data set $T$. If the set $T$
contains $n$ samples and a split divides $T$ into two subsets $T_1$ and $T_2$, with
sizes $n_1$ and $n_2$, respectively, the Gini index of the divided data is given by
$$\mathit{Gini}_{\mathit{Split}}(T) = \frac{n_1}{n}\,\mathit{Gini}(T_1) + \frac{n_2}{n}\,\mathit{Gini}(T_2).$$
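As a minimal sketch of this formula, assuming that the Gini index of a set is one minus the sum of the squared class frequencies $p_i$, the following Python functions compute the index of a set and of a binary split from lists of class labels. The names `gini` and `gini_split` are illustrative and do not come from any library.

```python
from collections import Counter


def gini(labels):
    """Gini index of a collection of class labels: 1 - sum of p_i squared."""
    n = len(labels)
    if n == 0:
        return 0.0  # convention chosen here for an empty subset
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_split(left, right):
    """Weighted Gini index of a split of T into subsets left (T1) and right (T2)."""
    n = len(left) + len(right)
    if n == 0:
        return 0.0  # nothing to split
    return len(left) / n * gini(left) + len(right) / n * gini(right)
```

A subset whose records all belong to one class has index 0, and a two-class subset split evenly between the classes has index 0.5, so smaller values indicate purer subsets.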
The attribute achieving the smallest Gini index value is then chosen to split
the node. Note that there are three different cases to consider, depending on
the kind of attribute, namely, binary, categorical, or continuous. The last
is the case in our example. Since it would be very expensive to consider all
possible values to split a node, candidate values are taken as the actual values
of the attribute. Thus, for the YearEstablished attribute, we only consider the
values 1961, 1977, 1978, 1995, 2010, and 2012. For instance, splitting the node
using YearEstablished = 1977 results in a subset $T_1$ containing one record in
class 'B' and two records in class 'G' (for the values ≤ 1977) and another
subset $T_2$ containing two records in class 'B' and one record in class 'G' (for
the values > 1977). Thus, the Gini index will be
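$$\mathit{Gini}(T_1) = 1 - \left(\tfrac{1}{3}\right)^2 - \left(\tfrac{2}{3}\right)^2 = 0.444$$
$$\mathit{Gini}(T_2) = 1 - \left(\tfrac{2}{3}\right)^2 - \left(\tfrac{1}{3}\right)^2 = 0.444$$
$$\mathit{Gini}_{\mathit{YearEstablished}=1977}(T) = \tfrac{3}{6}(0.444) + \tfrac{3}{6}(0.444) = 0.444$$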
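This arithmetic can be checked with exact fractions. The sketch below hard-codes the class counts stated in the text (one 'B' and two 'G' records in the first subset, two 'B' and one 'G' in the second) and assumes the Gini index is one minus the sum of squared class frequencies; the helper name `gini_from_counts` is illustrative.

```python
from fractions import Fraction


def gini_from_counts(counts):
    """Exact Gini index from per-class record counts (1 - sum of p_i squared)."""
    n = sum(counts)
    return 1 - sum(Fraction(c, n) ** 2 for c in counts)


t1 = [1, 2]  # T1: one record in class 'B', two in class 'G'
t2 = [2, 1]  # T2: two records in class 'B', one in class 'G'
n1, n2 = sum(t1), sum(t2)
n = n1 + n2

g1 = gini_from_counts(t1)
g2 = gini_from_counts(t2)
g_split = Fraction(n1, n) * g1 + Fraction(n2, n) * g2

print(g1, g2, g_split)           # 4/9 4/9 4/9
print(round(float(g_split), 3))  # 0.444
```

Both subsets have the same class mix with the labels swapped, which is why the two indexes and their weighted average all equal 4/9.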
Doing the same, for example, with AnnualProfitCont = 1,000,000, we would
obtain a Gini index of 0.495 (we leave the computation to the reader). Since
0.444 < 0.495, we first select the attribute YearEstablished. At a second level, we will use
AnnualProfitCont for a finer classification.
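The selection procedure for a continuous attribute, which takes each observed value as a candidate threshold and keeps the one with the smallest split Gini index, might be sketched as follows. The records in the example are made up for illustration and are not the table used in the text; `gini` and `best_threshold` are hypothetical helpers, not library functions.

```python
from collections import Counter


def gini(labels):
    """Gini index of a list of class labels, assuming 1 - sum of p_i squared."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_threshold(values, labels):
    """Return (threshold, gini) for the split 'value <= threshold' with the smallest Gini.

    Candidate thresholds are the distinct observed values of the attribute.
    """
    n = len(values)
    best = None
    for t in sorted(set(values)):
        left = [lab for v, lab in zip(values, labels) if v <= t]
        right = [lab for v, lab in zip(values, labels) if v > t]
        if not right:
            continue  # skip the degenerate split where every record falls in T1
        score = len(left) / n * gini(left) + len(right) / n * gini(right)
        if best is None or score < best[1]:
            best = (t, score)
    return best


# Hypothetical toy data (NOT the table from the text): attribute values and class labels.
years = [1990, 1995, 2000, 2005, 2010, 2015]
classes = ["G", "G", "G", "B", "B", "B"]
threshold, score = best_threshold(years, classes)
```

Running the same function once per attribute and comparing the resulting indexes gives the attribute to split on first, which is the comparison made between YearEstablished and AnnualProfitCont.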
9.1.3 Clustering
Clustering or unsupervised classification is the process of grouping
objects into classes of similar ones. Classes are defined as collections of
objects with high intraclass similarity and low interclass similarity. Let us
motivate the use of clustering in the Northwind case study. In addition to the
classification that we described above, which is used to predict a customer's
behavior, the Northwind managers also want to have a first idea of the
groups of similar customers that can be defined based on their demographic
characteristics and on the purchases they have made. With this information,
for example, the marketing department can prepare customized offers or
packages. Note that to include information about the customer's purchases,
a join between the tables Customer and CustomerDemographics and the fact
table Sales must be performed, in order to generate a larger table that will be
the input to the clustering algorithm containing, for example, the number of
orders placed, the maximum and minimum amounts of the orders, and other
 