Data Analytics: Exploiting the Data Warehouse - Data Warehouse Systems: Design and Implementation

Database Reference

In-Depth Information

where p i is the relative frequency of class i in the data set T .Iftheset T

contains n samples and a split divides T into two subsets T 1 and T 2 , with

sizes n 1 and n 2 , respectively, the Gini index of the divided data is given by

Gini Split ( T )= n 1

Gini ( T 1 )+ n 2

Gini ( T 2 ) .

The attribute achieving the smallest Gini index value is then chosen to split

the node. Note that there are three different cases to consider, depending on

the kind of attribute, namely, binary, categorical, or continuous. The latter

is the case in our example. Since it would be very expensive to consider all

possible values to split a node, candidate values are taken as the actual values

of the attribute. Thus, for the YearEstablished attribute, we only consider the

values 1961, 1977, 1978, 1995, 2010, and 2012. For instance, splitting the node

using YearEstablished = 1977 results in a subset T 1 containing one record in

class ' B ' and two records in class ' G '(forthevalues

1977) and another

subset T 2 containing two records in class ' B ' and one record in class ' G '(for

the values > 1977). Thus, the Gini index will be

≤

( 3 ) 2

( 3 ) 2 =0 . 444

Gini ( T 1 )=1

−

( 3 ) 2 =0 . 4444

Gini YearEstablished =1977 ( T )= 6 (0 . 444) + 6 (0 . 444) = 0 . 444

( 3 ) 2

Gini ( T 2 )=1

−

Doing the same, for example, with AnnualProfitCont =1 , 000 , 000, we would

obtain a Gini index of 0.495 (we leave the computation to the reader); thus,

we select first the attribute YearEstablished . At a second level, we will use

AnnualProfitCont for a finer classification.

9.1.3 Clustering

Clustering or unsupervised classification is the process of grouping

objects into classes of similar ones. Classes are defined as collections of

objects with high intraclass similarity and low interclass similarity. Let us

motivate the use of clustering in the Northwind case study. In addition to the

classification that we described above, whichisusedtopredictacustomer's

behavior, the Northwind managers also want to have a first idea of the

groups of similar customers that can be defined based on their demographic

characteristics and in the purchases they had made. With this information,

for example, the marketing department can prepare customized offers or

packages. Note that to include information about the customer's purchases,

a join between the tables Customer and CustomerDemographics and the fact

table Sales must be performed, in order to generate a larger table that will be

the input to the clustering algorithm containing, for example, the number of

orders placed, the maximum and minimum amounts of the orders, and other

Data Warehouse Systems: Design and Implementation

Search WWH ::

Custom Search

Home