data that may be useful to group similar customers. However, this is only one of the many data preparation tasks that must be performed over the data in order to guarantee reliable results. We show a concrete example over Analysis Services later in this chapter.
Data clustering discovers distribution patterns of the data set. The most
popular clustering methods are based on similarity or distance between data
points. Thus, a notion of distance between clusters must be defined in order
to evaluate the quality of a clustering model. Typically, the Euclidean distance is used, as we explain next.
Let us call $d(x, y)$ the distance between two points $x$ and $y$ in a cluster. In addition, for a given cluster $C_k$, its center $r_k$ is computed, often as the vector mean of the points in the cluster. To determine the quality of a certain clustering configuration $\mathcal{C} = \{C_1, \ldots, C_K\}$, we need to compute two functions: the within-cluster variation $wc(\mathcal{C})$ and the between-cluster variation $bc(\mathcal{C})$. These functions are used to measure the intraclass and interclass similarity, respectively. The within-cluster variation is first computed for each cluster and then for the whole clustering configuration, as indicated below:

$$wc(C_k) = \sum_{x(i) \in C_k} d(x(i), r_k)^2; \qquad wc(\mathcal{C}) = \sum_{k=1}^{K} wc(C_k).$$
The between-cluster variation is measured by the distance between cluster centers and is computed as follows:

$$bc(\mathcal{C}) = \sum_{1 \le j < k \le K} d(r_j, r_k)^2.$$
Finally, the quality of a clustering can be defined by the score function $bc(\mathcal{C})/wc(\mathcal{C})$.
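To make these definitions concrete, here is a minimal Python sketch (an illustration, not code from this book; the function name clustering_score is hypothetical, and the data points are assumed to be the rows of a NumPy array) that computes $wc(\mathcal{C})$, $bc(\mathcal{C})$, and their ratio for a given assignment of points to clusters:

```python
import numpy as np

def clustering_score(points, labels, k):
    """Compute wc(C), bc(C), and the score bc(C)/wc(C).

    points: (n, d) array whose rows are the data points x(i).
    labels: (n,) array assigning each point to a cluster in 0..k-1.
    """
    # Center r_k of each cluster: the vector mean of its points.
    centers = np.array([points[labels == j].mean(axis=0) for j in range(k)])

    # wc(C): sum over clusters of the squared distances d(x(i), r_k)^2.
    wc = sum(((points[labels == j] - centers[j]) ** 2).sum() for j in range(k))

    # bc(C): sum of squared distances between distinct cluster centers (j < k).
    bc = sum(((centers[j] - centers[l]) ** 2).sum()
             for j in range(k) for l in range(j + 1, k))

    return wc, bc, bc / wc

# Two tight, well-separated groups give a small wc and a large bc,
# hence a high score.
points = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]])
labels = np.array([0, 0, 1, 1])
print(clustering_score(points, labels, k=2))
```

Note that a good clustering is one whose clusters are internally compact (small $wc$) and mutually far apart (large $bc$), which is exactly what a high ratio captures.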
Clustering algorithms aim at optimizing this score function. The problem consists in searching the space of assignments of points to clusters and finding the one that optimizes the score; with the score defined above, this means maximizing $bc(\mathcal{C})/wc(\mathcal{C})$. The typical clustering algorithm, from which many different variants are built, is the K-means algorithm. Here, the number $K$ of clusters is fixed at the start. The basic algorithm randomly picks $K$ cluster centers and assigns each point to the cluster whose center is closest according to the Euclidean distance. For each cluster, a new center is then computed, replacing the initial one. Each object is reassigned to the cluster with the smallest squared Euclidean distance, the cluster centers are recalculated based on the new membership assignment, and the procedure is repeated until no object changes cluster. A scheme of the algorithm is given next:
K-Means Algorithm
INPUT: A data set T containing n data points $(x_1, \ldots, x_n)$
OUTPUT: A set of K clusters $C_1, \ldots, C_K$
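As a complement to the scheme, the following sketch implements the steps just described in Python. It is my own illustration (not the book's code), assuming NumPy; it tests convergence via center stability, which coincides with no object changing cluster, and for simplicity assumes no cluster ever becomes empty:

```python
import numpy as np

def k_means(points, k, seed=None):
    """Basic K-means: points is an (n, d) array, k is fixed at the start.

    Returns the final cluster centers and the cluster label of each point.
    """
    rng = np.random.default_rng(seed)
    # Randomly pick K of the data points as the initial cluster centers.
    centers = points[rng.choice(len(points), size=k, replace=False)]

    while True:
        # Assign each point to the cluster with the closest center
        # (smallest squared Euclidean distance).
        dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)

        # Recompute each center as the vector mean of its cluster
        # (assumes no cluster is empty).
        new_centers = np.array([points[labels == j].mean(axis=0)
                                for j in range(k)])

        # Stop when the centers no longer move, i.e., no object
        # changes cluster between iterations.
        if np.allclose(new_centers, centers):
            return new_centers, labels
        centers = new_centers
```

Because the initial centers are picked at random, different runs may converge to different local optima; in practice the algorithm is often restarted several times and the clustering with the best score is kept.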