data that may be useful to group similar customers. However, this is only one of the many data preparation tasks that must be performed over the data in order to guarantee reliable results. We show a concrete example over Analysis Services later in this chapter.
Data clustering discovers distribution patterns of the data set. The most
popular clustering methods are based on similarity or distance between data
points. Thus, a notion of distance between clusters must be defined in order
to evaluate the quality of a clustering model. Typically, the Euclidean distance is used, as we explain next.
Let us call $d(x, y)$ the distance between two points $x$ and $y$ in a cluster. In addition, for a given cluster $C_k$, its center $r_k$ is computed, often as the vector mean of the points in the cluster. To determine the quality of a certain clustering configuration $\mathcal{C} = \{C_1, \ldots, C_K\}$, we need to compute two functions: the within-cluster variation $wc(\mathcal{C})$ and the between-cluster variation $bc(\mathcal{C})$. These functions are used to measure the intraclass and interclass similarity, respectively. The within-cluster variation is first computed for each cluster and then for the whole clustering configuration, as indicated below:

$$wc(C_k) = \sum_{x(i) \in C_k} d(x(i), r_k)^2; \qquad wc(\mathcal{C}) = \sum_{k=1}^{K} wc(C_k).$$
The between-cluster variation is measured by the distance between cluster centers and is computed as follows:

$$bc(\mathcal{C}) = \sum_{1 \le j < k \le K} d(r_j, r_k)^2.$$
Finally, the quality of a clustering can be defined by the score function $bc(\mathcal{C})/wc(\mathcal{C})$.
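To make these definitions concrete, here is a minimal Python sketch (an illustration, not code from this book; the function name clustering_score is hypothetical, and the data points are assumed to be the rows of a NumPy array) that computes $wc(\mathcal{C})$, $bc(\mathcal{C})$, and their ratio for a given assignment of points to clusters:

```python
import numpy as np

def clustering_score(points, labels, k):
    """Compute wc(C), bc(C), and the score bc(C)/wc(C).

    points: (n, d) array whose rows are the data points x(i).
    labels: (n,) array assigning each point to a cluster in 0..k-1.
    """
    # Center r_k of each cluster: the vector mean of its points.
    centers = np.array([points[labels == j].mean(axis=0) for j in range(k)])

    # wc(C): sum over clusters of the squared distances d(x(i), r_k)^2.
    wc = sum(((points[labels == j] - centers[j]) ** 2).sum() for j in range(k))

    # bc(C): sum of squared distances between distinct cluster centers (j < k).
    bc = sum(((centers[j] - centers[l]) ** 2).sum()
             for j in range(k) for l in range(j + 1, k))

    return wc, bc, bc / wc

# Two tight, well-separated groups give a small wc and a large bc,
# hence a high score.
points = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]])
labels = np.array([0, 0, 1, 1])
print(clustering_score(points, labels, k=2))
```

Note that a good clustering is one whose clusters are internally compact (small $wc$) and mutually far apart (large $bc$), which is exactly what a high ratio captures.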
Clustering algorithms aim at optimizing this score function. The problem consists in searching the space of assignments of points to clusters and finding the one that optimizes the score; with the score defined above, this means maximizing $bc(\mathcal{C})/wc(\mathcal{C})$. The typical clustering algorithm, from which many different variants are built, is the K-means algorithm. Here, the number $K$ of clusters is fixed at the start. The basic algorithm randomly picks $K$ cluster centers and assigns each point to the cluster whose center is closest according to the Euclidean distance. For each cluster, a new center is then computed, replacing the initial one. Each object is reassigned to the cluster with the smallest squared Euclidean distance, the cluster centers are recalculated based on the new membership assignment, and the procedure is repeated until no object changes cluster. A scheme of the algorithm is given next:
K-Means Algorithm
INPUT: A data set T containing n data points $(x_1, \ldots, x_n)$
OUTPUT: A set of K clusters $C_1, \ldots, C_K$
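As a complement to the scheme, the following sketch implements the steps just described in Python. It is my own illustration (not the book's code), assuming NumPy; it tests convergence via center stability, which coincides with no object changing cluster, and for simplicity assumes no cluster ever becomes empty:

```python
import numpy as np

def k_means(points, k, seed=None):
    """Basic K-means: points is an (n, d) array, k is fixed at the start.

    Returns the final cluster centers and the cluster label of each point.
    """
    rng = np.random.default_rng(seed)
    # Randomly pick K of the data points as the initial cluster centers.
    centers = points[rng.choice(len(points), size=k, replace=False)]

    while True:
        # Assign each point to the cluster with the closest center
        # (smallest squared Euclidean distance).
        dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)

        # Recompute each center as the vector mean of its cluster
        # (assumes no cluster is empty).
        new_centers = np.array([points[labels == j].mean(axis=0)
                                for j in range(k)])

        # Stop when the centers no longer move, i.e., no object
        # changes cluster between iterations.
        if np.allclose(new_centers, centers):
            return new_centers, labels
        centers = new_centers
```

Because the initial centers are picked at random, different runs may converge to different local optima; in practice the algorithm is often restarted several times and the clustering with the best score is kept.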