similarity and the Manhattan distance functions. The cosine similarity function is
often chosen to compare two documents based on the frequency of each word that
appears in each of the documents [2]. For two points, p = (p1, p2) and q = (q1, q2), the Manhattan distance, d1, between p and q is expressed as shown in Equation 4.6.

d1(p, q) = |p1 - q1| + |p2 - q2|     (4.6)
The Manhattan distance function is analogous to the distance traveled by a car in
a city, where the streets are laid out in a rectangular grid (such as city blocks). In
Euclidean distance, the measurement is made in a straight line. Using Equation
4.6 , the distance from (1, 1) to (4, 5) would be |1 - 4| + |1 - 5| = 7. From an
optimization perspective, if the Manhattan distance is used for a
clustering analysis, the median is a better choice for the centroid than the
mean [2].
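The ideas above can be sketched in a few lines of Python. The function names below (`manhattan`, `median_centroid`) are illustrative, not from the text; the computation follows Equation 4.6 and the median-centroid recommendation.

```python
# Sketch of the Manhattan (L1) distance from Equation 4.6, plus a
# median-based centroid, the natural choice when clustering with this
# metric (the k-medians variant mentioned in the text).
from statistics import median

def manhattan(p, q):
    """d1(p, q) = sum over coordinates of |p_i - q_i|."""
    return sum(abs(pi - qi) for pi, qi in zip(p, q))

def median_centroid(points):
    """Coordinate-wise median of a cluster; minimizes total L1 distance."""
    return tuple(median(coord) for coord in zip(*points))

# Worked example from the text: distance from (1, 1) to (4, 5)
print(manhattan((1, 1), (4, 5)))  # 7
```

Note that `median_centroid` takes the median of each coordinate independently, which is what minimizes the summed Manhattan distance within a cluster, just as the coordinate-wise mean minimizes summed squared Euclidean distance.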
K-means clustering is applicable to objects that can be described by attributes that
are numerical with a meaningful distance measure. From Chapter 3, interval and
ratio attribute types can certainly be used. However, k-means does not handle
categorical variables well. For example, suppose a clustering analysis is to be
conducted on new car sales. Among other attributes, such as the sale price, the
color of the car is considered important. Although one could assign numerical
values to the color, such as red = 1, yellow = 2, and green = 3, it is not useful
to consider that yellow is as close to red as yellow is to green from a clustering
perspective. In such cases, it may be necessary to use an alternative clustering
methodology. Such methods are described in the next section.
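The pitfall of the numeric color encoding can be made concrete with a short sketch. The encoding below is the hypothetical one from the text (red = 1, yellow = 2, green = 3), not a recommended practice:

```python
# Illustration: arbitrary numeric codes impose an order and spacing that
# the colors do not actually have, misleading any distance-based method.
codes = {"red": 1, "yellow": 2, "green": 3}

d_red_yellow = abs(codes["yellow"] - codes["red"])      # 1
d_yellow_green = abs(codes["green"] - codes["yellow"])  # 1
d_red_green = abs(codes["green"] - codes["red"])        # 2

# Yellow appears equally "close" to red and green, and red appears
# twice as far from green -- distinctions the data does not support.
print(d_red_yellow, d_yellow_green, d_red_green)  # 1 1 2
```

Under this encoding, k-means would treat green cars as farther from red cars than yellow cars are, even though all three colors are simply distinct categories.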