similarity and the Manhattan distance functions. The cosine similarity function is
often chosen to compare two documents based on the frequency of each word that
appears in each of the documents [2]. For two points, p = (p1, p2) and q = (q1, q2), the Manhattan distance, d1, between p and q is expressed as shown in Equation 4.6.

d1(p, q) = |p1 - q1| + |p2 - q2|     (4.6)
The Manhattan distance function is analogous to the distance traveled by a car in
a city, where the streets are laid out in a rectangular grid (such as city blocks). In
Euclidean distance, the measurement is made in a straight line. Using Equation
4.6 , the distance from (1, 1) to (4, 5) would be |1 - 4| + |1 - 5| = 7. From an
optimization perspective, if the Manhattan distance is used for a
clustering analysis, the median is a better choice for the centroid than the
mean [2].
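The ideas above can be sketched in a few lines of Python. The function names below (`manhattan`, `median_centroid`) are illustrative, not from the text; the computation follows Equation 4.6 and the median-centroid recommendation.

```python
# Sketch of the Manhattan (L1) distance from Equation 4.6, plus a
# median-based centroid, the natural choice when clustering with this
# metric (the k-medians variant mentioned in the text).
from statistics import median

def manhattan(p, q):
    """d1(p, q) = sum over coordinates of |p_i - q_i|."""
    return sum(abs(pi - qi) for pi, qi in zip(p, q))

def median_centroid(points):
    """Coordinate-wise median of a cluster; minimizes total L1 distance."""
    return tuple(median(coord) for coord in zip(*points))

# Worked example from the text: distance from (1, 1) to (4, 5)
print(manhattan((1, 1), (4, 5)))  # 7
```

Note that `median_centroid` takes the median of each coordinate independently, which is what minimizes the summed Manhattan distance within a cluster, just as the coordinate-wise mean minimizes summed squared Euclidean distance.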
K-means clustering is applicable to objects that can be described by attributes that
are numerical with a meaningful distance measure. From Chapter 3, interval and
ratio attribute types can certainly be used. However, k-means does not handle
categorical variables well. For example, suppose a clustering analysis is to be
conducted on new car sales. Among other attributes, such as the sale price, the
color of the car is considered important. Although one could assign numerical
values to the color, such as red = 1, yellow = 2, and green = 3, it is not useful
to consider that yellow is as close to red as yellow is to green from a clustering
perspective. In such cases, it may be necessary to use an alternative clustering
methodology. Such methods are described in the next section.
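The pitfall of the numeric color encoding can be made concrete with a short sketch. The encoding below is the hypothetical one from the text (red = 1, yellow = 2, green = 3), not a recommended practice:

```python
# Illustration: arbitrary numeric codes impose an order and spacing that
# the colors do not actually have, misleading any distance-based method.
codes = {"red": 1, "yellow": 2, "green": 3}

d_red_yellow = abs(codes["yellow"] - codes["red"])      # 1
d_yellow_green = abs(codes["green"] - codes["yellow"])  # 1
d_red_green = abs(codes["green"] - codes["red"])        # 2

# Yellow appears equally "close" to red and green, and red appears
# twice as far from green -- distinctions the data does not support.
print(d_red_yellow, d_yellow_green, d_red_green)  # 1 1 2
```

Under this encoding, k-means would treat green cars as farther from red cars than yellow cars are, even though all three colors are simply distinct categories.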