Databases Reference
In-Depth Information
Jaccard Distance or Similarity
This gives the distance between a set of objects—for example, a
list of Cathy's friends A = Kahn , Mark , Laura , . . . and a list of
Rachel's friends B = Mladen , Kahn , Mark , . . . —and says how
similar those two sets are: J A , B =
A B
A B
.
Mahalanobis Distance
Also can be used between two real-valued vectors and has the
advantage over Euclidean distance that it takes into account cor‐
T S −1
relation and is scale-invariant. d
x , y
=
x y
x y
,
where S is the covariance matrix.
Hamming Distance
Can be used to find the distance between two strings or pairs of
words or DNA sequences of the same length. The distance be‐
tween olive and ocean is 4 because aside from the “o” the other 4
letters are different. The distance between shoe and hose is 3 be‐
cause aside from the “e” the other 3 letters are different. You just
go through each position and check whether the letters the same
in that position, and if not, increment your count by 1.
Manhattan
This is also a distance between two real-valued k-dimensional
vectors. The image to have in mind is that of a taxi having to travel
the city streets of Manhattan, which is laid out in a grid-like fash‐
ion (you can't cut diagonally across buildings). The distance is
therefore defined as d
=∑ i k
x , y
x i y i , where i is the i th ele‐
ment of each of the vectors.
There are many more distance metrics available to you depending on
your type of data. We start with a Google search when we're not sure
where to start.
What if your attributes are a mixture of kinds of data? This happens
in the case of the movie ratings example: some were numerical at‐
tributes, such as budget and number of actors, and one was categorical,
genre. But you can always define your own custom distance metric.
For example, you can say if movies are the same genre, that will con‐
tribute “0” to their distance. But if they're of a different genre, that will
contribute “10,” where you picked the value 10 based on the fact that
this was on the same scale as budget (millions of dollars), which is in
Search WWH ::




Custom Search