Databases Reference
In-Depth Information
Algorithm: DBSCAN:
a density-based clustering algorithm.
Input:
D
: a data set containing
n
objects,
: the radius parameter, and
MinPts
: the neighborhood density threshold.
Output:
A set of density-based clusters.
Method:
mark all objects as
unvisited
;
(1)
(2)
do
(3)
randomly select an unvisited object
p
;
(4)
mark
p
as
visited
;
(5)
if
the
-neighborhood of
p
has at least
MinPts
objects
(6)
create a new cluster
C
, and add
p
to
C
;
(7)
let
N
be the set of objects in the
-neighborhood of
p
;
(8)
for
each point
p
0
in
N
(9) if
p
0
is
unvisited
(10) mark
p
0
as
visited
;
(11) if the -neighborhood of
p
0
has at least
MinPts
points,
add those points to
N
;
(12) if
p
0
is not yet a member of any cluster, add
p
0
to
C
;
(13)
end for
(14) output
C
;
(15)
else
mark
p
as
noise
;
(16)
until
no object is
unvisited
;
Figure10.15
DBSCAN algorithm.
are usually empirically set and difficult to determine, especially for real-world, high-
dimensional data sets. Most algorithms are sensitive to these parameter values: Slightly
different settings may lead to very different clusterings of the data. Moreover, real-world,
high-dimensional data sets often have very skewed distributions such that their intrin-
sic clustering structure may not be well characterized by a single set of
global
density
parameters.
Note that density-based clusters are monotonic with respect to the neighborhood
threshold. That is, in DBSCAN, for a fixed
MinPts
value and two neighborhood thresh-
olds,
1
<
2
, a cluster
C
with respect to
1
and
MinPts
must be a subset of a cluster
C
0
with respect to
2
and
MinPts
. This means that if two objects are in a density-based
cluster, they must also be in a cluster with a lower density requirement.
To overcome the difficulty in using one set of global parameters in clustering analy-
sis, a cluster analysis method called
OPTICS
was proposed. OPTICS does not explicitly
produce a data set clustering. Instead, it outputs a
cluster ordering
. This is a linear list