Fig. 2. Overview of connecting the layers
Although other distance measures exist, e.g., Jaccard, Dice, and Russell/Rao
[21, 22], which are useful when comparing dichotomous data, the distance
measure we use is the Euclidean metric (see Formula 1). It lets us measure
distances between continuous multi-dimensional variables, i.e.,
$v_i \in \mathbb{R}^d$.

$$d(v_i, v_j) = \left( \sum_{k=0}^{n-1} (v_{ik} - v_{jk})^2 \right)^{\frac{1}{2}} \qquad (1)$$
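Formula 1 can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function name `euclidean_distance` and the choice of plain sequences of floats as feature vectors are assumptions made for the example.

```python
import math
from typing import Sequence


def euclidean_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two feature vectors of equal length.

    Hypothetical helper: sums the squared component-wise differences
    and takes the square root, as in Formula 1.
    """
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of components")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
```

For example, `euclidean_distance([0.0, 0.0], [3.0, 4.0])` evaluates to `5.0`. On Python 3.8 or later, the standard library function `math.dist` computes the same quantity.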
Various clustering algorithms have been proposed, e.g., DBSCAN [23].
DBSCAN finds clusters based on a density measure, i.e., it groups data
instances that lie within a maximal distance of each other. Hence, points near
each other are grouped in the same cluster. This may lead to arbitrarily shaped
clusters, including spherical cluster shapes. On the one hand, arbitrarily shaped
clusters do not lead to clear results; on the other hand, clusters in our
case might have varying densities, which is problematic for DBSCAN. The
algorithm of our choice is fixed-width clustering [9, 24]. The algorithm, described
in Figure 3, has a better runtime complexity than other
clustering algorithms, e.g., standard k-means, since it computes clusters in
just a single pass over the data instances (fingerprints). In fixed-width
clustering, clusters have a maximal width and a cluster center, called the centroid.
Each data instance is compared, via its feature vector, against the existing
clusters: if its distance (under the distance measure) exceeds the maximal
width, it creates a new cluster; otherwise it joins the cluster, at a certain
distance from that cluster's centroid. The fewer data instances a cluster
contains, the more probable it is that those data instances are in fact outliers. This is basically the
assumption discussed before: normal behaviour accounts for the majority of data
instances, whereas abnormal behaviour is represented by only a few data instances
(which represent potential attacks). Hence, clusters containing fewer instances
than a user-configured threshold represent anomalous data points. For instance
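The single-pass procedure and the size threshold described above can be sketched as follows. This is a hedged illustration under stated assumptions, not the algorithm of Figure 3: the names `Cluster`, `fixed_width_clustering`, and `anomalous_indices` are hypothetical, each instance is compared against the nearest existing centroid, and the centroid is updated as the running mean of its members. The source does not confirm either of the last two choices.

```python
import math
from dataclasses import dataclass, field
from typing import List, Sequence


@dataclass
class Cluster:
    centroid: List[float]
    members: List[int] = field(default_factory=list)  # indices of assigned points


def _distance(a: Sequence[float], b: Sequence[float]) -> float:
    # Euclidean distance, as in Formula 1.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def fixed_width_clustering(points: Sequence[Sequence[float]],
                           width: float) -> List[Cluster]:
    """Cluster the points in a single pass with a maximal cluster width."""
    clusters: List[Cluster] = []
    for index, point in enumerate(points):
        # Assumption: compare against the nearest existing centroid.
        nearest = min(clusters,
                      key=lambda c: _distance(point, c.centroid),
                      default=None)
        if nearest is None or _distance(point, nearest.centroid) > width:
            # Too far from every cluster: this point starts a new one.
            clusters.append(Cluster(centroid=list(point), members=[index]))
        else:
            nearest.members.append(index)
            n = len(nearest.members)
            # Assumption: centroid is the running mean of the members.
            nearest.centroid = [c + (p - c) / n
                                for c, p in zip(nearest.centroid, point)]
    return clusters


def anomalous_indices(clusters: Sequence[Cluster], min_size: int) -> List[int]:
    """Indices of points in clusters smaller than the user-configured threshold."""
    return sorted(i for c in clusters
                  if len(c.members) < min_size
                  for i in c.members)
```

Calling `fixed_width_clustering` on the points `(0, 0)`, `(0.1, 0)`, `(0, 0.1)`, `(5, 5)`, `(0.05, 0.05)` with `width=1.0` yields two clusters, and `anomalous_indices(clusters, min_size=2)` then flags only the isolated point at index 3. Because clusters are formed in one pass, the result depends on the order of the input.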
 