frequently than others. Intuitively speaking, very frequent neighbors, or hubs, dominate the neighbor sets and therefore, in the context of similarity-based learning, they represent the centers of influence within the data. In contrast to hubs, there are rarely occurring neighbor instances that contribute little to the analytic process. We will refer to them as orphans or anti-hubs.
In order to express hubness in a more precise way, for a time series dataset $D$ one can define the $k$-occurrence of a time series $x$ from $D$, denoted by $N_k(x)$, as the number of time series in $D$ having $x$ among their $k$ nearest neighbors:

$$
N_k(x) = \bigl|\{\, x_i \mid x \in N_k(x_i) \,\}\bigr| .
\tag{11.6}
$$

(With a slight abuse of notation, $N_k(x_i)$ on the right-hand side denotes the set of the $k$ nearest neighbors of $x_i$.)
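To make the definition concrete, here is a minimal sketch of how $N_k(x)$ could be computed for a whole dataset by brute force. The Euclidean distance and all names (k_occurrences, D, k) are illustrative assumptions rather than the chapter's prescribed implementation; in practice an elastic measure such as DTW is often preferred for time series.

```python
import numpy as np

def k_occurrences(D, k):
    """Compute N_k(x) of Eq. (11.6) for every series in D.

    D : ndarray of shape (n, length), n equal-length time series.
    Returns N, where N[j] is the number of series that have
    series j among their k nearest neighbors.
    """
    n = len(D)
    # Pairwise Euclidean distances; an elastic measure such as DTW
    # could be substituted here.
    dists = np.linalg.norm(D[:, None, :] - D[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)  # a series is not its own neighbor

    N = np.zeros(n, dtype=int)
    for i in range(n):
        for j in np.argsort(dists[i])[:k]:  # k nearest neighbors of x_i
            N[j] += 1                       # x_j occurs in x_i's list
    return N
```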
With the term hubness we refer to the phenomenon that the distribution of $N_k(x)$ becomes significantly skewed to the right. We can measure this skewness, denoted by $S_{N_k(x)}$, with the standardized third moment of $N_k(x)$:

$$
S_{N_k(x)} = \frac{E\Bigl[\bigl(N_k(x) - \mu_{N_k(x)}\bigr)^3\Bigr]}{\sigma_{N_k(x)}^3} ,
\tag{11.7}
$$

where $\mu_{N_k(x)}$ and $\sigma_{N_k(x)}$ are the mean and standard deviation of the distribution of $N_k(x)$. When $S_{N_k(x)}$ is higher than zero, the corresponding distribution is skewed to the right and starts presenting a long tail. It should be noted, though, that the occurrence distribution skewness is only one indicator statistic and that distributions with the same or similar skewness can still take different shapes.
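Assuming the k_occurrences sketch above, the skewness of Eq. (11.7) can be estimated directly from the empirical $k$-occurrence values; scipy.stats.skew computes exactly this standardized third moment.

```python
from scipy.stats import skew

N = k_occurrences(D, k)  # k-occurrence values from the sketch above
S = skew(N)              # standardized third moment, Eq. (11.7)
# S > 0 indicates a right-skewed, long-tailed occurrence distribution,
# i.e., the dataset exhibits hubness.
print("hubness (skewness):", S)
```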
In the presence of class labels, we distinguish between good hubness and bad hubness: we say that the time series $x'$ is a good $k$-nearest neighbor of the time series $x$ if (i) $x'$ is one of the $k$ nearest neighbors of $x$, and (ii) both have the same class label. Similarly, we say that the time series $x'$ is a bad $k$-nearest neighbor of the time series $x$ if (i) $x'$ is one of the $k$ nearest neighbors of $x$, and (ii) they have different class labels. This allows us to define the good (bad) $k$-occurrence of a time series $x$, $GN_k(x)$ ($BN_k(x)$, respectively), which is the number of other time series that have $x$ as one of their good (bad, respectively) $k$-nearest neighbors. For time series, both distributions $GN_k(x)$ and $BN_k(x)$ are usually skewed, as exemplified in Fig. 11.4, which depicts the distributions of $GN_1(x)$ and $BN_1(x)$ for some time series datasets from the UCR time series dataset collection [28]. As shown, the distributions have long tails in which the good hubs occur.
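Given class labels, $GN_k(x)$ and $BN_k(x)$ can be obtained by splitting each neighbor occurrence according to label agreement. As before, this is only a sketch under the assumption of equal-length series and Euclidean distance; the names are illustrative.

```python
import numpy as np

def good_bad_occurrences(D, y, k):
    """Compute GN_k(x) and BN_k(x) for every series in D.

    An occurrence of x_j in the neighbor list of x_i is good
    if y[j] == y[i], and bad otherwise.
    """
    n = len(D)
    dists = np.linalg.norm(D[:, None, :] - D[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)  # a series is not its own neighbor

    GN = np.zeros(n, dtype=int)
    BN = np.zeros(n, dtype=int)
    for i in range(n):
        for j in np.argsort(dists[i])[:k]:  # k nearest neighbors of x_i
            if y[j] == y[i]:
                GN[j] += 1  # x_j is a good neighbor of x_i
            else:
                BN[j] += 1  # x_j is a bad neighbor of x_i
    return GN, BN
```

Note that $GN_k(x) + BN_k(x) = N_k(x)$ for every $x$, since each neighbor occurrence is either good or bad.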
We say that a time series $x$ is a good (or bad) hub if $GN_k(x)$ (or $BN_k(x)$, respectively) is exceptionally large for $x$. For the nearest neighbor classification of time series, the skewness of good occurrence is of major importance, because a few time series are responsible for a large portion of the overall error: bad hubs tend to misclassify a surprisingly large number of other time series [43]. Therefore, one has to take into account the presence of good and bad hubs in time series datasets.
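Because a handful of bad hubs account for much of the error, it can be instructive to rank the series by $BN_k(x)$; the snippet below, reusing the hypothetical good_bad_occurrences sketch above, lists the strongest bad hubs.

```python
GN, BN = good_bad_occurrences(D, y, k)  # from the sketch above
worst = np.argsort(BN)[::-1][:5]        # indices of the five worst bad hubs
for j in worst:
    print(f"series {j}: BN_k = {BN[j]}, GN_k = {GN[j]}")
```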
While the $k$NN classifier is frequently used for time series classification, the $k$-nearest neighbor approach is also well suited for learning under class imbalance [16, 20, 21], therefore