the most frequent sequences of regions visited by the users with their traveling
time. The method we propose adjusts the parameters based on the analysis of
the mining results. The idea is to repeat the mining task with different
parameter values, steering toward the analyst's goal according to the
characteristics of the resulting patterns. Depending on the resulting set of
patterns, an action must therefore be taken, as summarized here.
The result set may be:
Small and containing useful patterns: In this case, the objective of the analyst
is reached.
Too big, or the algorithm does not terminate: In this case, the support threshold
is probably too low and too many regions become frequent, leading to an
explosion of patterns. There are three possible solutions: (1) raise the
support threshold, (2) review the set of regions in order to reduce it, or
(3) increase the time tolerance so that more patterns are merged together.
Small, but time intervals are trivial: The time tolerance is too high and makes
the patterns too inclusive, leading to trivial ones. We need to lower the time
tolerance.
Small, but the sequences of regions are trivial: In this case, either the support
threshold is too high and the real patterns remain hidden in the data, or the
set of regions is not meaningful. Some regions could be too large, in which case
they can be split at a finer granularity, leading to better differentiation in
the resulting patterns.
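The four cases above can be read as a small decision table. The following is only a minimal sketch of that table, not the authors' implementation: the names Outcome, Params, and adjust are hypothetical, the multiplicative step is an assumed update rule, and the diagnosis of the result set is still left to the analyst. For the "too big" case the sketch applies only the first of the three listed remedies (raising the support threshold).

```python
from dataclasses import dataclass, replace
from enum import Enum, auto


class Outcome(Enum):
    """Analyst's diagnosis of a mining run (hypothetical labels)."""
    USEFUL = auto()             # small and containing useful patterns
    TOO_BIG = auto()            # too big, or the run does not terminate
    TRIVIAL_TIMES = auto()      # small, but time intervals are trivial
    TRIVIAL_SEQUENCES = auto()  # small, but region sequences are trivial


@dataclass(frozen=True)
class Params:
    min_support: float     # support threshold (assumed to be a fraction)
    time_tolerance: float  # time tolerance (units are an assumption)


def adjust(params: Params, outcome: Outcome, step: float = 0.5) -> Params:
    """Return the next parameter setting to try for a diagnosed outcome."""
    if outcome is Outcome.USEFUL:
        return params  # objective reached, nothing to change
    if outcome is Outcome.TOO_BIG:
        # Remedy (1): raise the support threshold.
        return replace(params, min_support=params.min_support * (1 + step))
    if outcome is Outcome.TRIVIAL_TIMES:
        # Lower the time tolerance.
        return replace(params, time_tolerance=params.time_tolerance * (1 - step))
    if outcome is Outcome.TRIVIAL_SEQUENCES:
        # Lower the support threshold; redefining the regions is the
        # alternative remedy and is not modeled here.
        return replace(params, min_support=params.min_support * (1 - step))
    raise ValueError(f"unknown outcome: {outcome!r}")
```

In use, the analyst would run the miner, label the result with one of the outcomes, call adjust, and repeat until the outcome is USEFUL. Remedies that change the set of regions fall outside this sketch because they alter the input rather than a numeric parameter.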
When a reasonable result is obtained, the analyst can apply pruning in the
postprocessing phase to remove some of the patterns, considering additional
properties such as the number of regions in a T-pattern. Parameter setting
in any data-mining algorithm is recognized in the literature as an open issue,
and finding the optimal solution is far from trivial. However, having a
methodology to drive the parameter setting is a first step in searching for a
good solution. Naturally, in some cases an algorithm may be oversensitive
to parameter changes, making it extremely difficult to find a good parameter
setting.
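The postprocessing pruning mentioned above can be illustrated with a short filter on the number of regions in each pattern. This is a sketch under stated assumptions: the TPattern record and its fields are a hypothetical representation, and the text does not say whether short or long patterns should be dropped, so both bounds are left to the analyst.

```python
from typing import List, NamedTuple, Optional, Tuple


class TPattern(NamedTuple):
    """Hypothetical record for a mined pattern."""
    regions: Tuple[str, ...]  # ordered sequence of visited regions
    support: int              # number of supporting trajectories


def prune_by_region_count(
    patterns: List[TPattern],
    min_regions: int = 1,
    max_regions: Optional[int] = None,
) -> List[TPattern]:
    """Keep patterns whose number of regions lies within the given bounds."""
    kept = []
    for p in patterns:
        n = len(p.regions)
        if n < min_regions:
            continue  # too short to be of interest
        if max_regions is not None and n > max_regions:
            continue  # too long to be of interest
        kept.append(p)
    return kept


# Made-up example patterns, for illustration only.
mined = [
    TPattern(("A", "B"), 120),
    TPattern(("A", "B", "C"), 45),
    TPattern(("A", "B", "C", "D"), 12),
]
longer_only = prune_by_region_count(mined, min_regions=3)
```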
The problem of finding a good initial parameter configuration is also worth
discussing: the analyst can simply start from a reasonable or random set of
thresholds and then tune the parameters as described earlier. Another,
smarter possibility is to estimate the parameters by considering the critical
steps of the algorithm. Consider again the basic step of the T-pattern algorithm:
the detection of frequent regions in the area under analysis makes the support
threshold the most influential parameter for the whole process. We present a
heuristic, data-driven method to estimate the value of this threshold. It is
based on the cumulative frequency distribution of trajectories in the spatial grid
cells. An example on the Milano data set is shown in Figure 7.2a. The points