Using a data mining algorithm in a knowledge discovery process is not
straightforward: choosing the best algorithm and the best parameter settings
to extract meaningful and useful patterns is usually difficult, even for an
expert analyst.
In this section we introduce a set of techniques, demonstrated with examples
using M-Atlas, to guide a user through the mobility knowledge discovery process
by optimizing the data analysis and tuning the parameter settings. The techniques
introduced here have been tailored to the case of mobility data, although they
can be applied to general data mining.
7.2.1 Data Preprocessing
In this section we present some data preprocessing techniques useful in mobility
knowledge discovery, illustrating them through the use of M-Atlas.
Data Validation
Data validation is a necessary step to measure how consistent the trajectory data
set we are going to analyze is with real-world phenomena, and how representative
of them it is. Here we consider the data already cleaned and reconstructed as
described in Chapter 2. However, the reconstruction step does not eliminate all
possible imperfections in the data, and higher-level errors may still exist.
These arise from bias in the data (e.g., tracking only a certain category of
users) or technological problems (e.g., an area where the devices don't work)
that can produce unusual and unwanted effects on the analysis results. To assess
the significance of a data set as a proxy of the real mobility phenomena within
a certain area, the trajectory data set (as a set of spatio-temporal points) can be
compared against a “ground truth” such as survey data composed of a set of
interviews about mobility habits, conducted for example by phone (or other forms
of a priori knowledge). However, an important issue to consider in this
comparison is how the two data sets differ. First, the populations differ: a
data set coming from a set of private cars covers only vehicular movements,
whereas surveys usually include all kinds of movement, including pedestrians
and public transportation. Second, the automatic collection procedure and the
cleaning step applied to the car data set ensure that all movements are correctly
captured, whereas surveys leave room for omissions or distortions. Finally, the
car data provide no explicit semantic information about the purpose of movements,
such as the final destination and profiles of the citizens involved, whereas surveys
explicitly collect this information. A significant difference also holds for the size
of the sample, which can alter the reality represented in the data set. A method
that can help to understand whether the data are consistent with the ground truth is
to replicate a statistical analysis for each data set and compare the results. This
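
The idea of replicating one statistic on both data sets and comparing the
results can be sketched as follows. This is a minimal illustration, not the
M-Atlas procedure: the choice of statistic (share of trips starting in each
hour of the day), the distance measure (total variation distance), the function
names, and the toy data are all assumptions made for the example. Normalizing
counts to shares keeps the comparison meaningful when the two samples differ
in size.

```python
from collections import Counter


def hourly_distribution(start_hours):
    """Return the share of trips starting in each hour of the day (0-23)."""
    if not start_hours:
        raise ValueError("at least one trip is required")
    counts = Counter(start_hours)
    total = len(start_hours)
    return [counts.get(hour, 0) / total for hour in range(24)]


def total_variation(p, q):
    """Total variation distance between two discrete distributions.

    0 means the distributions are identical, 1 means they do not overlap.
    """
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


# Hypothetical toy data: start hour of each trip in the two data sets.
gps_hours = [7, 8, 8, 9, 17, 18, 18, 19]        # trajectory data set
survey_hours = [7, 8, 8, 12, 13, 17, 18, 18]    # survey "ground truth"

gps_dist = hourly_distribution(gps_hours)
survey_dist = hourly_distribution(survey_hours)
distance = total_variation(gps_dist, survey_dist)
print(f"Total variation distance: {distance:.3f}")
```

A small distance suggests that the two data sets agree on this statistic; a
large one points to the hours where the population or collection differences
discussed above may be at work.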