Data Mining Step
Data mining is the core step of the process and consists of executing mining
algorithms, such as those presented in Chapter 6. M-Atlas realizes
this step with a mining statement:
CREATE MODEL <model_table> MINE AS <mining_algorithm_name>
FROM (SELECT t.id, t.object
      FROM <trajectories_table> t)
SET <mining_algorithm_name>.<param> = <value> AND ...
As we can see, this statement creates a new model as the result of a mining task.
It specifies the mining algorithm to execute and the selection of trajectories to
which the algorithm is applied. This set is identified by the SELECT statement
on the trajectories table, whose attributes are the ID (t.id) and the trajectory
object (t.object). The SET component defines the algorithm parameters.
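To make the structure of the statement concrete, the following minimal sketch fills in the template from Python. The helper build_mining_statement, the table names, the algorithm name, and the parameter names are all hypothetical placeholders chosen for illustration; they are not taken from M-Atlas.

```python
def build_mining_statement(model_table, algorithm, trajectories_table, params):
    """Fill in the mining-statement template with concrete names.

    All names passed in are illustrative; M-Atlas defines the real
    algorithm names and their parameters.
    """
    # One "<algorithm>.<param> = <value>" term per parameter, joined by AND.
    set_clause = " AND ".join(
        f"{algorithm}.{name} = {value}" for name, value in params.items()
    )
    return (
        f"CREATE MODEL {model_table} MINE AS {algorithm}\n"
        f"FROM (SELECT t.id, t.object\n"
        f"      FROM {trajectories_table} t)\n"
        f"SET {set_clause}"
    )


# Hypothetical names, used only to show the shape of the result.
statement = build_mining_statement(
    "example_model",
    "example_algorithm",
    "example_trajectories",
    {"min_support": 10, "radius": 500},
)
print(statement)
```

The three parts of the statement map directly onto the arguments: the model table to create, the algorithm to run, and the trajectory selection it runs on, with the SET clause carrying the algorithm parameters.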
Mining a Data Sample
Applying a data mining algorithm to a large trajectory data set may be extremely
time- and memory-consuming, to the point that running the algorithm on the
entire data set is not feasible within the available time or memory. This
problem can be addressed by using the data mining algorithms presented in Chapter 6
in combination with data sampling techniques. In general, sampling is a
technique for reducing the size of the data without altering its statistical properties.
The data can also be sampled using semantic criteria, for example by dividing the
data according to the spatial or temporal characteristics of the trajectories.
Whatever sampling technique the analyst chooses, the important issue is to maintain
the consistency of the data or, at least, to understand exactly the bias introduced,
as this may strongly affect the extracted patterns.
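One way to make the bias of a sample visible is to compare the distribution of some trajectory property in the sample with its distribution in the full data set. The sketch below does this for a toy data set, assuming each trajectory is a dictionary with a hypothetical start_hour field; the field name, the helper functions, and the data are illustrative assumptions, not part of M-Atlas.

```python
import random
from collections import Counter


def hour_shares(trajectories):
    """Fraction of trajectories starting in each hour of the day."""
    counts = Counter(t["start_hour"] for t in trajectories)
    return {hour: n / len(trajectories) for hour, n in counts.items()}


def max_share_gap(full, sample):
    """Largest absolute difference in per-hour share between full data and sample."""
    full_shares = hour_shares(full)
    sample_shares = hour_shares(sample)
    return max(
        abs(share - sample_shares.get(hour, 0.0))
        for hour, share in full_shares.items()
    )


# Toy data: 1000 trajectories spread almost evenly over the 24 hours.
full = [{"id": i, "start_hour": i % 24} for i in range(1000)]

# A random sample of 200 trajectories versus a temporal selection
# that keeps only trajectories starting between 7:00 and 9:59.
rng = random.Random(42)
random_sample = rng.sample(full, 200)
morning_sample = [t for t in full if 7 <= t["start_hour"] <= 9]

random_gap = max_share_gap(full, random_sample)
morning_gap = max_share_gap(full, morning_sample)
print(f"random sample gap:  {random_gap:.3f}")
print(f"morning sample gap: {morning_gap:.3f}")
```

The morning selection over-represents three hours and drops all the others, so its gap is much larger than that of the random sample. Patterns mined from it describe morning traffic only, which is acceptable as long as the analyst is aware of it.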
An example of random sampling realized in M-Atlas is expressed as follows:
CREATE MODEL <model_table> MINE AS <mining_algorithm_name>
FROM (SELECT t.id, t.object
      FROM <trajectories_table> t
      ORDER BY RANDOM()
      LIMIT 20%)
SET <mining_algorithm_name>.<param> = <value> AND ...
Note the ORDER BY RANDOM() clause, which reorders the trajectories randomly,
and the LIMIT clause, which selects only 20% of them. Once the models have been
extracted from the sampled data, we can apply them to the remaining data set to
determine their real support. Chapter 10 presents an example of this technique for
the Milano data set.
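The sample-then-validate workflow can be sketched outside M-Atlas as follows. This is a minimal illustration under stated assumptions: trajectories are (id, sequence of cell labels) pairs, and frequent_moves is a toy stand-in miner that finds frequent consecutive cell pairs. It is not one of the algorithms of Chapter 6, and all function names are hypothetical.

```python
import random
from collections import Counter


def split_sample(trajectories, fraction, seed=0):
    """Shuffle the trajectories and split off the given fraction as a sample."""
    rng = random.Random(seed)
    shuffled = list(trajectories)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * fraction)
    return shuffled[:cut], shuffled[cut:]


def moves(cells):
    """Set of consecutive (from_cell, to_cell) pairs in one trajectory."""
    return set(zip(cells, cells[1:]))


def frequent_moves(trajectories, min_support):
    """Toy miner: moves contained in at least min_support of the trajectories."""
    counts = Counter()
    for _, cells in trajectories:
        counts.update(moves(cells))
    return {m for m, c in counts.items() if c / len(trajectories) >= min_support}


def support(pattern, trajectories):
    """Fraction of trajectories that contain the given move."""
    hits = sum(1 for _, cells in trajectories if pattern in moves(cells))
    return hits / len(trajectories)


# Toy data: even ids travel A -> B -> C, odd ids travel A -> D.
trajectories = [
    (i, ["A", "B", "C"] if i % 2 == 0 else ["A", "D"]) for i in range(100)
]

# Mine on a 20% random sample, then measure support on the remaining 80%.
sample, rest = split_sample(trajectories, 0.20)
patterns = frequent_moves(sample, min_support=0.3)
for pattern in sorted(patterns):
    print(pattern, "support on remaining data:", support(pattern, rest))
```

The support measured on the remaining data can differ from the support observed in the sample; checking it there is what confirms or rejects a pattern found on the sample.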