15.3.2 Data Harmonization Layer
Data harmonization was an essential step for data and knowledge integration. Preprocessing of the downloaded time series was particularly important because of the uncertainty associated with data availability. Each individual time series was identified according to the name of the selected site and the environmental variable. Data were available from the time of deployment. As can be expected in real-world networks, each of the available time series had periods with missing values. For some sensor nodes, there were also a number of Infinite values. Initially, a filter was designed to remove all of the Infinite values and replace them with a "Not a Number" (NaN) string, so that the filtering remained statistically insignificant and the original time frame was left unaltered.
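A minimal sketch of this filtering step, assuming the downloaded time series is held in a pandas Series indexed by timestamp (the function and variable names are illustrative, not taken from the original system):

import numpy as np
import pandas as pd

def filter_infinite_values(series: pd.Series) -> pd.Series:
    # Replace +/-Inf readings with NaN so that the original time frame
    # is preserved and downstream statistics simply ignore the samples.
    return series.replace([np.inf, -np.inf], np.nan)

Because the invalid samples are masked rather than dropped, the length and time stamps of the series stay exactly as recorded by the node.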
Data validation and preprocessing were conducted based on the available knowledge from the sensor and sensor network ontologies. Preprocessed time series data were batch processed and represented as daily averaged data. Data from different sources measuring the same environmental attribute were harmonized and cross-validated against each other. Similarly, different measured attributes from the same node were also harmonized to the daily average. This step supported the evaluation and data visualization processes by reducing the number of data points and by resolving issues related to differing data logging frequencies. It also compressed the data to a certain extent without losing any daily observation characteristics. The final outcome of this layer was multisource environmental time series data that were harmonized, unit converted where required, and semantically integrated into a single structure on a daily scale.
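As an illustration of the daily averaging and multisource integration described above, the following sketch, assuming pandas Series indexed by timestamp and hypothetical source names, resamples each preprocessed series to a daily mean and aligns them in a single daily-scale structure:

import pandas as pd

def harmonize_daily(series_by_source: dict[str, pd.Series]) -> pd.DataFrame:
    # Resample every source to daily means, then align the results on a
    # common daily index (one column per source/attribute).
    daily = {name: s.resample("D").mean() for name, s in series_by_source.items()}
    return pd.DataFrame(daily)

# Hypothetical usage: water temperature from a buoy and a simulation model
# harmonized onto the same daily time axis (unit conversion, if required,
# would be applied to the individual series before this step).
# daily_frame = harmonize_daily({
#     "buoy_water_temp_C": buoy_series,
#     "model_water_temp_C": model_series,
# })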
15.3.3 Semantic Cross-Validation Layer
Semantic representations are usually intended as a medium for conveying meaning about some world or environment. A knowledge representation must therefore have a semantic theory that provides an account of how a particular representation corresponds to the external world or environment. Preprocessed data were cross-validated using semantic metadata matching and statistical cross-correlation calculation. Metadata are "data about the data" and can provide descriptions of the what, where, who, and how of the data [5,53,61]. For example, sensor node metadata could describe when and where the sensor node was deployed, who deployed that node, which environmental attributes are being measured, what the key semantic features or characteristics of that particular sensory system are, and, finally, the valid range of measurement that could be expected. In general, metadata are used to describe the principal aspects of data with the aim of sharing, reusing, and understanding heterogeneous data sets. Different types of sensor or sensor-simulation model metadata may be considered, namely, static and dynamic sensor metadata and associated sensing information. Based on natural language processing and sensor-model ontologies, a cross-validation layer was created. Ideally, all similar environmental variables from different data sources should be able to cross-validate each other statistically, because representative similar variables for the same location and the same time frame should be statistically very similar. Variables were semantically matched according to their units, the attributes they measure, and the context of the semantically
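One plausible reading of the statistical cross-correlation calculation, sketched below under the assumption that two semantically matched variables are available as daily pandas Series (names and the agreement threshold are hypothetical), is a Pearson correlation computed over the overlapping time frame of the two series:

import pandas as pd

def cross_validate(series_a: pd.Series, series_b: pd.Series,
                   threshold: float = 0.8) -> tuple[float, bool]:
    # Align the two semantically matched daily series on their common
    # time frame, drop missing values, and compare them statistically.
    aligned = pd.concat([series_a, series_b], axis=1, join="inner").dropna()
    r = aligned.iloc[:, 0].corr(aligned.iloc[:, 1])  # Pearson correlation
    return r, r >= threshold

A high correlation supports the semantic match, whereas a low value flags the pair for closer inspection of units, metadata, or sensor behavior.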