Recommending Environmental Big Data Using Semantically Guided Machine Learning - Large Scale and Big Data: Processing and Management


extracted feature attributes, and scientific meaningfulness, to form several variable subgroups, that is, SILO rainfall (mm/day), SILO rainfall rate (mm/hr), AWAP rainfall (mm/day), CosmOz rainfall (mm/day), and the MODIS post-real-time TRMM Multi-Satellite Precipitation Analysis (TMPA) product (mm/day). A combination of this kind formed a pool of similar variables that could cross-validate or complement each other when values were missing from a particular time series within that pool.

The complementary method identified the missing-value segments of a time series and replaced those segments with an average segment computed from the other available time series in the same pool. This was done to model a missing data segment as a semantic attribute. Sensor model ontologies were used in this step to establish the correct meaning of each time series and so avoid any wrong complement. Next, a “cross-correlation technique” was used to measure the similarity between two complemented time series signals representing similar scenarios (in terms of location and time period).
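The complementary method can be illustrated with a minimal sketch. It assumes that every series in a pool is already aligned on the same dates and expressed in the same unit (the ontology check described above is not modeled here), that missing values are encoded as NaN, and that the "average segment" is a per-date mean of the other pool members. The helper name `fill_from_pool` and the sample values are hypothetical and do not come from the source.

```python
import math

def fill_from_pool(target, pool):
    """Replace missing values (NaN) in `target` with the pool average.

    `target` is a list of floats; `pool` is a list of other series of the
    same length, aligned on the same dates. Hypothetical helper.
    """
    filled = list(target)
    for k, value in enumerate(target):
        if not math.isnan(value):
            continue  # observed value, keep it
        # Average whatever the other pool members observed on this date.
        available = [s[k] for s in pool if not math.isnan(s[k])]
        if available:
            filled[k] = sum(available) / len(available)
        # Otherwise leave NaN: no pool member can complement this date.
    return filled

nan = float("nan")
silo = [1.0, nan, nan, 4.0]      # illustrative values only
awap = [1.2, 2.0, nan, 3.8]
cosmoz = [0.8, 3.0, nan, 4.2]

filled = fill_from_pool(silo, [awap, cosmoz])
print(filled)
```

In this sketch a date that is missing in every pool member stays missing, which keeps the gap visible to later processing instead of inventing a value.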

The other purpose of this layer was to cross-validate similar time series data in the same pool to find a representative time series from that pool [10,28,29]. If the two signals being compared were completely identical, the cross-correlation coefficient should equal 1, and if there were no significant similarities between the signals, it should be close to 0. A scoring protocol was designed on the cross-correlation results. The time series with the highest score was selected from each subgroup as the best representative of the associated environmental variable for that time period. The selected time series from all attribute pools were stored in an integrated structured array in which columns represented different variables and rows represented time frames.
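The selection step can be sketched as follows. The source does not specify the scoring protocol, so this example makes two assumptions: the cross-correlation coefficient is the zero-lag normalized cross-correlation of two gap-filled series of equal length, and a series' score is its mean coefficient against the other members of its pool. The function names and sample values are hypothetical.

```python
import math

def xcorr_coefficient(a, b):
    """Zero-lag normalized cross-correlation of two equal-length series."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    num = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    den = math.sqrt(sum((x - mean_a) ** 2 for x in a)
                    * sum((y - mean_b) ** 2 for y in b))
    # A constant series has zero variance; treat it as uncorrelated.
    return num / den if den else 0.0

def pick_representative(pool):
    """Return (best_name, scores) for a dict of name -> series.

    Assumed scoring rule: mean coefficient against all other pool members.
    """
    scores = {}
    for name, series in pool.items():
        others = [s for other, s in pool.items() if other != name]
        total = sum(xcorr_coefficient(series, s) for s in others)
        scores[name] = total / len(others) if others else 0.0
    best = max(scores, key=scores.get)
    return best, scores

pool = {  # illustrative values only
    "silo":   [0.0, 2.0, 5.0, 1.0, 0.0],
    "awap":   [0.1, 2.2, 4.8, 1.1, 0.0],
    "cosmoz": [3.0, 0.0, 1.0, 4.0, 2.0],
}
best, scores = pick_representative(pool)
print(best, scores)
```

With these sample values the two series that track each other closely score well, while the outlier scores poorly and is not selected.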

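\[
R =
\begin{bmatrix}
x_{11}(t) & x_{12}(t) & \cdots & x_{1m}(t) \\
x_{21}(t) & x_{22}(t) & \cdots & x_{2m}(t) \\
\vdots & \vdots & x_{ij}(t) & \vdots \\
x_{n1}(t) & x_{n2}(t) & \cdots & x_{nm}(t)
\end{bmatrix}
\tag{15.1}
\]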

The integrated data were represented as a response matrix $R$, where $x_{ij}(t)$ represents the daily value of variable $i$ on date $j$, which is the $j$th location on the common time frame (Equation 15.1).
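As a rough illustration, the selected series could be assembled into a matrix like this. The sketch assumes each selected series is a mapping from date to daily value, takes the common time frame to be the set of dates present in every series, and follows the layout stated earlier in the text (rows as time frames, columns as variables). The function name, variable names, dates, and values are all hypothetical.

```python
def build_response_matrix(selected):
    """Return (dates, names, rows) for a dict of name -> {date: value}.

    Rows follow the common dates; columns follow the sorted variable names.
    """
    names = sorted(selected)
    # Common time frame: dates observed in every selected series (assumption).
    common = set.intersection(*(set(selected[n]) for n in names))
    dates = sorted(common)
    rows = [[selected[n][d] for n in names] for d in dates]
    return dates, names, rows

selected = {  # illustrative values only
    "rainfall": {"2012-01-01": 0.0, "2012-01-02": 2.0, "2012-01-03": 5.0},
    "soil_moisture": {"2012-01-02": 0.31, "2012-01-03": 0.35, "2012-01-04": 0.33},
}
dates, names, R = build_response_matrix(selected)
for date, row in zip(dates, R):
    print(date, row)
```

Dates that are not shared by every variable are dropped here; a real pipeline might instead keep them and mark the gaps.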

15.3.4 Feature Representation Layer

An important issue with multidimensional Big Data sources is optimal feature extraction to represent the knowledge in fewer dimensions. Data mining or unsupervised machine learning techniques are widely used for feature extraction in the physical, chemical, and environmental sciences [21,34,54,73]. The purpose of this layer was to preprocess the time series matrix and extract sets of semantic features from it, creating a reduced, semantically enriched representation in place of the full-size input, so that the most relevant and significant information in the input data would be captured to solve the multivariate problem. The general multivariate problem in large-scale environmental sensing is commonly referred to
