4.5 Crawling and Searching the Internet of Things
The Internet of Things marks the beginning of the data-centric web era, in which data about events, locations, or people is collected by the sensor infrastructure and richly described in the form of RDF meta-data. Therefore, it is natural to move to the next stage of smart semantic web search, where data and services about arbitrary “things” such as people, events, and locations can be easily accessed. Providing
such search functionality will be extremely challenging, because the size
of the semantic web continues to grow rapidly, and is expected to be sev-
eral orders of magnitude larger than the conventional web. This leads to
numerous challenges in crawling, indexing, and retrieving search results
on the semantic web. While the RDF framework solves the representa-
tion issues for effective search and indexing, the data scalability issue
continues to be an enormous challenge. Nevertheless, such functionality is critical, because search engines can locate the data and services
that other applications may need in an M2M world.
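As a concrete illustration (not drawn from the text), the following Python sketch uses the rdflib library to describe a single sensor reading as RDF triples; the example.org vocabulary and all property names are invented for this example.

# A toy illustration of how a sensor observation might be described
# as RDF meta-data. The vocabulary below is hypothetical.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF, XSD

EX = Namespace("http://example.org/iot/")  # invented vocabulary

g = Graph()
reading = URIRef("http://example.org/iot/reading/42")

g.add((reading, RDF.type, EX.TemperatureObservation))
g.add((reading, EX.observedBy, EX.sensor17))
g.add((reading, EX.location, Literal("Building A, Room 101")))
g.add((reading, EX.celsius, Literal(21.5, datatype=XSD.double)))
g.add((reading, EX.timestamp,
       Literal("2012-06-01T12:00:00Z", datatype=XSD.dateTime)))

print(g.serialize(format="turtle"))

Triples of this kind, once published, are exactly the meta-data that a semantic search engine would need to crawl and index.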
Some early frameworks for semantic web search may be found in
[44, 72]. Some real implementations of meta-data search engines are
Swoogle [35, 129] and Sindice [102, 127]. Among these different frame-
works and implementations, only the last one is recent enough to in-
corporate the full advantages of the MapReduce framework. Generally
speaking, since the semantic web resembles the conventional web in its linked structure, algorithms similar to PageRank can be implemented within a MapReduce framework for efficient retrieval.
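As a rough sketch of the idea, one PageRank iteration can be expressed as a map phase that emits rank shares along out-links and a reduce phase that aggregates them; the toy graph, damping factor, and function names below are illustrative only, not taken from any system described here.

# A minimal sketch of one PageRank iteration in MapReduce style.
from collections import defaultdict

DAMPING = 0.85

def map_phase(node, rank, out_links):
    # Preserve the adjacency list and emit a rank share per out-link.
    yield node, ("links", out_links)
    for target in out_links:
        yield target, ("rank", rank / len(out_links))

def reduce_phase(node, values, num_nodes):
    # Sum incoming rank shares and apply the damping formula.
    incoming = sum(v for tag, v in values if tag == "rank")
    return (1 - DAMPING) / num_nodes + DAMPING * incoming

graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}  # toy link graph
ranks = {n: 1.0 / len(graph) for n in graph}

for _ in range(20):  # iterate until the ranks stabilize
    grouped = defaultdict(list)
    for node, out_links in graph.items():
        for key, value in map_phase(node, ranks[node], out_links):
            grouped[key].append(value)
    ranks = {node: reduce_phase(node, values, len(graph))
             for node, values in grouped.items()}

print(ranks)

On a real cluster, the map and reduce phases would be distributed across machines and iterated until convergence; the loop above merely simulates this on a single machine.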
The semantic web may require slightly more sophisticated indexing algorithms than the conventional web, because it accommodates a richer variety of link types. Other tasks, such as crawling, also proceed much as they do on the conventional web, in that the linkage structure guides the crawling
process. Again, some additional intelligence may be incorporated into
the crawling process, depending upon the importance of different links
and crawling strategies for resource discovery.
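As an illustration of such a strategy, the sketch below implements a priority-driven crawl loop in which links judged more important are fetched first; fetch_page, extract_links, and score_link are hypothetical caller-supplied helpers, not the actual SindiceBot implementation.

# A sketch of a priority-driven crawler; the helper functions are
# hypothetical placeholders supplied by the caller.
import heapq

def crawl(seed_urls, fetch_page, extract_links, score_link,
          max_pages=1000):
    # Scores are negated because heapq is a min-heap, so the
    # highest-scoring (most important) link is fetched first.
    frontier = [(-score_link(url), url) for url in seed_urls]
    heapq.heapify(frontier)
    seen = set(seed_urls)
    fetched = []
    while frontier and len(fetched) < max_pages:
        _, url = heapq.heappop(frontier)  # most promising link first
        document = fetch_page(url)
        fetched.append((url, document))
        for link in extract_links(document):
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-score_link(link), link))
    return fetched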
A very recent large-scale framework for search and indexing of the semantic web is Sindice [102, 127]. We will discuss this engine in more detail, because a high level of scalability is incorporated in all aspects of its design. In particular, this scalability is achieved with the use of the MapReduce framework. The first step is to harvest the web with a crawler called SindiceBot, which collects web and RDF documents. This crawler utilizes
Hadoop in order to distribute the crawling job across multiple machines.
An extension to the Sitemap protocol [128] allows the data sets to be announced directly by their publishers, so that large data dumps can be discovered and fetched efficiently.