4.5 Crawling and Searching the Internet of Things
The Internet of Things marks the beginning of the data-centric web era, in which data about events, locations, or people is collected by the sensor infrastructure and richly described in the form of RDF meta-data. Therefore, it is natural to move to the next stage of smart semantic web search, where data and services about arbitrary “things” such as people, events, and locations can be easily accessed. Providing
such search functionality will be extremely challenging, because the size
of the semantic web continues to grow rapidly, and is expected to be sev-
eral orders of magnitude larger than the conventional web. This leads to
numerous challenges in crawling, indexing, and retrieving search results
on the semantic web. While the RDF framework solves the representa-
tion issues for effective search and indexing, the data scalability issue
continues to be an enormous challenge. Nevertheless, such functionality is critical, because search engines can locate the data and services
that other applications may need in an M2M world.
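As a concrete illustration (not drawn from the text), the following Python sketch uses the rdflib library to describe a single sensor reading as RDF triples; the example.org vocabulary and all property names are invented for this example.

# A toy illustration of how a sensor observation might be described
# as RDF meta-data. The vocabulary below is hypothetical.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF, XSD

EX = Namespace("http://example.org/iot/")  # invented vocabulary

g = Graph()
reading = URIRef("http://example.org/iot/reading/42")

g.add((reading, RDF.type, EX.TemperatureObservation))
g.add((reading, EX.observedBy, EX.sensor17))
g.add((reading, EX.location, Literal("Building A, Room 101")))
g.add((reading, EX.celsius, Literal(21.5, datatype=XSD.double)))
g.add((reading, EX.timestamp,
       Literal("2012-06-01T12:00:00Z", datatype=XSD.dateTime)))

print(g.serialize(format="turtle"))

Triples of this kind, once published, are exactly the meta-data that a semantic search engine would need to crawl and index.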
Some early frameworks for semantic web search may be found in
[44, 72]. Some real implementations of meta-data search engines are
Swoogle [35, 129] and Sindice [102, 127]. Among these different frame-
works and implementations, only the last one is recent enough to in-
corporate the full advantages of the MapReduce framework. Generally
speaking, since the semantic web resembles the conventional web in its linked structure, algorithms similar to PageRank can be implemented within a MapReduce framework for efficient retrieval.
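As a rough sketch of the idea, one PageRank iteration can be expressed as a map phase that emits rank shares along out-links and a reduce phase that aggregates them; the toy graph, damping factor, and function names below are illustrative only, not taken from any system described here.

# A minimal sketch of one PageRank iteration in MapReduce style.
from collections import defaultdict

DAMPING = 0.85

def map_phase(node, rank, out_links):
    # Preserve the adjacency list and emit a rank share per out-link.
    yield node, ("links", out_links)
    for target in out_links:
        yield target, ("rank", rank / len(out_links))

def reduce_phase(node, values, num_nodes):
    # Sum incoming rank shares and apply the damping formula.
    incoming = sum(v for tag, v in values if tag == "rank")
    return (1 - DAMPING) / num_nodes + DAMPING * incoming

graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}  # toy link graph
ranks = {n: 1.0 / len(graph) for n in graph}

for _ in range(20):  # iterate until the ranks stabilize
    grouped = defaultdict(list)
    for node, out_links in graph.items():
        for key, value in map_phase(node, ranks[node], out_links):
            grouped[key].append(value)
    ranks = {node: reduce_phase(node, values, len(graph))
             for node, values in grouped.items()}

print(ranks)

On a real cluster, the map and reduce phases would be distributed across machines and iterated until convergence; the loop above merely simulates this on a single machine.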
The semantic web may require slightly more sophisticated indexing algorithms than the conventional web, because it accommodates a richer variety of link types. Other tasks, such as crawling, also proceed much as they do on the conventional web, in that the linkage structure guides the crawling
process. Again, some additional intelligence may be incorporated into
the crawling process, depending upon the importance of different links
and crawling strategies for resource discovery.
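As an illustration of such a strategy, the sketch below implements a priority-driven crawl loop in which links judged more important are fetched first; fetch_page, extract_links, and score_link are hypothetical caller-supplied helpers, not the actual SindiceBot implementation.

# A sketch of a priority-driven crawler; the helper functions are
# hypothetical placeholders supplied by the caller.
import heapq

def crawl(seed_urls, fetch_page, extract_links, score_link,
          max_pages=1000):
    # Scores are negated because heapq is a min-heap, so the
    # highest-scoring (most important) link is fetched first.
    frontier = [(-score_link(url), url) for url in seed_urls]
    heapq.heapify(frontier)
    seen = set(seed_urls)
    fetched = []
    while frontier and len(fetched) < max_pages:
        _, url = heapq.heappop(frontier)  # most promising link first
        document = fetch_page(url)
        fetched.append((url, document))
        for link in extract_links(document):
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-score_link(link), link))
    return fetched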
A very recent large-scale framework for search and indexing of the semantic web is Sindice [102, 127]. We will discuss this engine in more detail, because a high level of scalability is incorporated in all aspects of its design. In particular, this scalability is achieved with the use of the MapReduce framework. The first step is to harvest the web with a crawler called SindiceBot, which collects web and RDF documents. This crawler utilizes
Hadoop in order to distribute the crawling job across multiple machines.
An extension to the Sitemap protocol [128] allows the data sets to be announced directly by their publishers, so that large data dumps can be discovered and fetched efficiently.