Collecting Data with Nutch and Solr - Big Data Made Easy: A Working Guide to the Complete Hadoop Toolset


to ZooKeeper but the connection closes immediately. This could be a sign that the server has too many connections (30 is the default). Consider inspecting your ZK server logs for that error and then make sure you are reusing HBaseConfiguration as often as you can. See HTable's javadoc for more information.

Make sure that your /etc/hosts file entries are defined correctly and that ZooKeeper is working correctly before you move on to start HBase.
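Both checks can be scripted. The following is a minimal sketch rather than a definitive tool: the function names and the specific warnings are illustrative assumptions, and the ZooKeeper probe assumes the server listens on the default client port 2181 and answers the `ruok` four-letter command (newer ZooKeeper releases may require that command to be explicitly enabled).

```python
import socket


def parse_hosts(text):
    """Map each hostname in /etc/hosts-style text to the list of IPs it is bound to."""
    mapping = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and surrounding whitespace
        if not line:
            continue
        ip, *names = line.split()
        for name in names:
            mapping.setdefault(name, []).append(ip)
    return mapping


def suspicious_entries(mapping, hostname):
    """Return warnings for one hostname (illustrative checks, not an exhaustive list)."""
    ips = mapping.get(hostname, [])
    warnings = []
    if not ips:
        warnings.append(f"{hostname} has no entry")
    if len(set(ips)) > 1:
        warnings.append(f"{hostname} maps to several addresses: {sorted(set(ips))}")
    if any(ip.startswith("127.") for ip in ips) and hostname != "localhost":
        warnings.append(f"{hostname} resolves to a loopback address")
    return warnings


def zookeeper_is_ok(host="localhost", port=2181, timeout=3.0):
    """Send the 'ruok' four-letter command; a healthy server answers 'imok'."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(b"ruok")
            return sock.recv(16) == b"imok"
    except OSError:
        # Connection refused, timed out, or host not resolvable.
        return False
```

To use it, read /etc/hosts, pass the text to `parse_hosts`, and call `suspicious_entries` with the name of each cluster node; then call `zookeeper_is_ok()` against each ZooKeeper host. An empty warning list and a `True` result suggest it is reasonable to go on and start HBase.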

As with architecture example 1, you can check the Solr query page for your results. For this example, because the seed URL was from my own site, the query found the apache URLs, as shown in Figure 3-4.

Figure 3-4. Solr output showing the crawl data
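The same check can be made without the browser by calling Solr's select handler directly. This is a sketch under stated assumptions: the base URL (for example `http://localhost:8983/solr`, possibly with a core name appended) depends on your installation, and the `url` field is assumed to be present in the indexed documents, as it is in the schema normally used with Nutch.

```python
import json
import urllib.parse
import urllib.request


def build_select_url(base_url, query="*:*", rows=10):
    """Build a Solr select URL that asks for a JSON response."""
    params = urllib.parse.urlencode({"q": query, "rows": rows, "wt": "json"})
    return f"{base_url.rstrip('/')}/select?{params}"


def summarize(result):
    """Return (total hit count, list of 'url' values) from a parsed Solr response."""
    response = result["response"]
    urls = [doc["url"] for doc in response["docs"] if "url" in doc]
    return response["numFound"], urls


def query_solr(base_url, query="*:*", rows=10):
    """Run the query against a live Solr server and summarize the hits."""
    url = build_select_url(base_url, query, rows)
    with urllib.request.urlopen(url, timeout=10) as resp:
        return summarize(json.load(resp))
```

Calling `query_solr` with your own base URL returns the number of indexed documents and the URLs of the first few hits, which should match what the Solr query page displays.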

A Brief Comparison

Of the two architecture examples given in this chapter, the second, using HBase, was the more difficult to use. This may be the future direction that Nutch is taking, but there are many more configuration items to take care of, more components to check, and more potential points of failure. Having said that, the second architecture example lets you explicitly choose the storage architecture. If at some future date you need to replace HBase with another storage system that Gora supports, you will be able to do that.
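To see which storage back end a Nutch installation is currently set to use, you can read the relevant property from its configuration. The sketch below is an assumption-laden illustration: it assumes the Gora data store class is set through a `storage.data.store.class` property in a Hadoop-style configuration file such as nutch-site.xml, and the sample XML is made up for the example, so check the property name and value against your own files.

```python
import xml.etree.ElementTree as ET


def get_property(site_xml, name):
    """Return the value of a named <property> in Hadoop-style configuration XML."""
    root = ET.fromstring(site_xml)
    for prop in root.iter("property"):
        if prop.findtext("name", "").strip() == name:
            return prop.findtext("value", "").strip()
    return None


# Illustrative sample only; a real nutch-site.xml holds many more properties.
SAMPLE = """<configuration>
  <property>
    <name>storage.data.store.class</name>
    <value>org.apache.gora.hbase.store.HBaseStore</value>
  </property>
</configuration>"""

store = get_property(SAMPLE, "storage.data.store.class")
print(store.rsplit(".", 1)[-1])  # prints HBaseStore
```

Switching to a different Gora-supported store would then mean changing this value and supplying the matching Gora configuration, rather than changing the crawl commands themselves.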

Both of these examples used only Hadoop V1. If time had allowed, it would have been useful to use Hadoop V2 as well. In that case, you would have needed to rebuild both HBase and Nutch using Hadoop V2 libraries. Even so, it would have been interesting to compare the Nutch processing time using Hadoop V1 and V2.
