Collecting Data with Nutch and Solr - Big Data Made Easy: A Working Guide to the Complete Hadoop Toolset

Summary

In this chapter, you investigated big data collection using Nutch, Solr, Gora, and HBase. You used both Nutch 1.x and 2.x to crawl a seed URL and collect data. Although you crawled only on a small scale, the same process can be used to gather large volumes of data. You used Solr in both cases to index data passed from Nutch. In the second example, you used Apache Gora to determine where Nutch would store its data; in this case, it was HBase. You also looked at two possible approaches for using Nutch and Hadoop. In the first, Nutch implicitly used Hadoop for storage; in the second, Nutch used Apache Gora to explicitly select HBase for storage.

Where do you go from here? The command sequence and examples given in this chapter should enable you to apply these approaches to your own system. Take a logical approach, and make sure that HDFS is working before moving on. Also, make sure that ZooKeeper is working before you attempt HBase. Remember: if you encounter errors, search the web for solutions, because other people may have encountered similar problems. Also, keep trying to think of new ways to approach the problem.
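
The advice to verify HDFS first, then ZooKeeper, then HBase can be sketched as a small ordered check that stops at the first layer that fails. This is only an illustrative sketch, not part of the chapter's command sequence. The host name and the port numbers (8020 for the HDFS NameNode, 2181 for ZooKeeper, 60000 for the HBase master) are assumptions that depend on your installation and versions, and the helper names `port_open`, `zookeeper_ok`, and `first_failure` are hypothetical. The ZooKeeper check uses the `ruok` four-letter command, which a healthy server answers with `imok`; newer ZooKeeper releases may need this command enabled in their configuration.

```python
import socket
from typing import Callable, List, Optional, Tuple


def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def zookeeper_ok(host: str = "localhost", port: int = 2181,
                 timeout: float = 2.0) -> bool:
    """Send ZooKeeper's 'ruok' command and expect the reply 'imok'."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(b"ruok")
            return sock.recv(16) == b"imok"
    except OSError:
        return False


def first_failure(checks: List[Tuple[str, Callable[[], bool]]]) -> Optional[str]:
    """Run checks in order; return the name of the first one that fails.

    Later checks are not run once a failure is found, which mirrors the
    idea of fixing each layer before moving on to the next.
    """
    for name, check in checks:
        if not check():
            return name
    return None


if __name__ == "__main__":
    # Ports below are assumptions; adjust them to match your cluster.
    checks = [
        ("HDFS NameNode", lambda: port_open("localhost", 8020)),
        ("ZooKeeper", zookeeper_ok),
        ("HBase master", lambda: port_open("localhost", 60000)),
    ]
    failed = first_failure(checks)
    if failed is None:
        print("All layers responded; it is reasonable to try Nutch next.")
    else:
        print(f"Fix {failed} before moving on.")
```

A successful TCP connection shows only that something is listening on the port, so treat a pass as a first indication and confirm with each component's own tools and logs.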
