Summary
In this chapter, you investigated big data collection using Nutch, Solr, Gora, and HBase. You used both Nutch 1.x and
2.x to crawl a seed URL and collect data. Although you crawled only on a small scale, the same process can be used
to gather large volumes of data. In both cases, you used Solr to index the data passed from Nutch. In the second example,
you used Apache Gora to determine where Nutch would store its data, which in this case was HBase. This gave you
two possible approaches for using Nutch with Hadoop: in the first, Nutch implicitly used Hadoop for storage; in the
second, Nutch used Apache Gora to explicitly select HBase for storage.
Where do you go from here? The command sequences and examples given in this chapter should enable you to
apply these approaches to your own system. Work through the stack in a logical order: make sure that HDFS is working before
moving on, and make sure that ZooKeeper is working before you attempt HBase. If you encounter errors,
search the web for solutions, because other people have often encountered similar problems. If one approach fails,
keep looking for new ways to approach the problem.
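
The "check each layer before moving on" advice can be sketched as a small script. This is only an illustrative sketch, not something taken from the chapter: the host name, the ports (8020 for the HDFS NameNode, 2181 for ZooKeeper), and the helper names `port_open`, `zookeeper_ok`, and `first_failure` are all assumptions to adjust for your own installation. It relies on ZooKeeper's `ruok` four-letter command, which a healthy server answers with `imok`; newer ZooKeeper releases may require that command to be enabled in the server configuration.

```python
import socket


def port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def zookeeper_ok(host, port=2181, timeout=2.0):
    """Send ZooKeeper's 'ruok' command; a healthy server replies 'imok'."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(b"ruok")
            return sock.recv(4) == b"imok"
    except OSError:
        return False


def first_failure(checks):
    """Run (name, callable) checks in order.

    Returns the name of the first check that fails, or None if all pass.
    Later checks are not run once one has failed.
    """
    for name, check in checks:
        if not check():
            return name
    return None


if __name__ == "__main__":
    # Hypothetical host and ports; change these to match your cluster.
    checks = [
        ("HDFS NameNode", lambda: port_open("localhost", 8020)),
        ("ZooKeeper", lambda: zookeeper_ok("localhost", 2181)),
    ]
    failed = first_failure(checks)
    print("all checks passed" if failed is None else f"fix first: {failed}")
```

Because the checks run in dependency order and stop at the first failure, the script points at the lowest layer that needs attention. A reachable port shows only that something is listening, so treat a pass as a first sanity check rather than proof that the service is healthy.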
 