Figure 3-3. Solr with crawl results
That completes architecture example 1. You have connected Hadoop 1.2.1 to Nutch 1.8 and indexed the crawled data using
Solr 4.7. Note that when the crawl script ran, Nutch implicitly checked for Hadoop and used it because it was available; otherwise, it would have
used the Linux file system for storage. This point matters because, in the second architectural
example, which uses Nutch 2.x, you will explicitly configure Nutch to use HBase, and therefore Hadoop as well.
Architecture 2: Nutch 2.x
In the first architecture example, you used Nutch 1.x. When you executed a crawl, Nutch automatically checked whether
you were in a distributed environment and, because you were, used Hadoop for storage.
The architecture of this next example enables you to specify the storage you will use for your Nutch crawl. Nutch 2.x
uses Apache Gora (gora.apache.org) to abstract the storage layer. In this example, you will use Apache HBase as the
storage back end. Apache HBase (hbase.apache.org), which stores its data on Hadoop and HDFS, offers real-time random
read/write access to big data. Should you later need a different storage option, Gora provides that flexibility; you just
change the Gora configuration. For instance, you might decide to use the Apache Accumulo database
(accumulo.apache.org) instead.
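
The idea of swapping storage by changing only configuration can be illustrated with a minimal Python sketch. It is not Gora's API: the DataStore interface, the two in-memory back ends, and the open_store helper are hypothetical names invented for illustration. The gora.datastore.default key and the two store class names follow the naming used by Gora's properties file and its HBase and Accumulo modules, but treat them as assumptions to check against your Gora version.

```python
from abc import ABC, abstractmethod
from typing import Dict, Optional


class DataStore(ABC):
    """Hypothetical storage interface; a stand-in for the role Gora plays."""

    @abstractmethod
    def put(self, key: str, value: dict) -> None:
        """Store one record under the given key."""

    @abstractmethod
    def get(self, key: str) -> Optional[dict]:
        """Return the record for key, or None if it is absent."""


class _InMemoryStore(DataStore):
    """Shared in-memory behaviour so the sketch runs without any database."""

    def __init__(self) -> None:
        self._rows: Dict[str, dict] = {}

    def put(self, key: str, value: dict) -> None:
        self._rows[key] = dict(value)

    def get(self, key: str) -> Optional[dict]:
        return self._rows.get(key)


class FakeHBaseStore(_InMemoryStore):
    """Pretend HBase back end (assumption: no real HBase is contacted)."""


class FakeAccumuloStore(_InMemoryStore):
    """Pretend Accumulo back end (assumption: no real Accumulo is contacted)."""


# Maps a configured store class name to a back end implementation.
STORE_REGISTRY = {
    "org.apache.gora.hbase.store.HBaseStore": FakeHBaseStore,
    "org.apache.gora.accumulo.store.AccumuloStore": FakeAccumuloStore,
}


def open_store(config: Dict[str, str]) -> DataStore:
    """Pick the back end named in the configuration and instantiate it."""
    class_name = config["gora.datastore.default"]
    try:
        return STORE_REGISTRY[class_name]()
    except KeyError:
        raise ValueError(f"unknown data store: {class_name}") from None


def save_page(store: DataStore, url: str, title: str) -> None:
    """Crawler-side code: it only sees DataStore, never the back end."""
    store.put(url, {"title": title})
```

In this sketch, moving from the HBase-like back end to the Accumulo-like one means changing the value stored under gora.datastore.default; save_page and any other code written against DataStore stays the same, which is the flexibility the text attributes to Gora.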
 