Make sure that the gora.properties file is set up correctly, as shown:
[hadoop@hc1r1m2 nutch]$ pwd
/usr/local/nutch
[hadoop@hc1r1m2 nutch]$ vi ./conf/gora.properties
Check that HBase is set as the default Gora data store. The line shown here should already exist in the file, but it
may be commented out; uncomment it, or add it if it is missing. It sets the default Gora data store to Apache HBase:
gora.datastore.default=org.apache.gora.hbase.store.HBaseStore
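If you would rather confirm the setting without opening the editor, you can search the file for an uncommented line. The following is a minimal sketch, not part of Nutch or Gora: the function name gora_default_store is hypothetical, and it assumes the usual key=value properties layout with "#" marking comments.

```shell
#!/usr/bin/env bash
# Print the value of the active (uncommented) gora.datastore.default line.
# Returns non-zero if the file has no such line.
# gora_default_store is a hypothetical helper name, not a Nutch or Gora command.
gora_default_store() {
  local props_file="$1" line
  # Anchor the match at the start of the line so "#gora.datastore.default=..." is ignored.
  # If the key appears more than once, keep the last match.
  line=$(grep -E '^[[:space:]]*gora\.datastore\.default[[:space:]]*=' "$props_file" | tail -n 1)
  [ -n "$line" ] || return 1
  # Strip everything up to and including the first "=".
  printf '%s\n' "${line#*=}"
}
```

Calling gora_default_store ./conf/gora.properties from /usr/local/nutch should print org.apache.gora.hbase.store.HBaseStore once the line is in place; a non-zero exit status suggests the line is still commented out or missing.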
Remember that each time you change the Nutch configuration, you need to re-compile Nutch. Do so now, so that
the Gora changes take effect:
[hadoop@hc1r1m2 nutch]$ pwd
/usr/local/nutch
[hadoop@hc1r1m2 nutch]$ ant runtime
Buildfile: build.xml
......
BUILD SUCCESSFUL
Total time: 13 minutes 39 seconds
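After the build, it can be worth checking that the crawl script used later in this section is in place under runtime/deploy/bin. This is a small sketch under that assumption; has_crawl_script is a hypothetical helper name, and the default path is the Nutch home used in this chapter.

```shell
#!/usr/bin/env bash
# Succeed if the crawl script exists and is executable under the given Nutch home.
# has_crawl_script is a hypothetical helper name; the default matches /usr/local/nutch.
has_crawl_script() {
  local nutch_home="${1:-/usr/local/nutch}"
  [ -x "$nutch_home/runtime/deploy/bin/crawl" ]
}
```

For example, has_crawl_script /usr/local/nutch && echo "crawl script found" prints the message only when the script is present and executable.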
Running the Nutch Crawl
You have started HBase, and you know that it is storing its data within Hadoop HDFS. You have a
ZooKeeper quorum running, and HBase is able to connect to it without error. Solr has been started and is running
without error on port 8983. Additionally, Nutch Gora has been configured to use HBase for storage. You are now
ready to run a Nutch crawl. First, move to the Nutch home directory by using the Linux cd command:
[hadoop@hc1r1m2 nutch]$ cd $NUTCH_HOME
[hadoop@hc1r1m2 nutch]$ pwd
/usr/local/nutch
Now make sure that the seed URL is ready in HDFS. (You know it is ready because you stored it there for the
Nutch 1.x crawl.) Checking the contents of the seed file, you can see that it has a single URL line (my website address).
You could have put a few million lines into this file for a larger crawl, but you can try that later.
[hadoop@hc1r1m2 hadoop]$ hadoop dfs -cat /user/hadoop/nutch/urls/seed.txt
http://www.semtech-solutions.co.nz
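For a larger seed file, reading the contents by eye is not practical, so a count of URL lines is a more useful check. The sketch below reads from standard input so that it can be fed from the hadoop dfs -cat command above; count_seed_urls is a hypothetical helper name, and it counts only lines that begin with http:// or https://.

```shell
#!/usr/bin/env bash
# Count the lines on standard input that start with http:// or https://.
# Blank lines and any other lines are not counted.
# count_seed_urls is a hypothetical helper name.
count_seed_urls() {
  # grep -c prints 0 but exits non-zero when nothing matches, so mask the exit status.
  grep -cE '^[[:space:]]*https?://' || true
}
```

Piping the seed file through it, as in hadoop dfs -cat /user/hadoop/nutch/urls/seed.txt | count_seed_urls, should print 1 for the single-URL file shown here.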
You can determine the syntax for the crawl by executing the script name without parameters. The error message
tells you how it should be run:
[hadoop@hc1r1m2 nutch]$ cd runtime/deploy/bin
[hadoop@hc1r1m2 bin]$ ./crawl
Missing seedDir : crawl <seedDir> <crawlID> <solrURL> <numberOfRounds>
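The usage message lists four positional arguments. As a sketch of how they fit together, the helper below assembles the command line from named values and rejects an obviously wrong round count before anything is run. The name build_crawl_cmd is hypothetical; it only prints the command and does not call the crawl script itself.

```shell
#!/usr/bin/env bash
# Assemble the crawl command line from its four arguments and print it.
# build_crawl_cmd is a hypothetical helper; it does not execute the crawl.
build_crawl_cmd() {
  if [ "$#" -ne 4 ]; then
    echo "usage: build_crawl_cmd <seedDir> <crawlID> <solrURL> <numberOfRounds>" >&2
    return 1
  fi
  local seed_dir="$1" crawl_id="$2" solr_url="$3" rounds="$4"
  # Reject an empty, non-numeric, or zero round count.
  case "$rounds" in
    ''|*[!0-9]*|0)
      echo "numberOfRounds must be a positive integer: $rounds" >&2
      return 1
      ;;
  esac
  printf './crawl %s %s %s %s\n' "$seed_dir" "$crawl_id" "$solr_url" "$rounds"
}
```

With the values used for the crawl in this section, build_crawl_cmd urls crawl1 http://hc1r1m2:8983/solr/ 2 prints ./crawl urls crawl1 http://hc1r1m2:8983/solr/ 2.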
The crawl is executed in the same format as for Nutch 1.x, and the output is shown as follows:
[hadoop@hc1nn bin]$ ./crawl urls crawl1 http://hc1r1m2:8983/solr/ 2
 