Collecting Data with Nutch and Solr - Big Data Made Easy: A Working Guide to the Complete Hadoop Toolset


Make sure that the gora.properties file is set up correctly, as shown:

[hadoop@hc1r1m2 nutch]$ pwd

/usr/local/nutch

[hadoop@hc1r1m2 nutch]$ vi ./conf/gora.properties

Check that HBase is the default Gora data store. The line shown here should already exist in the file, but it may be commented out. Uncomment or add the line. This sets the default Gora data store to Apache HBase:

gora.datastore.default=org.apache.gora.hbase.store.HBaseStore
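As a quick sanity check, the setting can be confirmed from the command line rather than by eye. The following is a minimal sketch, not part of Nutch or Gora; the function name check_gora_default is hypothetical, and the path in the usage comment is the one assumed from the listing above.

```shell
# Hypothetical helper (not shipped with Nutch or Gora): report whether a
# gora.properties file has an active, uncommented default data store line
# that points at HBaseStore.
check_gora_default() {
    props_file="$1"
    # A line starting with '#' does not match, so commented-out entries fail.
    if grep -Eq '^[[:space:]]*gora\.datastore\.default[[:space:]]*=[[:space:]]*org\.apache\.gora\.hbase\.store\.HBaseStore[[:space:]]*$' "$props_file"; then
        echo "HBaseStore is the default data store"
    else
        echo "default data store line missing or commented out"
        return 1
    fi
}

# Example use, assuming the install path shown in the text:
# check_gora_default /usr/local/nutch/conf/gora.properties
```

The function returns a non-zero status when the line is absent or still commented out, so it could be placed in front of the build step in a script.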

Remember that each time you change the Nutch configuration, you need to recompile Nutch. Do so now, so that the Gora changes take effect:

[hadoop@hc1r1m2 nutch]$ pwd

/usr/local/nutch

[hadoop@hc1r1m2 nutch]$ ant runtime

Buildfile: build.xml

......

BUILD SUCCESSFUL

Total time: 13 minutes 39 seconds

Running the Nutch Crawl

You have started HBase, and you know that it is storing its data within Hadoop HDFS. You have a ZooKeeper quorum running, and HBase is able to connect to it without error. Solr has been started and is running without error on port 8983. Additionally, Nutch Gora has been configured to use HBase for storage. You are now ready to run a Nutch crawl. Move to the Nutch home directory with the Linux cd command, as shown:

[hadoop@hc1r1m2 nutch]$ cd $NUTCH_HOME

[hadoop@hc1r1m2 nutch]$ pwd

/usr/local/nutch

Now make sure that the seed URL is ready in HDFS. (You know it is ready because you stored it there for the Nutch 1.x crawl.) Checking the contents of the seed file, you can see that it has a single URL line (my website address). You could have put a few million lines into this file for a larger crawl, but you can try that later.

[hadoop@hc1r1m2 hadoop]$ hadoop dfs -cat /user/hadoop/nutch/urls/seed.txt
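For a larger seed file, reading the output by eye stops being practical, so it may help to count the entries instead. The following is a rough sketch; count_seed_urls is a hypothetical helper, and skipping blank lines and lines that begin with # is an assumption about how the seed list is meant to be read.

```shell
# Hypothetical helper: count the usable lines of a seed list read from stdin.
# Assumption: blank lines and lines starting with '#' are not seed URLs.
count_seed_urls() {
    grep -Ev '^[[:space:]]*(#|$)' | wc -l | tr -d ' '
}

# Example use, assuming the HDFS path from the text:
# hadoop dfs -cat /user/hadoop/nutch/urls/seed.txt | count_seed_urls
```

With the single-line seed file described above, the pipeline would be expected to print 1.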

You can determine the syntax for the crawl by executing the script name without parameters. The error message tells you how it should be run:

[hadoop@hc1r1m2 nutch]$ cd runtime/deploy/bin

[hadoop@hc1r1m2 bin]$ ./crawl

Missing seedDir : crawl <seedDir> <crawlID> <solrURL> <numberOfRounds>

The crawl is executed in the same format as for Nutch 1.x, and the output is shown as follows:

[hadoop@hc1nn bin]$ ./crawl urls crawl1 http://hc1r1m2:8983/solr/ 2
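Read against the usage message, the four values in this command are the seed directory (urls), the crawl ID (crawl1), the Solr URL (http://hc1r1m2:8983/solr/), and the number of rounds (2). The following is a small sketch of how those arguments could be checked before the script is called; build_crawl_cmd is a hypothetical helper, not part of Nutch, and it only prints the command rather than running it.

```shell
# Hypothetical helper (not part of Nutch): validate the four positional
# arguments named in the usage message, then print the command line.
build_crawl_cmd() {
    if [ "$#" -ne 4 ]; then
        echo "usage: build_crawl_cmd <seedDir> <crawlID> <solrURL> <numberOfRounds>" >&2
        return 1
    fi
    # The fourth argument must consist of digits only.
    case "$4" in
        ''|*[!0-9]*)
            echo "numberOfRounds must be a whole number: $4" >&2
            return 1
            ;;
    esac
    echo "./crawl $1 $2 $3 $4"
}

# Values taken from the command in the text.
build_crawl_cmd urls crawl1 http://hc1r1m2:8983/solr/ 2
```

Printing the command instead of executing it keeps the sketch safe to run on a machine without Nutch installed; the echo could be replaced by the real call once the arguments look right.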
