14/04/13 17:28:10 INFO crawl.InjectorJob: InjectorJob: starting at 2014-04-13 17:28:10
14/04/13 17:28:10 INFO crawl.InjectorJob: InjectorJob: Injecting urlDir: nutch/urls
14/04/13 17:28:11 INFO zookeeper.ZooKeeper: Client environment:zookeeper.version=3.3.2-1031432, built on 11/05/2010 05:32 GMT
14/04/13 17:28:11 INFO zookeeper.ZooKeeper: Client environment:host.name=hc1r1m2
14/04/13 17:28:11 INFO zookeeper.ZooKeeper: Client environment:java.version=1.6.0_30
14/04/13 17:28:11 INFO zookeeper.ZooKeeper: Client environment:java.vendor=Sun Microsystems Inc.
14/04/13 17:28:11 INFO zookeeper.ZooKeeper: Client environment:java.home=/usr/lib/jvm/java-1.6.0-openjdk-1.6.0.0/jre
14/04/13 17:28:11 INFO zookeeper.ZooKeeper: Client
................
14/04/13 17:37:20 INFO mapred.JobClient: Job complete: job_201404131430_0019
14/04/13 17:37:21 INFO mapred.JobClient: Counters: 6
14/04/13 17:37:21 INFO mapred.JobClient: Job Counters
14/04/13 17:37:21 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=54738
14/04/13 17:37:21 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
14/04/13 17:37:21 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
14/04/13 17:37:21 INFO mapred.JobClient: Launched map tasks=8
14/04/13 17:37:21 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=0
You need to monitor all of your logs; that is, you need to monitor the following (see the monitoring sketch after this list):
ZooKeeper logs, in this case under /var/log/zookeeper. These allow you to ensure that all servers are up and running as a quorum.
Hadoop logs, in this case under /usr/local/hadoop/logs. Hadoop and MR must be running without error so that HBase can use Hadoop.
HBase logs, in this case under /usr/local/hbase/logs. These confirm that HBase is running and able to talk to ZooKeeper.
Solr output from the Solr session window. Solr must be running without error so that it can index the crawl output.
Nutch output from the crawl session. Any errors will appear in the session window.
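As a convenience, here is a minimal shell sketch for watching these sources from a single terminal. The directories are the ones used in this installation; the individual log file names are assumptions, since they depend on your hostnames and component versions.
# Follow the main daemon logs; the file names shown are placeholders --
# adjust them to match your own hostnames and versions.
tail -f /var/log/zookeeper/zookeeper.log \
        /usr/local/hadoop/logs/*namenode*.log \
        /usr/local/hadoop/logs/*jobtracker*.log \
        /usr/local/hbase/logs/*master*.log

# Alternatively, sweep all of the logs for recent errors:
grep -i error /var/log/zookeeper/*.log /usr/local/hadoop/logs/*.log /usr/local/hbase/logs/*.log
The Solr and Nutch session windows still need to be watched directly, since their output goes to the terminal rather than to these log directories.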
Each of the components in this architecture must be working for the Nutch crawl to succeed. If you encounter errors, pay particular attention to your configuration. For timeout errors in ZooKeeper, try increasing the tickTime and syncLimit values in the ZooKeeper configuration file (zoo.cfg) on each server, as sketched below.
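As a rough guide only, the timing-related entries in each server's zoo.cfg look like this; the values are illustrative assumptions rather than recommendations, and since initLimit and syncLimit are measured in ticks, raising tickTime relaxes those limits as well.
# zoo.cfg (illustrative values only)
# tickTime is the basic ZooKeeper time unit in milliseconds;
# the limits below are expressed as multiples of it.
tickTime=3000
# initLimit: ticks a follower may take to connect to and sync with the leader
initLimit=10
# syncLimit: ticks a follower may lag behind the leader before being dropped
syncLimit=10
dataDir=/var/lib/zookeeper
clientPort=2181
# quorum members -- the host names here are assumptions; list your own servers
server.1=hc1r1m1:2888:3888
server.2=hc1r1m2:2888:3888
server.3=hc1r1m3:2888:3888
Restart each ZooKeeper server after changing these values so that the whole quorum picks them up.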
Potential Errors
Here are some of the errors that occurred when I tried to use this configuration, along with their causes and solutions. If you encounter them, go back to the step you missed and correct the error. Consider the first one:
2014-04-08 19:05:39,334 ERROR org.apache.hadoop.hbase.master.HMasterCommandLine: Failed to start master
java.io.IOException: Couldnt start ZK at requested address of 2181, instead got: 2182. Aborting. Why? Because clients (eg shell) wont be able to find this
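The cause and fix depend on how HBase is configured; one common cause (an assumption here, not something the log alone proves) is HBase trying to start its own ZooKeeper on port 2181 while the external quorum already holds that port, so it falls back to 2182 and aborts. In that situation the usual remedy is to tell HBase not to manage ZooKeeper itself, roughly as follows.
# hbase-env.sh -- stop HBase from launching its own ZooKeeper instance
# (sketch only: applies when an external ZooKeeper quorum already owns port 2181)
export HBASE_MANAGES_ZK=false
With that in place, hbase-site.xml would typically set hbase.cluster.distributed to true and point hbase.zookeeper.quorum at the external ZooKeeper servers.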
 