14/04/06 17:05:53 INFO mapred.JobClient: HDFS_BYTES_READ=4096
14/04/06 17:05:53 INFO mapred.JobClient: FILE_BYTES_WRITTEN=246622
14/04/06 17:05:53 INFO mapred.JobClient: Map-Reduce Framework
14/04/06 17:05:53 INFO mapred.JobClient: Map output materialized bytes=12
14/04/06 17:05:53 INFO mapred.JobClient: Map input records=43
14/04/06 17:05:53 INFO mapred.JobClient: Reduce shuffle bytes=12
14/04/06 17:05:53 INFO mapred.JobClient: Spilled Records=0
14/04/06 17:05:53 INFO mapred.JobClient: Map output bytes=0
14/04/06 17:05:53 INFO mapred.JobClient: Total committed heap usage (bytes)=360120320
14/04/06 17:05:53 INFO mapred.JobClient: CPU time spent (ms)=3040
14/04/06 17:05:53 INFO mapred.JobClient: Map input bytes=3431
14/04/06 17:05:53 INFO mapred.JobClient: SPLIT_RAW_BYTES=242
14/04/06 17:05:53 INFO mapred.JobClient: Combine input records=0
14/04/06 17:05:53 INFO mapred.JobClient: Reduce input records=0
14/04/06 17:05:53 INFO mapred.JobClient: Reduce input groups=0
14/04/06 17:05:53 INFO mapred.JobClient: Combine output records=0
14/04/06 17:05:53 INFO mapred.JobClient: Physical memory (bytes) snapshot=408395776
14/04/06 17:05:53 INFO mapred.JobClient: Reduce output records=0
14/04/06 17:05:53 INFO mapred.JobClient: Virtual memory (bytes) snapshot=4121174016
14/04/06 17:05:53 INFO mapred.JobClient: Map output records=0
14/04/06 17:05:53 INFO indexer.CleaningJob: CleaningJob: finished at 2014-04-06 17:05:53, elapsed: 00:00:31
This output has been clipped because it is too long to include in full here. As long as you reach the CleaningJob line, you know that the crawl cycle has completed.
Look for any warnings and errors in this output. Common errors relate to undefined or unexpected document tokens encountered while crawling; updating schema.xml before starting Solr or attempting the crawl will minimize these. Also check the Hadoop logs and the Nutch log under the following paths (a quick way to scan them is shown after the list):
$HADOOP_PREFIX/logs/
$NUTCH_HOME/runtime/local/logs/
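Rather than reading these logs end to end, you can scan them for problem lines with grep. This is only a sketch: the exact log file names vary by Hadoop and Nutch release, so adjust the paths to match the files you actually find in those directories.
[hadoop@hc1nn nutch]$ grep -iE "warn|error|exception" $HADOOP_PREFIX/logs/*.log
[hadoop@hc1nn nutch]$ grep -iE "warn|error|exception" $NUTCH_HOME/runtime/local/logs/hadoop.log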
You can now check the Hadoop file system and see the data being stored there:
[hadoop@hc1nn nutch]$ hadoop fs -ls /user/hadoop
Found 2 items
drwxr-xr-x - hadoop supergroup 0 2014-04-06 14:07 /user/hadoop/crawl
drwxr-xr-x - hadoop supergroup 0 2014-04-06 11:46 /user/hadoop/nutch
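Before drilling into the crawl database, it can help to list the crawl directory itself. In a typical Nutch 1.x crawl this holds crawldb, linkdb, and segments subdirectories, although the exact set depends on which steps of the crawl cycle have run:
[hadoop@hc1nn nutch]$ hadoop fs -ls /user/hadoop/crawl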
The Nutch crawl database (crawldb) stores the current and old data in subdirectories:
[hadoop@hc1nn nutch]$ hadoop fs -ls /user/hadoop/crawl/crawldb
Found 2 items
drwxr-xr-x - hadoop supergroup 0 2014-04-06 14:16 /user/hadoop/crawl/crawldb/current
drwxr-xr-x - hadoop supergroup 0 2014-04-06 14:07 /user/hadoop/crawl/crawldb/old
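To check the contents of the crawl database rather than just its directory layout, the Nutch readdb tool prints summary statistics, such as the total number of URLs and a breakdown by fetch status. A minimal sketch, assuming the bin/nutch script from the Nutch runtime directory (use the deploy runtime when the crawldb lives on HDFS, as it does here):
[hadoop@hc1nn nutch]$ bin/nutch readdb /user/hadoop/crawl/crawldb -stats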
 