14/04/06 17:05:53 INFO mapred.JobClient: HDFS_BYTES_READ=4096
14/04/06 17:05:53 INFO mapred.JobClient: FILE_BYTES_WRITTEN=246622
14/04/06 17:05:53 INFO mapred.JobClient: Map-Reduce Framework
14/04/06 17:05:53 INFO mapred.JobClient: Map output materialized bytes=12
14/04/06 17:05:53 INFO mapred.JobClient: Map input records=43
14/04/06 17:05:53 INFO mapred.JobClient: Reduce shuffle bytes=12
14/04/06 17:05:53 INFO mapred.JobClient: Spilled Records=0
14/04/06 17:05:53 INFO mapred.JobClient: Map output bytes=0
14/04/06 17:05:53 INFO mapred.JobClient: Total committed heap usage (bytes)=360120320
14/04/06 17:05:53 INFO mapred.JobClient: CPU time spent (ms)=3040
14/04/06 17:05:53 INFO mapred.JobClient: Map input bytes=3431
14/04/06 17:05:53 INFO mapred.JobClient: SPLIT_RAW_BYTES=242
14/04/06 17:05:53 INFO mapred.JobClient: Combine input records=0
14/04/06 17:05:53 INFO mapred.JobClient: Reduce input records=0
14/04/06 17:05:53 INFO mapred.JobClient: Reduce input groups=0
14/04/06 17:05:53 INFO mapred.JobClient: Combine output records=0
14/04/06 17:05:53 INFO mapred.JobClient: Physical memory (bytes) snapshot=408395776
14/04/06 17:05:53 INFO mapred.JobClient: Reduce output records=0
14/04/06 17:05:53 INFO mapred.JobClient: Virtual memory (bytes) snapshot=4121174016
14/04/06 17:05:53 INFO mapred.JobClient: Map output records=0
14/04/06 17:05:53 INFO indexer.CleaningJob: CleaningJob: finished at 2014-04-06 17:05:53, elapsed: 00:00:31
This output has been clipped because it is too long to include in full here. As long as you reach the CleaningJob line, you know that the crawl cycle has completed.
Look for any warnings and errors in this output. Common errors relate to undefined or unexpected document tokens encountered while crawling; updating schema.xml before starting Solr or attempting the crawl will minimize these. Also check the Hadoop logs and the Nutch log under the following paths (a quick way to scan them is shown after the list):
$HADOOP_PREFIX/logs/
$NUTCH_HOME/runtime/local/logs/
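Rather than reading these logs end to end, you can scan them for problem lines with grep. This is only a sketch: the exact log file names vary by Hadoop and Nutch release, so adjust the paths to match the files you actually find in those directories.
[hadoop@hc1nn nutch]$ grep -iE "warn|error|exception" $HADOOP_PREFIX/logs/*.log
[hadoop@hc1nn nutch]$ grep -iE "warn|error|exception" $NUTCH_HOME/runtime/local/logs/hadoop.log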
You can now check the Hadoop file system and see the data being stored there:
[hadoop@hc1nn nutch]$ hadoop fs -ls /user/hadoop
Found 2 items
drwxr-xr-x - hadoop supergroup 0 2014-04-06 14:07 /user/hadoop/crawl
drwxr-xr-x - hadoop supergroup 0 2014-04-06 11:46 /user/hadoop/nutch
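Before drilling into the crawl database, it can help to list the crawl directory itself. In a typical Nutch 1.x crawl this holds crawldb, linkdb, and segments subdirectories, although the exact set depends on which steps of the crawl cycle have run:
[hadoop@hc1nn nutch]$ hadoop fs -ls /user/hadoop/crawl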
The Nutch crawl database (crawldb) stores the current and old data in subdirectories:
[hadoop@hc1nn nutch]$ hadoop fs -ls /user/hadoop/crawl/crawldb
Found 2 items
drwxr-xr-x - hadoop supergroup 0 2014-04-06 14:16 /user/hadoop/crawl/crawldb/current
drwxr-xr-x - hadoop supergroup 0 2014-04-06 14:07 /user/hadoop/crawl/crawldb/old
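To check the contents of the crawl database rather than just its directory layout, the Nutch readdb tool prints summary statistics, such as the total number of URLs and a breakdown by fetch status. A minimal sketch, assuming the bin/nutch script from the Nutch runtime directory (use the deploy runtime when the crawldb lives on HDFS, as it does here):
[hadoop@hc1nn nutch]$ bin/nutch readdb /user/hadoop/crawl/crawldb -stats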
 