This is a small-scale example, so you will start with a single URL to follow. In a real situation, you would populate
the crawl database with a large volume of URLs by using a much larger seed file. You can also download seed file
databases from the Internet. The file at http://rdf.dmoz.org/rdf/content.rdf.u8.gz has around 3 million URLs.
No matter the size of your seed file, you need to copy it (seed.txt, in this case) from the Linux file system to HDFS
(to the nutch/urls directory):
[hadoop@hc1nn nutch]$ hadoop dfs -put urls/seed.txt nutch/urls
[hadoop@hc1nn nutch]$ hadoop dfs -ls nutch/urls
Found 1 items
-rw-r--r--   1 hadoop supergroup   19 2014-04-05 13:19 /user/hadoop/nutch/urls/seed.txt
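(On newer Hadoop releases the hadoop dfs form is deprecated; hdfs dfs -put and hdfs dfs -ls are the equivalent commands.)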
The Nutch crawl command, found in $NUTCH_HOME/runtime/deploy/bin, is actually a shell script that automates the following sequence of Nutch operations (a manual sketch of the same steps follows the list):
1. Inject. Inserts a URL into the Nutch crawl database.
2. Generate. Creates a fetch list from the Nutch database for the crawl. This creates a segment directory within the crawl database for fetch processing.
3. Fetch. Runs the fetcher against the segment created in step 2.
4. Parse. Processes the results of the fetch.
5. Update db. Updates the Nutch crawl database with the results of the parse.
6. Invertlinks. Creates a link map, listing incoming links for this URL.
7. Dedup. Deletes duplicate documents that are in the index.
8. Index. Runs the indexer on the database.
9. Clean. Cleans up after the crawl cycle.
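To make these steps concrete, running them by hand corresponds roughly to the commands below. This is only a sketch: the crawldb, segments, and linkdb paths are illustrative, <segment> stands for the segment directory that the generate step creates, and the exact options for index, dedup, and clean differ between Nutch releases (older versions used solrindex and solrclean), so check the crawl script shipped with your own installation:
[hadoop@hc1nn deploy]$ bin/nutch inject nutch/crawl/crawldb nutch/urls
[hadoop@hc1nn deploy]$ bin/nutch generate nutch/crawl/crawldb nutch/crawl/segments
[hadoop@hc1nn deploy]$ bin/nutch fetch nutch/crawl/segments/<segment>
[hadoop@hc1nn deploy]$ bin/nutch parse nutch/crawl/segments/<segment>
[hadoop@hc1nn deploy]$ bin/nutch updatedb nutch/crawl/crawldb nutch/crawl/segments/<segment>
[hadoop@hc1nn deploy]$ bin/nutch invertlinks nutch/crawl/linkdb -dir nutch/crawl/segments
[hadoop@hc1nn deploy]$ bin/nutch dedup nutch/crawl/crawldb
[hadoop@hc1nn deploy]$ bin/nutch index -D solr.server.url=http://<solr-host>:8983/solr/ nutch/crawl/crawldb -linkdb nutch/crawl/linkdb nutch/crawl/segments/<segment>
[hadoop@hc1nn deploy]$ bin/nutch clean -D solr.server.url=http://<solr-host>:8983/solr/ nutch/crawl/crawldb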
For the actual crawl itself, the syntax of the crawl command is:
crawl <seedDir> <crawlDir> <solrURL> <numberOfRounds>
The seed directory, nutch/urls, will be sourced from HDFS, which is why you copied the URL list to HDFS.
The Solr URL gives Nutch a link to the Solr instance you started in the last section, and the number of rounds is effectively the depth to which the crawl will follow links.
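For example, a two-round crawl using the seed directory and a Solr instance on hc1nn might look like the line below; the Solr port and path are assumptions, so use the URL of the instance you actually started:
[hadoop@hc1nn deploy]$ bin/crawl nutch/urls nutch/crawl http://hc1nn:8983/solr/ 2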
The crawl script performs a couple of checks to decide whether it should use Hadoop for storage and processing. First, it looks for the Nutch job file:
mode=local
if [ -f ../*nutch-*.job ]; then
  mode=distributed
fi
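In other words, when the script is run from $NUTCH_HOME/runtime/deploy/bin, where the Nutch .job file sits one directory up, it switches to distributed mode and submits each step as a Hadoop job; run from runtime/local, it stays in local mode.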
 