In Nutch 1.8, the crawl class has been replaced by the crawl script. The crawl script, which will be described later, runs
the whole Nutch crawl for you and stores the data in Hadoop. After that, you will learn how to check that your data has
been processed by Solr.
Nutch Installation
For this first example, you will download and install Nutch 1.8 from the Nutch website (nutch.apache.org). From
the Downloads page, choose the source version of Nutch 1.8. The -src in the file name means that the source files are
included in the software package as well as the binaries. As in the previous chapter, you download a gzipped tar file
(.tar.gz), then unpack it using the gunzip command, followed by the tar xvf command. (In the tar command, x stands
for “extract,” v for “verbose,” and the file name to process is specified after the f option.)
-rw-rw-r--. 1 hadoop hadoop 2757572 Apr 1 18:12 apache-nutch-1.8-src.tar.gz
[hadoop@hc1nn Downloads]$ gunzip apache-nutch-1.8-src.tar.gz
[hadoop@hc1nn Downloads]$ tar xvf apache-nutch-1.8-src.tar
This leaves the unpacked Nutch package under the directory apache-nutch-1.8-src,
as shown here:
[hadoop@hc1nn Downloads]$ ls -ld apache-nutch-1.8-src
drwxrwxr-x. 7 hadoop hadoop 4096 Apr 1 18:13 apache-nutch-1.8-src
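The gunzip and tar steps above can also be combined into a single tar call by adding the z option, which passes the archive through gzip while extracting. The following is a minimal sketch only; the function name unpack_tgz and its optional destination argument are illustrative assumptions and do not come from the Nutch documentation.

```shell
# Hypothetical helper: unpack a gzipped tar file in a single step.
# -x extracts, -z filters the archive through gzip, -f names the archive file,
# and -C selects the directory to extract into (it must already exist).
unpack_tgz() {
    archive="$1"
    dest="${2:-.}"    # default to the current directory
    tar -xzf "$archive" -C "$dest"
}

# Example: unpack_tgz apache-nutch-1.8-src.tar.gz
```

Unlike gunzip, which replaces the .tar.gz file with a .tar file, this approach leaves the original archive in place.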
Using the mv command, you move this release to /usr/local and then set the ownership to the Linux
hadoop user with the chown command. Note that I used the -R switch, which recursively changes ownership on
subdirectories and files under the topmost directory, apache-nutch-1.8-src:
[root@hc1nn Downloads]# mv apache-nutch-1.8-src /usr/local
[root@hc1nn Downloads]# cd /usr/local
[root@hc1nn local]# chown -R hadoop:hadoop apache-nutch-1.8-src
[root@hc1nn local]# ln -s apache-nutch-1.8-src nutch
Now the Nutch installation has been moved to /usr/local/, and a symbolic link called “nutch” has been created that
points to the installed software. That means that the environment can use this alias to refer to the installation
directory. If a new release of Nutch is required in the future, simply change this link to point to it; the environment will
not need to be changed.
[root@hc1nn local]# ls -ld *nutch*
drwxrwxr-x. 7 hadoop hadoop 4096 Apr 1 18:13 apache-nutch-1.8-src
lrwxrwxrwx. 1 root root 20 Apr 1 18:16 nutch -> apache-nutch-1.8-src
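Re-pointing the nutch link at a later release can be sketched as follows. This is an illustration under stated assumptions: the function name repoint_link is hypothetical, and the newer directory name in the example comment is a placeholder, not a real release.

```shell
# Hypothetical helper: point an existing symbolic link at a different directory.
# -s makes a symbolic link, -f replaces the existing link, and -n stops ln from
# following the old link and creating the new link inside the directory it points to.
repoint_link() {
    target="$1"
    link="$2"
    ln -sfn "$target" "$link"
}

# Example (placeholder directory name):
# cd /usr/local && repoint_link apache-nutch-NEW-src nutch
```

Because only the link changes, any environment setting that refers to /usr/local/nutch keeps working.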
Next, you will set up the configuration files. The first step is to create symbolic links to the Hadoop configuration
files in the Nutch build. This avoids the need to copy changes in the Hadoop configuration to the Nutch build each
time such a change occurs. Create the links as follows:
[root@hc1nn local]# cd /usr/local/nutch/conf
[root@hc1nn conf]# ln -s /usr/local/hadoop/conf/core-site.xml core-site.xml
[root@hc1nn conf]# ln -s /usr/local/hadoop/conf/hdfs-site.xml hdfs-site.xml
[root@hc1nn conf]# ln -s /usr/local/hadoop/conf/hadoop-env.sh hadoop-env.sh
[root@hc1nn conf]# ln -s /usr/local/hadoop/conf/mapred-site.xml mapred-site.xml
[root@hc1nn conf]# ln -s /usr/local/hadoop/conf/masters masters
[root@hc1nn conf]# ln -s /usr/local/hadoop/conf/slaves slaves
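The six ln commands above follow one pattern, so they can be expressed as a loop. The sketch below assumes the same file names and directories as the listing; the function name link_hadoop_conf is hypothetical.

```shell
# Hypothetical helper: link each Hadoop configuration file into the Nutch conf directory.
link_hadoop_conf() {
    hadoop_conf="$1"
    nutch_conf="$2"
    for f in core-site.xml hdfs-site.xml hadoop-env.sh mapred-site.xml masters slaves; do
        # As with the manual commands, ln -s reports an error and leaves the
        # existing entry alone if a file of this name is already present.
        ln -s "$hadoop_conf/$f" "$nutch_conf/$f"
    done
}

# Example: link_hadoop_conf /usr/local/hadoop/conf /usr/local/nutch/conf
```

Keeping the file list in one place makes it easier to add another configuration file later.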
 