[hadoop@hc1nn ~]$ pwd
/home/hadoop
[hadoop@hc1nn ~]$ ls -l .bashrc*
lrwxrwxrwx. 1 hadoop hadoop 16 Jun 30 17:59 .bashrc -> .bashrc_hadoopv2
-rw-r--r--. 1 hadoop hadoop 1586 Jun 18 17:08 .bashrc_hadoopv1
-rw-r--r--. 1 hadoop hadoop 1588 Jul 27 11:33 .bashrc_hadoopv2
The Linux pwd command shows that the current location is the Linux hadoop user's home directory, /home/hadoop. The Linux ls command produces a long listing that shows a symbolic link called .bashrc, which points to either the Hadoop V1 or the V2 version of the Bash configuration file. Currently it points to V2, so you need to change it back to V1. (I will not explain the contents of these files, as they are listed in Chapter 2.)
Delete the symbolic link named .bashrc by using the Linux rm command; then re-create it to point to the V1 file by using the Linux ln command with the -s (symbolic) switch:
[hadoop@hc1nn ~]$ rm .bashrc
[hadoop@hc1nn ~]$ ln -s .bashrc_hadoopv1 .bashrc
[hadoop@hc1nn ~]$ ls -l .bashrc*
lrwxrwxrwx 1 hadoop hadoop 16 Nov 12 18:32 .bashrc -> .bashrc_hadoopv1
-rw-r--r--. 1 hadoop hadoop 1586 Jun 18 17:08 .bashrc_hadoopv1
-rw-r--r--. 1 hadoop hadoop 1588 Jul 27 11:33 .bashrc_hadoopv2
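As a quick sanity check (this command is not in the original listing), the Linux readlink command prints the link target directly; given the ln command above, it should report the V1 file:
[hadoop@hc1nn ~]$ readlink .bashrc
.bashrc_hadoopv1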
That re-creates the correct environment configuration file for the Linux hadoop account, but how does it take effect? Either log out using the exit command and log back in, or source the file as follows:
[hadoop@hc1nn ~]$ . ./.bashrc
The leading "." is the Bash source command: it executes .bashrc in the current shell rather than in a subshell, so the variable settings persist. The ./ specifies that the .bashrc file is read from the current directory. Now you are ready to start the Hadoop V1 servers.
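To confirm that the V1 environment is now active (assuming, as in Chapter 2, that .bashrc_hadoopv1 sets the HADOOP_PREFIX variable to the V1 installation directory), echo the variable:
[hadoop@hc1nn ~]$ echo $HADOOP_PREFIX
/usr/local/hadoop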
Starting the Servers
The Hadoop V1 environment has been configured, and the V2 Hadoop servers have already been stopped. Now, you
change to the proper directory and start the servers:
[hadoop@hc1nn ~]$ cd $HADOOP_PREFIX/bin
[hadoop@hc1nn bin]$ pwd
/usr/local/hadoop/bin
[hadoop@hc1nn bin]$ ./start-dfs.sh
[hadoop@hc1nn bin]$ ./start-mapred.sh
These commands change to the /usr/local/hadoop/bin directory using the HADOOP_PREFIX variable. The HDFS servers are started using the start-dfs.sh script, followed by the Map Reduce servers with start-mapred.sh. At this point, you can begin the Nutch work, using Hadoop V1 on this cluster.
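As a quick check (not part of the original steps), the JDK's jps command lists the running Java daemon processes; assuming hc1nn is the name node, you would expect to see NameNode, SecondaryNameNode, and JobTracker, plus DataNode and TaskTracker if this host also acts as a worker:
[hadoop@hc1nn bin]$ jps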
Architecture 1: Nutch 1.x
This first example illustrates how Nutch, Solr, and Hadoop work together. You will learn how to download, install, and
configure Nutch 1.8 and Solr, as well as how to set up your environment and build Nutch. With the prep work finished,
I'll walk you through running a sample Nutch crawl using Solr and then storing the data on the Hadoop file system.