Chapter 3
Collecting Data with Nutch and Solr
Many companies collect vast amounts of data from the web by using web crawlers such as Apache Nutch. Available for
more than ten years, Nutch is an open-source Apache product with a large community of committed users. Solr, an
open-source search platform built on Apache Lucene, can be used in conjunction with Nutch to index and
search the data that Nutch collects. When you combine this functionality with Hadoop, you can store the resulting
large data volume directly in a distributed file system.
In this chapter, you will learn a number of methods for connecting various releases of Nutch to Hadoop. I will
demonstrate, through architectural examples, what can be accomplished by using the various tools and data.
Specifically, the chapter's first architectural example uses Nutch 1.8 configured to implicitly use the local Hadoop
installation. If Hadoop is available, Nutch will use it for storage, providing you with the benefits of distributed and
resilient storage. It does not, however, give you much control over the selection of storage: Nutch will use Hadoop
if it is available, or the local file system otherwise.
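A quick way to confirm which storage Nutch actually used is to look for the crawl output in both places. The following is only a sketch; the crawl directory name ("crawl") and the HDFS path shown are assumptions for illustration, not values taken from the examples that follow:

hadoop fs -ls /user/hadoop/crawl      # crawl data appears here if Nutch used HDFS
ls -l ~/nutch/runtime/local/crawl     # otherwise it will be on the local file system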
In the second architectural example, employing Nutch 2.x, you will be able to specify the storage used via Gora.
By explicitly selecting the storage method in the configuration options, you can gain greater control. This example
uses the HBase database, which still employs Hadoop for distributed storage. You then have the option of choosing a
different storage mechanism at a later date by altering the configuration.
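As a preview of the configuration involved, Nutch 2.x selects its storage back end through a Gora storage class property. The following is a minimal sketch assuming the HBase back end; the property names are those used by Nutch 2.x and Gora, and the files live in the standard Nutch conf directory.

In conf/nutch-site.xml:

<property>
  <name>storage.data.store.class</name>
  <value>org.apache.gora.hbase.store.HBaseStore</value>
  <description>Use HBase, via Gora, as the Nutch storage back end</description>
</property>

In conf/gora.properties:

gora.datastore.default=org.apache.gora.hbase.store.HBaseStore

Switching to a different storage mechanism later is then a matter of pointing these settings at another Gora data store.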
Remember, although these examples use small amounts of data, the architectures can scale to a high
degree to meet your big-data collection needs.
The Environment
Before we begin, you need to understand a few details about the environment in which we'll be working. This chapter
demonstrates the use of Nutch with Hadoop V1.2.1 because I could not get Nutch to build against Hadoop V2 at the
time of this writing. (Subsequently, I learned of a version of Nutch developed for YARN, but deadline constraints
prevented me from implementing it here.) Although in Chapter 2 you installed Cloudera CDH4 on the CentOS Linux
server hc1nn, at this point you'll need to switch back to using Hadoop V1. You'll manage this via a number of steps
that are explained in the following sections. A shortage of available machines is the only reason I have installed
multiple versions of Hadoop on a single cluster; installing multiple Hadoop versions like this is not something you
would do on a real project.
Stopping the Servers
The Hadoop Cloudera CDH4 cluster servers may still be running, so they need to be stopped on all nodes in the
Hadoop cluster. Because these servers are Linux services, you need to stop them as the Linux root user. You carry out
the following steps on all servers in the cluster—in this case, hc1nn, hc1r1m1, hc1r1m2, and hc1r1m3.
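As a sketch of what stopping the services looks like, the following loop, run as the root user on each node, stops every Hadoop-related CDH4 service installed on that machine. The exact service names (for example, hadoop-hdfs-namenode or hadoop-hdfs-datanode) vary by node, depending on which components were installed there, so iterating over the init scripts avoids having to list them individually:

for svc in /etc/init.d/hadoop-*
do
  service $(basename $svc) stop
done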