Collecting Data with Nutch and Solr - Big Data Made Easy: A Working Guide to the Complete Hadoop Toolset

The segments directory has a list of the segments that have been processed:

[hadoop@hc1nn nutch]$ hadoop fs -ls /user/hadoop/crawl/segments
Found 2 items
drwxr-xr-x   - hadoop supergroup          0 2014-04-06 14:08 /user/hadoop/crawl/segments/20140406140827
drwxr-xr-x   - hadoop supergroup          0 2014-04-06 14:17 /user/hadoop/crawl/segments/20140406141732
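Each segment directory is named with the timestamp of its creation, so the newest segment is the one whose name sorts last. As a minimal sketch (the function name latest_segment is hypothetical and not part of Nutch or Hadoop), the listing can be filtered to pick out the most recent segment:

```shell
#!/bin/sh
# latest_segment (hypothetical helper): reads `hadoop fs -ls` output on stdin
# and prints the path of the newest segment directory.
latest_segment() {
  # Keep only directory entries (permission string starts with "d"),
  # print the last field (the path), then sort. Segment names are
  # yyyyMMddHHmmss timestamps, so lexical order equals time order.
  awk '/^d/ { print $NF }' | sort | tail -n 1
}

# Usage on a live cluster (assumes the hadoop command is on the PATH):
#   hadoop fs -ls /user/hadoop/crawl/segments | latest_segment
```

For the listing shown here, this would print the path ending in 20140406141732.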

You now have data in Solr. To see it, go to the Solr admin web page at http://localhost:8983/solr/ and select “collection1” in the core selector drop-down menu (halfway down the left side). Figure 3-2 shows that the sample data has loaded into Solr. Specifically, under the Replication heading on the right, you can see that around 10 KB of data was loaded from the short crawl of the Semtech Solutions web page. Although this is a small amount of data for a large-scale distributed system, it confirms that the crawl ran and that its results were indexed correctly.

Figure 3-2. The Solr sample data after processing

To examine some of the actual data in Solr, select the Query option (bottom left). An Execute Query option will appear; select it to see the crawl results. Figure 3-3 shows a sample of the data that Solr has indexed from the website at the single URL specified in the seed file.
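The same data can also be retrieved from the command line through Solr's HTTP select handler. The following is a sketch only: it assumes Solr is listening at http://localhost:8983/solr with a core named collection1, as above, and the helper name solr_select_url is hypothetical. The match-all query *:* should correspond to what the admin page runs when no search terms are entered.

```shell
#!/bin/sh
# Assumption: Solr runs on localhost:8983; override SOLR_BASE if it does not.
SOLR_BASE="${SOLR_BASE:-http://localhost:8983/solr}"

# solr_select_url (hypothetical helper): prints a select URL for a core.
#   $1 = core name, $2 = number of rows to return (defaults to 10)
solr_select_url() {
  core="$1"
  rows="${2:-10}"
  # q=*:* matches every indexed document; wt=json asks for JSON output.
  printf '%s/%s/select?q=*:*&wt=json&indent=true&rows=%s\n' \
    "$SOLR_BASE" "$core" "$rows"
}

# Usage against a running Solr instance (requires curl):
#   curl -s "$(solr_select_url collection1 5)"
```

Building the URL in a separate function keeps the query parameters in one place, so the row count or core name can be changed without retyping the whole address.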
