48          else
49            current = -1;
50        }
51      }
52    }
53    catch (Exception ex) {
54      System.out.println("Error: " + ex.getMessage());
55    }
56
57    return list; // return URLs
58  }
59 }
Enter a URL: http://www.cs.armstrong.edu/liang
Craw http://www.cs.armstrong.edu/liang
Craw http://www.cs.armstrong.edu
Craw http://www.armstrong.edu
Craw http://www.pearsonhighered.com/liang
...
The program prompts the user to enter a starting URL (lines 7-8) and invokes the
crawler(url) method to traverse the web (line 9).
The crawler(url) method adds the starting url to listOfPendingURLs (line 16) and
repeatedly processes each URL in listOfPendingURLs in a while loop (lines 17-29). It
removes the first URL in the list (line 19) and processes the URL if it has not been processed
(lines 20-28). To process each URL, the program first adds the URL to listOfTraversedURLs (line 21). This list stores all the URLs that have been processed. The getSubURLs(url)
method returns a list of URLs in the Web page for the specified URL (line 24). The program
uses a foreach loop to add each URL in the page into listOfPendingURLs if it is not in
listOfTraversedURLs (lines 24-26).
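The pending/traversed-list logic described above can be sketched on its own, independently of any networking code. In the sketch below, the page-reading step is abstracted into a fetchSubURLs function (a hypothetical stand-in for the book's getSubURLs method), and the maxPages parameter plays the role of the program's limit on traversed URLs:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

public class CrawlerSketch {
    // Traverses up to maxPages URLs starting from startingURL.
    // fetchSubURLs is a hypothetical stand-in for getSubURLs(url):
    // given a URL, it returns the URLs found on that page.
    public static List<String> crawl(String startingURL,
            Function<String, List<String>> fetchSubURLs, int maxPages) {
        List<String> listOfPendingURLs = new ArrayList<>();
        List<String> listOfTraversedURLs = new ArrayList<>();

        listOfPendingURLs.add(startingURL);
        while (!listOfPendingURLs.isEmpty()
                && listOfTraversedURLs.size() < maxPages) {
            String urlString = listOfPendingURLs.remove(0); // take the first pending URL
            if (!listOfTraversedURLs.contains(urlString)) {
                listOfTraversedURLs.add(urlString);         // mark it as processed
                // Queue every sub-URL that has not been traversed yet
                for (String s : fetchSubURLs.apply(urlString)) {
                    if (!listOfTraversedURLs.contains(s)) {
                        listOfPendingURLs.add(s);
                    }
                }
            }
        }
        return listOfTraversedURLs;
    }
}
```

Passing a small in-memory link graph as fetchSubURLs lets you step through the traversal without touching the network.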
The getSubURLs(url) method reads each line from the Web page (line 40) and searches
for the URLs in the line (line 41). Note that a correct URL cannot contain line break characters, so it is sufficient to limit the search for a URL to one line of the text in a Web page. For
simplicity, we assume that a URL ends with a quotation mark " (line 43). The method obtains
a URL and adds it to a list (line 45). A line may contain multiple URLs. The method continues
to search for the next URL (line 46). If no URL is found in the line, current is set to -1 (line
49). The URLs contained in the page are returned in the form of a list (line 57).
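The line-by-line search can be sketched as a pure method over the page text, which makes the index arithmetic easy to follow. Operating on a String rather than a network stream, and the exact indexOf/substring bounds below, are our assumptions about one reasonable way to implement the search the text describes:

```java
import java.util.ArrayList;
import java.util.List;

public class SubURLSketch {
    // Scans page text line by line and extracts every URL that starts
    // with "http:" and ends at the next quotation mark.
    public static List<String> getSubURLs(String pageText) {
        List<String> list = new ArrayList<>();
        for (String line : pageText.split("\n")) {
            int current = line.indexOf("http:");             // first URL in this line
            while (current >= 0) {
                int endIndex = line.indexOf("\"", current);  // URL ends with "
                if (endIndex > 0) {
                    list.add(line.substring(current, endIndex));
                    current = line.indexOf("http:", endIndex); // next URL, same line
                }
                else {
                    current = -1;                            // no more URLs in this line
                }
            }
        }
        return list;
    }
}
```

For example, a line containing two href="http:..." attributes yields both URLs, while a line with no http: occurrence contributes nothing.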
The program terminates when the number of traversed URLs reaches 100 (line 18).
This is a simple program to traverse the Web. Later you will learn the techniques to make
the program more efficient and robust.
Check Point
12.39 Before a URL is added to listOfPendingURLs, line 25 checks whether it has been traversed. Is it possible that listOfPendingURLs contains duplicate URLs? If so, give an example.
KEY TERMS
absolute file name 473
chained exception 469
checked exception 457
declare exception 458
directory path 473
exception 450
exception propagation 459
relative file name 473
throw exception 452
unchecked exception 457