    Remove a URL from listOfPendingURLs;
    if this URL is not in listOfTraversedURLs {
      Add it to listOfTraversedURLs;
      Display this URL;
      Read the page from this URL and for each URL contained in the page {
        Add it to listOfPendingURLs if it is not in listOfTraversedURLs;
      }
    }
  }
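The steps above can be tried without network access. The following is a minimal sketch, not the book's program: it assumes the steps run inside a loop that continues while listOfPendingURLs is not empty, and it replaces page reading with a hypothetical in-memory map from each URL to the URLs it links to. The class name CrawlSketch, the method crawl, and the limit parameter are illustrative names, and a queue plus a HashSet are swapped in for the two lists so that membership checks do not scan a list.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

class CrawlSketch {
  /**
   * Traverses a hypothetical in-memory link graph breadth-first and
   * returns the URLs in the order they were traversed.
   * Stops once `limit` URLs have been traversed.
   */
  static List<String> crawl(String start, Map<String, List<String>> links, int limit) {
    Queue<String> pending = new ArrayDeque<>();   // plays the role of listOfPendingURLs
    Set<String> seen = new HashSet<>();           // fast membership test
    List<String> traversed = new ArrayList<>();   // plays the role of listOfTraversedURLs

    pending.add(start);
    while (!pending.isEmpty() && traversed.size() < limit) {
      String url = pending.remove();              // remove a URL from the pending queue
      if (seen.add(url)) {                        // add() returns true only for a new URL
        traversed.add(url);
        // "Read the page": look up the outgoing links in the map (assumption)
        for (String s : links.getOrDefault(url, List.of())) {
          if (!seen.contains(s)) {
            pending.add(s);
          }
        }
      }
    }
    return traversed;
  }

  public static void main(String[] args) {
    Map<String, List<String>> links = Map.of(
        "a", List.of("b", "c"),
        "b", List.of("a", "d"),
        "c", List.of("d"));
    System.out.println(crawl("a", links, 100)); // prints [a, b, c, d]
  }
}
```

Here "d" is added to the pending queue twice (once from "b", once from "c"), but the check on removal ensures it is traversed only once, which is why the pseudocode tests membership both when removing and when adding.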
Listing 12.18 gives the program that implements this algorithm.
LISTING 12.18 WebCrawler.java
import java.util.Scanner;
import java.util.ArrayList;

public class WebCrawler {
  public static void main(String[] args) {
    Scanner input = new Scanner(System.in);
    System.out.print("Enter a URL: "); // Enter a URL
    String url = input.nextLine();
    crawler(url); // Traverse the Web from the starting URL
  }

  public static void crawler(String startingURL) {
    ArrayList<String> listOfPendingURLs = new ArrayList<>(); // List of pending URLs
    ArrayList<String> listOfTraversedURLs = new ArrayList<>(); // List of traversed URLs

    listOfPendingURLs.add(startingURL); // Add the starting URL
    while (!listOfPendingURLs.isEmpty() &&
        listOfTraversedURLs.size() <= 100) {
      String urlString = listOfPendingURLs.remove(0); // Get the first URL
      if (!listOfTraversedURLs.contains(urlString)) {
        listOfTraversedURLs.add(urlString); // URL traversed
        System.out.println("Crawl " + urlString);

        for (String s: getSubURLs(urlString)) {
          if (!listOfTraversedURLs.contains(s))
            listOfPendingURLs.add(s); // Add a new URL
        }
      }
    }
  }

  public static ArrayList<String> getSubURLs(String urlString) {
    ArrayList<String> list = new ArrayList<>();

    try {
      java.net.URL url = new java.net.URL(urlString);
      Scanner input = new Scanner(url.openStream());
      int current = 0;
      while (input.hasNext()) {
        String line = input.nextLine(); // Read a line
        current = line.indexOf("http:", current); // Search for a URL
        while (current > 0) {
          int endIndex = line.indexOf("\"", current); // End of a URL: the URL ends with "
          if (endIndex > 0) { // Ensure that a correct URL is found
            list.add(line.substring(current, endIndex)); // Extract a URL
            current = line.indexOf("http:", endIndex); // Search for the next URL
          }