  Remove a URL from listOfPendingURLs;
  if this URL is not in listOfTraversedURLs {
    Add it to listOfTraversedURLs;
    Display this URL;
    Read the page from this URL and for each URL contained in the page {
      Add it to listOfPendingURLs if it is not in listOfTraversedURLs;
    }
  }
}
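The traversal order these steps produce can be checked without any network access. The following is a minimal sketch, not the book's program: it assumes the pages and their links are supplied as an in-memory map, and the names `CrawlOrderSketch`, `crawl`, `links`, and `limit` are hypothetical, introduced only for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: the class, method, and parameter names are illustrative.
class CrawlOrderSketch {
  /**
   * Traverses an in-memory link graph using a pending list and a traversed list.
   * links maps each URL to the URLs found on its page (an assumption standing in
   * for reading a real page). Stops once limit URLs have been traversed.
   */
  static List<String> crawl(Map<String, List<String>> links,
                            String startingURL, int limit) {
    List<String> listOfPendingURLs = new ArrayList<>();
    List<String> listOfTraversedURLs = new ArrayList<>();
    listOfPendingURLs.add(startingURL);

    while (!listOfPendingURLs.isEmpty() && listOfTraversedURLs.size() < limit) {
      // Remove a URL from the front of the pending list
      String url = listOfPendingURLs.remove(0);
      if (!listOfTraversedURLs.contains(url)) {
        listOfTraversedURLs.add(url);
        // Queue every linked URL that has not been traversed yet
        for (String s : links.getOrDefault(url, List.of())) {
          if (!listOfTraversedURLs.contains(s)) {
            listOfPendingURLs.add(s);
          }
        }
      }
    }
    return listOfTraversedURLs;
  }

  public static void main(String[] args) {
    Map<String, List<String>> links = Map.of(
        "a", List.of("b", "c"),
        "b", List.of("a", "d"),
        "c", List.of("d"),
        "d", List.of());
    System.out.println(crawl(links, "a", 100)); // prints [a, b, c, d]
  }
}
```

Because URLs are always removed from the front of the pending list, pages are visited in breadth-first order. A URL can appear in the pending list more than once (here `d` is queued by both `b` and `c`), which is why the traversed check is repeated when the URL is removed.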
Listing 12.18 gives the program that implements this algorithm.
LISTING 12.18 WebCrawler.java
import java.util.Scanner;
import java.util.ArrayList;

public class WebCrawler {
  public static void main(String[] args) {
    java.util.Scanner input = new java.util.Scanner(System.in);
    System.out.print("Enter a URL: "); // Enter a URL
    String url = input.nextLine();
    crawler(url); // Traverse the Web from the starting URL
  }

  public static void crawler(String startingURL) {
    ArrayList<String> listOfPendingURLs = new ArrayList<>(); // List of pending URLs
    ArrayList<String> listOfTraversedURLs = new ArrayList<>(); // List of traversed URLs

    listOfPendingURLs.add(startingURL); // Add the starting URL
    while (!listOfPendingURLs.isEmpty() &&
        listOfTraversedURLs.size() <= 100) {
      String urlString = listOfPendingURLs.remove(0); // Get the first URL
      if (!listOfTraversedURLs.contains(urlString)) {
        listOfTraversedURLs.add(urlString); // URL traversed
        System.out.println("Crawl " + urlString);

        for (String s: getSubURLs(urlString)) {
          if (!listOfTraversedURLs.contains(s))
            listOfPendingURLs.add(s); // Add a new URL
        }
      }
    }
  }

  public static ArrayList<String> getSubURLs(String urlString) {
    ArrayList<String> list = new ArrayList<>();

    try {
      java.net.URL url = new java.net.URL(urlString);
      Scanner input = new Scanner(url.openStream());
      int current = 0;
      while (input.hasNext()) {
        String line = input.nextLine(); // Read a line
        current = line.indexOf("http:", current); // Search for a URL
        while (current > 0) {
          int endIndex = line.indexOf("\"", current); // End of a URL: the URL ends with "
          if (endIndex > 0) { // Ensure that a correct URL is found
            list.add(line.substring(current, endIndex)); // Extract a URL
            current = line.indexOf("http:", endIndex); // Search for the next URL
          }