The program prompts the user to enter a URL string (line 6) and creates a URL object (line 9). The constructor throws a java.net.MalformedURLException (line 19) if the URL isn't formed correctly.
The program creates a Scanner object from the input stream for the URL (line 11). If the URL is formed correctly but does not exist, an IOException will be thrown (line 22). For example, http://google.com/index1.html has the correct form, but the URL itself does not exist, so an IOException would be thrown if it were used in this program.
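The listing itself is not reproduced in this excerpt, so the following is a minimal sketch of such a program (the class name ReadFromURL is assumed, and its line numbers will not match those cited above):

public class ReadFromURL {
  public static void main(String[] args) {
    System.out.print("Enter a URL: ");
    String urlString = new java.util.Scanner(System.in).next();

    try {
      // Throws MalformedURLException if the string is not a valid URL
      java.net.URL url = new java.net.URL(urlString);
      // openStream() throws IOException if the URL cannot be read
      java.util.Scanner input = new java.util.Scanner(url.openStream());
      while (input.hasNextLine())
        System.out.println(input.nextLine());
    }
    catch (java.net.MalformedURLException ex) {
      System.out.println("Invalid URL: " + urlString);
    }
    catch (java.io.IOException ex) {
      System.out.println("I/O error: " + ex.getMessage());
    }
  }
}

Note that MalformedURLException is a subclass of IOException, so its catch block must appear first.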
Check Point
12.38 How do you create a Scanner object for reading text from a URL?
12.13 Case Study: Web Crawler
Key Point: This case study develops a program that travels the Web by following hyperlinks.
The World Wide Web, abbreviated as WWW, W3, or Web, is a system of interlinked hypertext documents on the Internet. With a Web browser, you can view a document and follow its hyperlinks to view other documents. In this case study, we will develop a program that automatically traverses the documents on the Web by following the hyperlinks. This type of program is commonly known as a Web crawler. For simplicity, our program follows only hyperlinks that start with http://. Figure 12.11 shows an example of traversing the Web.
We start from a Web page that contains three URLs named URL1, URL2, and URL3. Following URL1 leads to a page that contains three URLs named URL11, URL12, and URL13. Following URL2 leads to a page that contains two URLs named URL21 and URL22. Following URL3 leads to a page that contains four URLs named URL31, URL32, URL33, and URL34. The traversal continues by following the new hyperlinks. As you can see, this process could continue forever, but the program exits once it has traversed 100 pages.
FIGURE 12.11 An example of traversing the Web by following hyperlinks from a starting URL.
The program follows the URLs to traverse the Web. To ensure that each URL is traversed only once, the program maintains two lists of URLs: one stores the URLs pending traversal, and the other stores the URLs that have already been traversed. The algorithm for this program can be described as follows:
Add the starting URL to a list named listOfPendingURLs;
while listOfPendingURLs is not empty and size of listOfTraversedURLs <= 100 {
  Remove a URL from listOfPendingURLs;
  if this URL is not in listOfTraversedURLs {
    Add it to listOfTraversedURLs;
    Display this URL;
    Read the page from this URL and for each URL contained in the page {
      Add it to listOfPendingURLs if it is not in listOfTraversedURLs;
    }
  }
}
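This algorithm translates naturally into Java. The following is a sketch of one possible implementation, not a reproduction of the complete listing; the helper method getSubURLs and its quotation-mark heuristic for locating URLs in a page are assumptions made for illustration.

import java.util.ArrayList;
import java.util.Scanner;

public class WebCrawler {
  public static void main(String[] args) {
    System.out.print("Enter a starting URL: ");
    String url = new Scanner(System.in).next();
    crawl(url); // traverse the Web starting from this URL
  }

  // Implements the algorithm above using two lists of URLs
  public static void crawl(String startingURL) {
    ArrayList<String> listOfPendingURLs = new ArrayList<>();
    ArrayList<String> listOfTraversedURLs = new ArrayList<>();

    listOfPendingURLs.add(startingURL);
    while (!listOfPendingURLs.isEmpty()
        && listOfTraversedURLs.size() <= 100) {
      String urlString = listOfPendingURLs.remove(0);
      if (!listOfTraversedURLs.contains(urlString)) {
        listOfTraversedURLs.add(urlString);
        System.out.println("Crawl " + urlString);

        // Add every new URL found in this page to the pending list
        for (String s : getSubURLs(urlString)) {
          if (!listOfTraversedURLs.contains(s))
            listOfPendingURLs.add(s);
        }
      }
    }
  }

  // Reads the page at urlString and collects each substring that
  // starts with http:// and ends at the next quotation mark
  public static ArrayList<String> getSubURLs(String urlString) {
    ArrayList<String> list = new ArrayList<>();
    try {
      java.net.URL url = new java.net.URL(urlString);
      Scanner input = new Scanner(url.openStream());
      while (input.hasNextLine()) {
        String line = input.nextLine();
        int current = line.indexOf("http://");
        while (current >= 0) {
          int endIndex = line.indexOf("\"", current);
          if (endIndex > 0) { // assume the URL ends with a quotation mark
            list.add(line.substring(current, endIndex));
            current = line.indexOf("http://", endIndex);
          }
          else
            current = -1; // no closing quotation mark on this line
        }
      }
    }
    catch (Exception ex) {
      System.out.println("Error: " + ex.getMessage());
    }
    return list;
  }
}

Two ArrayLists mirror the pseudocode directly; for a larger crawl, storing listOfTraversedURLs in a java.util.HashSet would make each contains test much faster.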
 
 
 