The program prompts the user to enter a URL string (line 6) and creates a URL object (line 9). The constructor throws a java.net.MalformedURLException (line 19) if the URL isn't formed correctly.
The program creates a Scanner object from the input stream for the URL (line 11). If the URL is formed correctly but does not exist, an IOException will be thrown (line 22). For example, http://google.com/index1.html has the correct form, but the URL itself does not exist, so an IOException would be thrown if it were used in this program.
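The listing itself is not reproduced in this excerpt, so the following is a minimal sketch of such a program (the class name ReadFromURL is assumed, and its line numbers will not match those cited above):

public class ReadFromURL {
  public static void main(String[] args) {
    System.out.print("Enter a URL: ");
    String urlString = new java.util.Scanner(System.in).next();

    try {
      // Throws MalformedURLException if the string is not a valid URL
      java.net.URL url = new java.net.URL(urlString);
      // openStream() throws IOException if the URL cannot be read
      java.util.Scanner input = new java.util.Scanner(url.openStream());
      while (input.hasNextLine())
        System.out.println(input.nextLine());
    }
    catch (java.net.MalformedURLException ex) {
      System.out.println("Invalid URL: " + urlString);
    }
    catch (java.io.IOException ex) {
      System.out.println("I/O error: " + ex.getMessage());
    }
  }
}

Note that MalformedURLException is a subclass of IOException, so its catch block must appear first.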
Check Point
12.38 How do you create a Scanner object for reading text from a URL?
12.13 Case Study: Web Crawler
Key Point: This case study develops a program that travels the Web by following hyperlinks.
The World Wide Web, abbreviated as WWW, W3, or Web, is a system of interlinked hypertext documents on the Internet. With a Web browser, you can view a document and follow its hyperlinks to view other documents. In this case study, we will develop a program that automatically traverses the documents on the Web by following the hyperlinks. This type of program is commonly known as a Web crawler. For simplicity, our program follows only hyperlinks that start with http://. Figure 12.11 shows an example of traversing the Web.
We start from a Web page that contains three URLs named URL1, URL2, and URL3. Following URL1 leads to a page that contains three URLs named URL11, URL12, and URL13. Following URL2 leads to a page that contains two URLs named URL21 and URL22. Following URL3 leads to a page that contains four URLs named URL31, URL32, URL33, and URL34. The traversal continues by following the new hyperlinks. As you can see, this process could continue forever, but the program exits once it has traversed 100 pages.
FIGURE 12.11 An example of traversing the Web by following hyperlinks from a starting URL.
The program follows the URLs to traverse the Web. To ensure that each URL is traversed only once, the program maintains two lists of URLs: one stores the URLs pending traversal, and the other stores the URLs that have already been traversed. The algorithm for this program can be described as follows:
Add the starting URL to a list named listOfPendingURLs;
while listOfPendingURLs is not empty and size of listOfTraversedURLs <= 100 {
  Remove a URL from listOfPendingURLs;
  if this URL is not in listOfTraversedURLs {
    Add it to listOfTraversedURLs;
    Display this URL;
    Read the page from this URL and for each URL contained in the page {
      Add it to listOfPendingURLs if it is not in listOfTraversedURLs;
    }
  }
}
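This algorithm translates naturally into Java. The following is a sketch of one possible implementation, not a reproduction of the complete listing; the helper method getSubURLs and its quotation-mark heuristic for locating URLs in a page are assumptions made for illustration.

import java.util.ArrayList;
import java.util.Scanner;

public class WebCrawler {
  public static void main(String[] args) {
    System.out.print("Enter a starting URL: ");
    String url = new Scanner(System.in).next();
    crawl(url); // traverse the Web starting from this URL
  }

  // Implements the algorithm above using two lists of URLs
  public static void crawl(String startingURL) {
    ArrayList<String> listOfPendingURLs = new ArrayList<>();
    ArrayList<String> listOfTraversedURLs = new ArrayList<>();

    listOfPendingURLs.add(startingURL);
    while (!listOfPendingURLs.isEmpty()
        && listOfTraversedURLs.size() <= 100) {
      String urlString = listOfPendingURLs.remove(0);
      if (!listOfTraversedURLs.contains(urlString)) {
        listOfTraversedURLs.add(urlString);
        System.out.println("Crawl " + urlString);

        // Add every new URL found in this page to the pending list
        for (String s : getSubURLs(urlString)) {
          if (!listOfTraversedURLs.contains(s))
            listOfPendingURLs.add(s);
        }
      }
    }
  }

  // Reads the page at urlString and collects each substring that
  // starts with http:// and ends at the next quotation mark
  public static ArrayList<String> getSubURLs(String urlString) {
    ArrayList<String> list = new ArrayList<>();
    try {
      java.net.URL url = new java.net.URL(urlString);
      Scanner input = new Scanner(url.openStream());
      while (input.hasNextLine()) {
        String line = input.nextLine();
        int current = line.indexOf("http://");
        while (current >= 0) {
          int endIndex = line.indexOf("\"", current);
          if (endIndex > 0) { // assume the URL ends with a quotation mark
            list.add(line.substring(current, endIndex));
            current = line.indexOf("http://", endIndex);
          }
          else
            current = -1; // no closing quotation mark on this line
        }
      }
    }
    catch (Exception ex) {
      System.out.println("Error: " + ex.getMessage());
    }
    return list;
  }
}

Two ArrayLists mirror the pseudocode directly; for a larger crawl, storing listOfTraversedURLs in a java.util.HashSet would make each contains test much faster.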
 
 
 