World Wide Web (Data Communications and Networking)

The Web was first conceived in 1989 by Sir Tim Berners-Lee at the European Particle Physics Laboratory (CERN) in Geneva. His original idea was to develop a database of information on physics research, but he found it difficult to fit the information into a traditional database. Instead, he decided to use a hypertext network of information. With hypertext, any document can contain a link to any other document.

CERN’s first Web browser was created in 1990, but it was 1991 before it was available on the Internet for other organizations to use. By the end of 1992, several browsers had been created for UNIX computers by CERN and several other European and American universities, and there were about 30 Web servers in the entire world. In 1993, Marc Andreessen, a student at the University of Illinois, led a team of students that wrote Mosaic, the first graphical Web browser, as part of a project for the university’s National Center for Supercomputing Applications (NCSA). By the end of 1993, the Mosaic browser was available for UNIX, Windows, and Macintosh computers, and there were about 200 Web servers in the world. In 1994, Andreessen and some colleagues left NCSA to form Netscape, and half a dozen other startup companies introduced commercial Web browsers. Within a year, it had become clear that the Web had changed the face of computing forever. NCSA stopped development of the Mosaic browser in 1996, as Netscape and Microsoft began to invest millions to improve their browsers.

How the Web Works

The Web is a good example of a two-tier client-server architecture (Figure 2.9). Each client computer needs an application layer software package called a Web browser.

Figure 2.9 How the Web works

There are many different browsers, such as Microsoft’s Internet Explorer. Each server on the network that will act as a Web server needs an application layer software package called a Web server. There are many different Web servers, such as those produced by Microsoft and Apache.

To get a page from the Web, the user must type the Internet uniform resource locator (URL) for the page he or she wants (e.g., www.yahoo.com) or click on a link that provides the URL. The URL specifies the Internet address of the Web server and the directory and name of the specific page wanted. If no directory and page are specified, the Web server will provide whatever page has been defined as the site’s home page.
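The way a URL breaks down into a server address, a directory, and a page name can be illustrated with a short Python sketch. The function name `split_url` and the dictionary keys are made up for this illustration; only `urlsplit` comes from Python's standard library.

```python
from urllib.parse import urlsplit


def split_url(url: str) -> dict:
    """Break a URL into the Web server address, the directory, and the page."""
    # urlsplit only recognizes the server name when a scheme is present,
    # so add one if the user typed a bare address such as "www.yahoo.com".
    if "://" not in url:
        url = "http://" + url
    parts = urlsplit(url)

    # An empty path means no directory or page was named; the server
    # will then decide which page to send (its home page).
    path = parts.path or "/"
    directory, _, page = path.rpartition("/")
    return {
        "server": parts.hostname,
        "directory": directory + "/",
        "page": page or None,  # None: let the server choose the home page
    }


if __name__ == "__main__":
    print(split_url("www.yahoo.com"))
    print(split_url("http://www.indiana.edu/~isdept/faculty.htm"))
```

For the bare address, the sketch reports no page name, which corresponds to the case in which the Web server supplies its home page.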

For the requests from the Web browser to be understood by the Web server, they must use the same standard protocol or language. If there were no standard and each Web browser used a different protocol to request pages, then it would be impossible for a Microsoft Web browser to communicate with an Apache Web server, for example.

The standard protocol for communication between a Web browser and a Web server is Hypertext Transfer Protocol (HTTP). To get a page from a Web server, the Web browser issues a special packet called an HTTP request that contains the URL and other information about the Web page requested (see Figure 2.9). Once the server receives the HTTP request, it processes it and sends back an HTTP response, which will be the requested page or an error message (see Figure 2.9).

This request-response dialogue occurs for every file transferred between the client and the server. For example, suppose the client requests a Web page that has two graphic images. Graphics are stored in separate files from the Web page itself using a different file format than the HTML used for the Web page (in JPEG [Joint Photographic Experts Group] format, for example). In this case, there would be three request-response pairs. First, the browser would issue a request for the Web page, and the server would send the response. Then, the browser would begin displaying the Web page and notice the two graphic files. The browser would then send a request for the first graphic and a request for the second graphic, and the server would reply with two separate HTTP responses, one for each request.
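The counting in this example can be sketched in Python. The sketch below is only an illustration under simple assumptions: it treats every `<img>` tag as one extra file to request, and the page name, image names, and the helper `files_to_request` are hypothetical. A real browser also fetches other kinds of files and applies caching.

```python
from html.parser import HTMLParser


class ImageFinder(HTMLParser):
    """Collects the src attribute of every <img> tag in a page."""

    def __init__(self):
        super().__init__()
        self.image_urls = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.image_urls.append(src)


def files_to_request(page_name: str, page_html: str) -> list:
    """List the files the browser asks for: the page first, then each graphic."""
    finder = ImageFinder()
    finder.feed(page_html)
    return [page_name] + finder.image_urls


# Hypothetical page containing two graphic images.
SAMPLE_PAGE = (
    "<html><body><h1>Faculty</h1>"
    '<img src="photo1.jpg"><img src="photo2.jpg">'
    "</body></html>"
)

if __name__ == "__main__":
    files = files_to_request("faculty.htm", SAMPLE_PAGE)
    print(len(files), "request-response pairs:", files)
```

Each entry in the returned list corresponds to one HTTP request and its matching HTTP response.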

Inside an HTTP Request

The HTTP request and HTTP response are examples of the packets introduced earlier that are produced by the application layer and sent down to the transport, network, data link, and physical layers for transmission through the network. The HTTP response and HTTP request are simple text messages that take the information provided by the application (e.g., the URL to get) and format it in a structured way so that the receiver of the message can clearly understand it.

An HTTP request from a Web browser to a Web server has three parts. The first two parts are required; the last is optional. The parts are:

• The request line, which starts with a command (e.g., GET), names the Web page wanted, and ends with the HTTP version number that the browser understands; the version number ensures that the Web server does not attempt to use a more advanced or newer version of the HTTP standard that the browser does not understand.

• The request header, which contains a variety of optional information such as the Web browser being used (e.g., Internet Explorer) and the date.

• The request body, which contains information sent to the server, such as information that the user has typed into a form.
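The three parts listed above can be assembled into the text of a request with a few lines of Python. This is a minimal sketch, not the code any particular browser uses: the function `build_http_request`, the host `www.example.com`, the page `/example.htm`, and the browser name are all placeholders chosen for illustration.

```python
def build_http_request(method, path, host, headers=None, body=""):
    """Assemble the text of an HTTP/1.1 request from its three parts."""
    # Part 1: the request line (command, page wanted, HTTP version).
    request_line = f"{method} {path} HTTP/1.1"

    # Part 2: the request header. HTTP/1.1 expects a Host field;
    # the remaining fields are optional extras supplied by the caller.
    header_lines = [f"Host: {host}"]
    for name, value in (headers or {}).items():
        header_lines.append(f"{name}: {value}")
    if body:
        # Tell the server how many bytes of body follow the header.
        header_lines.append(f"Content-Length: {len(body.encode('utf-8'))}")

    # Part 3: the optional request body, separated from the header
    # by a blank line. Lines end with carriage return + line feed.
    return "\r\n".join([request_line, *header_lines, "", body])


if __name__ == "__main__":
    # A request with no body, as when simply fetching a page.
    print(build_http_request(
        "GET", "/example.htm", "www.example.com",
        headers={"User-Agent": "ExampleBrowser/1.0"},
    ))
    # A request whose body carries information typed into a form.
    print(build_http_request("POST", "/form", "www.example.com", body="name=Ann"))
```

The first call produces only a request line and a request header; the second adds a request body, as a browser would when sending form data.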

Figure 2.10 shows an example of an HTTP request for a page on our Web server, formatted using version 1.1 of the HTTP standard. This request has only the request line and the request header, because no request body is needed for this request. This request includes the date and time of the request (expressed in Greenwich Mean Time [GMT], the time zone that runs through London) and the name of the browser used (Mozilla is the code name for the browser). The referrer field (spelled "Referer" in the HTTP standard) means that the user obtained the URL for this Web page by clicking on a link on another page, which in this case is a list of faculty at Indiana University (i.e., www.indiana.edu/~isdept/faculty.htm). If the referrer field is blank, the user most likely typed the URL him- or herself. You can see inside HTTP headers yourself at www.rexswain.com/httpview.html.

Inside an HTTP Response

The format of an HTTP response from the server to the browser is very similar to the HTTP request. It, too, has three parts, with the first required and the last two optional:

• The response status, which contains the HTTP version number the server has used, a status code (e.g., 200 means "okay"; 404 means "not found"), and a reason phrase (a text description of the status code).

• The response header, which contains a variety of optional information, such as the Web server being used (e.g., Apache), the date, and the exact URL of the page in the response.

• The response body, which is the Web page itself.
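Going the other way, a browser must split a response back into these three parts. The Python sketch below shows one simple way to do so; the function `parse_http_response` and the sample response text (including its date and page content) are invented for illustration and are not taken from the figures.

```python
def parse_http_response(raw: str) -> dict:
    """Split the text of an HTTP response into status, header, and body."""
    # A blank line separates the status line and header from the body.
    head, _, body = raw.partition("\r\n\r\n")
    status_line, *header_lines = head.split("\r\n")

    # Response status: HTTP version, status code, and reason phrase.
    pieces = status_line.split(" ", 2)
    version, code = pieces[0], int(pieces[1])
    reason = pieces[2] if len(pieces) > 2 else ""

    # Response header: one "Name: value" field per line. Split on the
    # first colon only, because values such as dates contain colons.
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()

    return {"version": version, "status": code, "reason": reason,
            "headers": headers, "body": body}


# Made-up response used only to exercise the parser.
SAMPLE_RESPONSE = (
    "HTTP/1.1 200 OK\r\n"
    "Date: Mon, 01 Jan 2001 12:00:00 GMT\r\n"
    "Server: Apache\r\n"
    "Content-Type: text/html\r\n"
    "\r\n"
    "<html><body>Hello</body></html>"
)

if __name__ == "__main__":
    response = parse_http_response(SAMPLE_RESPONSE)
    print(response["status"], response["reason"])
    print(response["headers"]["Server"])
    print(response["body"])
```

The status code is converted to a number so that a program can test it directly, for example to distinguish 200 from 404.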

Figure 2.10 An example of a request from a Web browser to a Web server using the HTTP (Hypertext Transfer Protocol) standard

Figure 2.11 shows an example of a response from our Web server to the request in Figure 2.10. This example has all three parts. The response status reports "OK," which means the requested URL was found and is included in the response body. The response header provides the date, the type of Web server software used, the actual URL included in the response body, and the type of file. In most cases, the actual URL and the requested URL are the same, but not always. For example, if you request a URL but do not specify a file name (e.g., www.indiana.edu), you will receive whatever file is defined as the home page for that server, so the actual URL will be different from the requested URL.

Figure 2.11 An example of a response from a Web server to a Web browser using the HTTP standard

MANAGEMENT FOCUS 2.2

Free Speech Reigns on the Internet … or Does It?

In a landmark decision in 1997, the U.S. Supreme Court ruled that the sections of the 1996 Telecommunications Act restricting the publication of indecent material on the Web and the sending of indecent e-mail were unconstitutional. This means that anyone can do anything on the Internet, right?

Well, not really. The court decision affects only Internet servers located in the United States. Each country in the world has different laws that govern what may and may not be placed on servers in that country. For example, British law restricts the publication of pornography, whether on paper or on Internet servers.

Many countries such as Singapore, Saudi Arabia, and China prohibit the publication of certain political information. Because much of this "subversive" information is published outside of their countries, they actively restrict access to servers in other countries.

Other countries are very concerned about their individual cultures. In 1997, a French court convicted Georgia Institute of Technology of violating French language law. Georgia Tech operates a small campus in France that offers summer programs for American students. The information on the campus Web server was primarily in English because classes are conducted in English. This violated the law requiring French to be the predominant language on all Internet servers in France.

The most likely source of problems for North Americans lies in copyright law. Free speech does not give permission to copy from others. It is against the law to copy and republish on the Web any copyrighted material or any material produced by someone else without explicit permission. So don’t copy graphics from someone else’s Web site or post your favorite cartoon on your Web site, unless you want to face a lawsuit.

The response body in this example shows a Web page in Hypertext Markup Language (HTML). The response body can be in any format, such as text, Microsoft Word, Adobe PDF, or a host of other formats, but the most commonly used format is HTML. HTML was developed by CERN at the same time as the first Web browser and has evolved rapidly ever since. HTML is covered by published standards (an early version was issued by the IETF, and later versions by the World Wide Web Consortium), but Microsoft keeps making new additions to HTML with every release of its browser, so the HTML in actual use keeps changing.