Web Directories for Information Organization on Web Portals

INTRODUCTION

Two methods are currently used to organize and retrieve information on the Internet. Search engines like Google and AltaVista use a robot-based keyword searching method by constructing inverted index files for Web pages and matching users’ query terms with the index terms. The other method organizes human-selected Internet resources into a searchable database, and gives users structured hierarchical access to the database in a similar way to browsing through library classification schemes. We call this structured hierarchical system a Web directory. Knowledge structures, like a library classification schema or a Web directory, visualize and reflect what people know about things, and help people understand things better, identify gaps, recognize patterns, predict future trends, and so forth (Kwasnik, 2005). Moreover, Web directories offer quality control and give access only to selected Internet resources. All these advantages make the browsing structure based on subject classification a desirable complement to the search engine type service (Koch, Day, Brummer, Hiom, Peereboom, Poulter, & Worsfold, 1997).
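The keyword method described above can be illustrated with a minimal sketch. The page IDs, page texts, and function names below are hypothetical, and the AND-only matching is a simplifying assumption rather than a description of how any particular search engine works.

```python
from collections import defaultdict

# Hypothetical mini-collection of Web pages (IDs and texts are made up).
PAGES = {
    "p1": "library classification schemes for web resources",
    "p2": "yahoo web directory of selected resources",
    "p3": "search engines index web pages with robots",
}


def build_inverted_index(pages):
    """Map each term to the set of page IDs whose text contains it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for term in text.lower().split():
            index[term].add(page_id)
    return index


def keyword_search(index, query):
    """Return the IDs of pages that contain every query term (AND matching)."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


INDEX = build_inverted_index(PAGES)
print(sorted(keyword_search(INDEX, "web resources")))
```

A Web directory, by contrast, would not match terms at all: an editor would place each selected page under a heading, and the user would reach it by browsing down the hierarchy.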

Since the first widely known Web directory was constructed by Yahoo! in 1993, many such directories have been built. Even the most popular robot-based search engines, such as Google and AltaVista, maintain their own directories. Meanwhile, many researchers have tried to use traditional library classification schemes, such as the Dewey Decimal Classification, to organize Internet resources. In the Dewey Decimal Classification (DDC) Online Project, Markey demonstrated the first implementation of a library classification scheme for end-user subject access, browsing, and display (Vizine-Goetz, 1999). Currently, not only are the international general classification schemes (also called universal classification schemes), such as DDC, Universal Decimal Classification (UDC), and Library of Congress Classification (LCC), employed, but so are some national classification schemes and subject-specific classification schemes. Koch, Day, Brummer et al. (1997) presented perhaps the most comprehensive study and comparison so far on the use of library classification schemes in organizing Internet information resources. They investigated three types of schemes (universal classification schemes, national general schemes, and subject-specific schemes) in terms of extent of usage, multilingual capability, strengths and weaknesses, integration between the classification scheme and other systems (e.g., controlled subject headings), linking to third-party classification data, digital availability, copyright, and extensibility.

As Marcella and Newton noted, “the whole object of classification … is to create and preserve a subject order of maximum helpfulness to information seekers” (Van der Walt, 1998). At a time when both Internet-based classification schemes and traditional library classification systems are being used to provide access to Web resources, it is natural to compare the two and consider whether homegrown Web directories outperform the traditional library classification schemes in organizing information resources on the Internet. This will enable us to take advantage of their respective strengths and design more effective Web portals.

BACKGROUND

The literature on library classification is voluminous. However, only a limited number of articles have addressed the application of library classification schemes to organizing information on the Internet. Likewise, not many authors have written about Web directories (compared to the vast pool of literature on automatic retrieval systems such as search engines). Even fewer have tried to juxtapose the two.

Among these studies, three articles are most closely related to the topic of this article. Van der Walt (1998) investigated some of the main structural features of the classification schemes used in the directories of search engines in order to determine whether they conform to the principles of library classification. The author examined 10 search engines at the main class level, analyzed the full hierarchies of a sample of three specific subjects in four of the search engines, and identified a number of differences between the principles of constructing library classification schemes and those of Internet classifications. Ma (2001) compared the principles of designing traditional classification schemes and Web directories and pointed out some characteristics of the structure of Web directories. He noted that all the characteristics were determined by the Internet environment in which the directories functioned. Vizine-Goetz (1999) reviewed the major characteristics of DDC and LCC and assessed whether the electronic versions of these schemes could be successfully extended to the Internet. By comparing the Yahoo! and DDC classifications, the author concluded with some recommendations for improvements that online library classification schemes will need to make if they are to be used in the Internet environment.

In addition, some authors have written about the influence of Colon Classification on Web directories. Chen and Fan (1999) analyzed the classification system used in the Yahoo! directory and noted that it had a close relationship with the Colon Classification idea proposed by the famous Indian classification scientist Ranganathan. Chan (2000) quoted Aimee Glassel to analyze the application of Colon Classification to Yahoo! and noted that “both systems are based on combining facets to facilitate searching and maximize the number of relevant results.” It was argued that “Ranganathan’s ideas of classification are more applicable now than before in the Internet environment.”

In this article, the author will study the structure of current Web directories and compare it with major universal library classifications. Focus will be on their main classes with some additional discussions on hierarchical structures. The study does not emphasize a specific Web directory or a specific library classification scheme; instead, it refers to a number of Web directories and library classification schemes as examples to support the arguments. Considering the scope of this article, only the comprehensive Web directories used in major Web portals and the universal classification schemes (like DDC, LCC and UDC) will be studied.

WEB DIRECTORIES VS. LIBRARY CLASSIFICATIONS

Comparison

Web Directories

Figures 1 and 2 display the first pages of the Yahoo! directory and the Google directory, and Table 1 is a mapping between them. It is easy to see that the two classification schemes match quite well at the main class level. All the Yahoo! main classes except “Government” have counterparts among the Google main classes. Conversely, all the Google main classes except “Home” and “Kids and Teens” have counterparts among Yahoo!’s main classes. In fact, the main classes in most other Web directories are organized in a similar way, so the differences between Web directories can be neglected when they are compared with traditional library classifications.

Library Classification Schemes

Figures 3 and 4 display the main classes of the Dewey Decimal Classification scheme and the Library of Congress Classification scheme. Table 2 compares the two. Again, the main classes in these two classification schemes match quite well with each other. Differences between them will also be neglected when they are compared with Web directories.

Figure 1. The main classes in Yahoo! Directory

Figure 2. The main classes in Google Directory

What is Different?

In contrast to the high degree of consistency among Web directory main classes and among library classification main classes, a comparison between the Yahoo! classification and the LCC scheme reveals tremendous differences at the main class level.

The first, and most obvious, difference is that only four of the Yahoo! main classes coincide with main classes in the DDC scheme or LCC scheme: “Arts & Humanities,” “Science,” and “Social Science” in Yahoo! with 700 “The arts” [...] Google do not map with any specific class in DDC or LCC. The reason for this discrepancy will be discussed.

Table 1. Mapping between Yahoo! and Google directories

Google | Yahoo!
Arts | Arts & Humanities
Business/Shopping | Business & Economy/(B2B, Finance, Shopping, Jobs)
Computers | Computers & Internet
Games | Entertainment
Health | Health
Home | -
Kids and Teens | -
News | News and Media
Recreation (Education, Libraries, Maps, ...)/Sports | Recreation & Sports/Education
Regional/World | Regional
Science | Science/Social Science
Society | Society & Culture
- | Government

Figure 3. The main classes in DDC 22, published in mid-2003

Figure 4. The main classes in Library of Congress Classification

Table 2. Mapping between LCC and DDC at main class level

LCC | DDC
A -- GENERAL WORKS | 000 Generalities
B -- PHILOSOPHY. PSYCHOLOGY. RELIGION | 100 Philosophy and psychology; 200 Religion
C-F -- HISTORY; G -- GEOGRAPHY. ANTHROPOLOGY. RECREATION | 900 Geography and history
H -- SOCIAL SCIENCES; J -- POLITICAL SCIENCE; K -- LAW; L -- EDUCATION | 300 Social sciences
M -- MUSIC AND BOOKS ON MUSIC; N -- FINE ARTS | 700 The arts
P -- LANGUAGE AND LITERATURE | 400 Language; 800 Literature and rhetoric
Q -- SCIENCE | 500 Natural sciences and mathematics
R -- MEDICINE; S -- AGRICULTURE; T -- TECHNOLOGY | 600 Technology (applied sciences)
U -- MILITARY SCIENCE; V -- NAVAL SCIENCE | -
Z -- BIBLIOGRAPHY. LIBRARY SCIENCE. INFORMATION RESOURCES | -

A second major difference between the LCC scheme and the Yahoo! classification concerns the principles of division used to form the main classes (and lower-level classes as well). It is well known that library schemes follow the basic principle of classification by discipline (logical division). At least half of the terms used in the LCC scheme can be described as disciplines, such as Agriculture, History, Geography, Education, Law, and Psychology, or as groups of related disciplines, such as the Arts, Natural Sciences, Social Sciences, and Technology. However, an analysis of the terms used in the main classes of the Yahoo! and Google classifications reveals that they represent a number of conceptual categories used as principles of division, including:

• disciplines or groups of disciplines, such as “Arts & Humanities,” “Education,” “Social Science,” and “Science”;

• broad to relatively specific subjects, such as “Computers,” “Government,” “Internet,” and “Shopping”;

• bibliographic forms, such as “News,” “Reference,” and “Media”;

• geographic concepts, such as “World” and “Regional”;

• target audience, such as “Kids and Teens.”

Obviously, the classes at a given level are not mutually exclusive, which deviates from the accepted logical principle of classification. This inevitably causes uncertainty for users when they have to select a category in which to look for information. For instance, if people are interested in universities in the United Kingdom, where should they start: “Education” or “Regional”?
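The overlap can be made concrete with a small sketch. The resource record, its field names, and the membership rules below are all hypothetical; they only illustrate what happens when sibling classes are formed by different characteristics of division (subject, geography, audience).

```python
# Hypothetical description of one resource; the field names are illustrative.
RESOURCE = {
    "title": "Universities in the United Kingdom",
    "subject": "education",
    "region": "united kingdom",
}

# Assumed membership rules: each main class tests a different characteristic.
MAIN_CLASSES = {
    "Education": lambda r: r.get("subject") == "education",
    "Regional": lambda r: r.get("region") is not None,
    "Kids and Teens": lambda r: r.get("audience") == "children",
}


def candidate_classes(resource, classes):
    """List every main class whose rule accepts the resource."""
    return sorted(name for name, rule in classes.items() if rule(resource))


def is_mutually_exclusive(resources, classes):
    """True only if no resource fits more than one class."""
    return all(len(candidate_classes(r, classes)) <= 1 for r in resources)


print(candidate_classes(RESOURCE, MAIN_CLASSES))
print(is_mutually_exclusive([RESOURCE], MAIN_CLASSES))
```

Because the single resource satisfies both a subject rule and a geographic rule, the check fails, which is the situation a user faces when choosing a starting category.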

The third difference concerns the class headings. Although some terms denoting disciplines are used as main class headings in the Web directories, the general tendency is to prefer terms for objects of study, such as “Computers” and “Games,” or activities, such as “Shopping,” rather than the names of fields of study. Sometimes the discipline is even used as a subdivision under the object of study. For example, “Library and Information Science” comes under “Libraries” in the Yahoo! directory.

Why Different?

In short, the major differences between the main classes in the Web directories and those in the library classifications stem from the different approaches by which they are designed.

The library classifications follow a discipline-based approach in designing main classes (and subordinate classes). Each division follows the logical rules of classification (i.e., totally inclusive and mutually exclusive). The Web directories, however, use a concept-based approach. The distribution of information resources and the frequency of usage are the rules of thumb when deciding main classes. On the Internet, business information and entertainment information take the lion’s share, while academic information, which is abundant in library collections, occupies a minor position. Therefore, most of the main classes in the Web directories address daily life topics such as business, entertainment, recreation and sports, and health, while the classes for academic resources are combined into groups that are larger than those in library classifications. The designers of Web directories also adjust the main classes according to the usage of resources in each class. Popular topics, such as computers and the Internet, shopping, and games, gain higher status in the hierarchical structure because they are searched by more users; putting them on the first page of the classification therefore saves users time on average.
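As a rough illustration of this rule of thumb, the sketch below ranks candidate classes by a blend of resource volume and usage. The counts are invented for the example, and the weighted sum and the function names are assumptions; the source does not say how any directory actually weighs the two factors.

```python
# Made-up counts for illustration; not measurements from any real directory.
CANDIDATES = {
    "Business": {"resources": 9000, "visits": 7000},
    "Games": {"resources": 3000, "visits": 8000},
    "Philosophy": {"resources": 400, "visits": 300},
    "Linguistics": {"resources": 300, "visits": 200},
}


def warrant_score(stats, resource_weight=0.5):
    """Blend resource volume and usage frequency into one score (assumed formula)."""
    usage_weight = 1 - resource_weight
    return resource_weight * stats["resources"] + usage_weight * stats["visits"]


def pick_main_classes(candidates, slots):
    """Return (classes promoted to the first page, classes left to be grouped)."""
    ranked = sorted(
        candidates,
        key=lambda name: warrant_score(candidates[name]),
        reverse=True,
    )
    return ranked[:slots], ranked[slots:]


promoted, grouped = pick_main_classes(CANDIDATES, slots=2)
print(promoted, grouped)
```

With these invented numbers the everyday topics fill the first-page slots, while the academic topics fall into the remainder that a designer would merge into broader groups.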

Which is Better?

Considering the distribution of information resources and the frequency of usage when constructing the main classes in the Web directories is in line with the widely recognized library science principles of “literary warrant” and “use warrant.” It has the advantage of not scattering related materials in the way a discipline-based scheme typically does. In addition, Web directories use more popular, everyday terms as class headings, which suits users in the Internet environment.

On the other hand, however, violation of the basic logical rules in Web directories makes it difficult for a Web user to choose the access point in the hierarchical structures when looking for information. Although this can be partly adjusted through cross references in subordinate hierarchies (as discussed next), the author believes that basic logical rules followed in library classifications still need to be carefully observed in designing Internet classifications, especially at higher levels. Adequate evidence of extensive resources or usage must be collected before any adjustment of class level is made to avoid illogical hierarchical structure. The practice of putting “Education” under “Reference” in one Web directory is at least a puzzling one, if not illogical.

Some Discussions on Hierarchical Structures

Hierarchical subdivision, progressing from the broadest to the most specific class headings, is one of the most basic structures of any classification scheme (Van der Walt, 1998). As mentioned before, the concept-based vs. discipline-based approaches define the major differences between library classifications and Web directories at the main class level. Such differences continue at lower levels in hierarchical division.

Figure 5 shows the second-level division under the main class “Arts” in the Open Directory Project, a well-known Web directory. Art forms (e.g., architecture, comics, crafts, dance, music, etc.), media (e.g., radio, television, etc.), artistic methods (e.g., animation, costumes, etc.), topics in arts research (arts history, classical studies), and so on, are employed as principles of division at this level. This not only causes confusion when a person browsing must choose a path down the hierarchical structure, but also raises the problem of representing the horizontal relationships between classes.

In the library environment, a resource (periodical, etc.) can only be stored in one physical location. This determines the linear structure of library classifications. With all the classes at the same level being mutually exclusive, a resource will go under either this class or that one; it cannot be grouped into more than one category. With the development of science and technology, however, more and more interdisciplinary areas come into existence. Very often, a topic (e.g., bioinformatics) has a logical relationship with multiple upper-level concepts. In this case, library classifications use cross references indicated by “see” or “see also” to provide multiple access points for these resources and reflect the horizontal relationships between terms, drawing users from all the possible logical places of a resource to its physical location (on the bookshelf). In the Web environment, such a task becomes much easier. Hyperlinks bring users to the actual headings where Web sites are listed at the click of a mouse. In the Yahoo! directory, for example, users looking for information on recreation and sports TV shows can start from either “recreation and sports” or “entertainment.” Starting from “recreation and sports,” they will notice a subclass “television@” at the second level, which links to “recreation and sports” under the “television shows” subclass in the “entertainment” main class. The following notation shows the different paths:

Recreation > Television vs. Entertainment > Television Shows > Recreation and Sports
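The two paths above can be modeled with a small data structure. This is only a sketch: the nested-dictionary layout, the convention that a name ending in “@” stores the canonical path it points to, and the `resolve` function are assumptions made for illustration, not a description of how Yahoo! implemented its directory. It also assumes a cross-reference target does not itself pass through another cross-reference.

```python
# Hypothetical fragment of a directory tree. A tuple value marks a
# cross-reference and holds the canonical path of the heading it points to.
DIRECTORY = {
    "Recreation and Sports": {
        "Television@": ("Entertainment", "Television Shows", "Recreation and Sports"),
    },
    "Entertainment": {
        "Television Shows": {
            "Recreation and Sports": {},
        },
    },
}


def resolve(path, tree=DIRECTORY):
    """Follow a browsing path and return the canonical path it ends at."""
    node = tree
    canonical = []
    for name in path:
        child = node[name]
        if isinstance(child, tuple):
            # Cross-reference: jump to the target heading from the root.
            canonical = list(child)
            node = tree
            for step in canonical:
                node = node[step]
        else:
            canonical.append(name)
            node = child
    return tuple(canonical)


via_link = resolve(("Recreation and Sports", "Television@"))
direct = resolve(("Entertainment", "Television Shows", "Recreation and Sports"))
print(via_link == direct)
```

Both starting points end at the same heading, so the listing of sites is kept in one place while remaining reachable from several logical positions.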

Obviously, using hyperlink technology to handle the horizontal relationships between classes offers great flexibility in organizing resources. At the same time, care must be taken not to overuse the technology. In some existing Internet classifications, hyperlinks are used haphazardly and the citation order changes from place to place. On the one hand, this exposes the classification to the danger of logical chaos; on the other hand, it makes it harder for users (and subsequent classifiers as well) to become familiar with the hierarchical structure. In this regard, a balance should be sought between the rigid, but neat, partitioning of the information space brought by library classifications and the flexibility offered by Internet classifications.

Figure 5. Subdivisions under “Arts” in Open Directory Project

CONCLUSION AND FUTURE TRENDS

Conclusion

The comparison between Web directories and library classifications leads to several important findings. First, the Internet classifications better reflect the distribution of information resources on the Internet and the frequency of usage. By using a topic-based approach in designing classification hierarchies, they do not scatter related materials in the way discipline-based library classification schemes typically do. Second, Web directories use more popular terms as class headings, which correspond to the kind of information the majority of users search for on the Internet. Therefore, they are more readily accepted. Third, Web technology enables the Web directories to easily offer multiple access points to users looking for information. This flexibility greatly reduces the users’ burden of deciding on a single starting point or shifting between various possible access terms, as in library classifications. In these respects, the Web directories work better than library classification schemes in organizing and providing improved access to Internet resources.

On the other hand, library classifications have the advantage of logical soundness. With only one principle used in each division, all the classes at a specific level are mutually exclusive. Therefore, the hierarchical structure is neat and clear. In addition, with constant revisions over several decades, the major universal classifications, like DDC and LCC, and some subject-specific classification schemes offer a valuable depiction of the structure of knowledge. They are certainly ideal sources of inspiration for the designers of Internet classifications.

Future Trends

To combine the strengths of both classifications, one possible improvement is to use different approaches to serve people with different types of information-seeking tasks. If someone looks for information to satisfy his/her day-to-day needs and interests, the topic-based approach (as in most existing Web directories) may be appropriate, so long as it follows a clear principle of division and a consistent citation order throughout the hierarchy. On the other hand, the interests of serious academic and professional users will probably be better served by means of a discipline-based classification, such as the library classification schemes. Further research is needed to find out how the browsing structure influences different types of users in their information-seeking behavior.

Another recommendation is to construct more topic-specific clearinghouses, instead of all-inclusive Web portals. It should be clearly understood that using human-constructed directories inevitably sacrifices the comprehensiveness of information. When weighing the impact of this sacrifice, we must again consider the characteristics of the Web and the needs of the users. Unlike a doctoral student who scours all available library collections to exhaust the coverage on a topic, most Web users want just a few good results each time they search the Web. Topical clearinghouses that point to quality information are designed to serve such information needs, and may also hold even more entries for their subjects than are available through comprehensive indexes (Hubbard, 1999). Designers of Web portals can therefore spend more effort collecting high-quality topical clearinghouses and organizing them in well-defined classification structures, instead of organizing all Web resources by themselves.

KEY TERMS

Classification: Classification is the partitioning of experience into meaningful clusters.

Information Retrieval: Information retrieval is the art and science of searching for information in documents, searching for documents themselves, searching for metadata that describe documents, or searching within databases, whether relational stand-alone databases or hypertext networked databases such as the Internet or intranets, for text, sound, images, or data.

Library Classification: A library classification is a system of coding and organizing library materials (serials, audiovisual materials, computer files, maps, manuscripts, etc.) according to their subject. A classification consists of tables of subject headings and classification schedules used to assign a class number to each item being classified, based on that item’s subject.

Search Engine: Internet search engines (e.g., Google, AltaVista) help users find Web pages on a given subject. The search engines maintain databases of Web sites and use programs (often referred to as “spiders” or “robots”) to collect information, which is then indexed by the search engine.

Subject Heading: A word or phrase, from a controlled vocabulary, that is used to describe the subject of a document. The most commonly used subject headings in libraries are the Library of Congress Subject Headings (LCSH).

Web Directory: A Web directory is a Web-based catalog of information, typically organized by human editors. A directory is to the Internet as the table of contents is to a topic. Directories range from white and yellow pages for finding people and businesses to specialized directories for individual subjects and markets.

Web Portal: A Web portal is a Web site that provides a starting point or gateway to other resources on the Internet or an intranet.
