Operational Aspects (Digital Library)

Certain aspects of a digital library site need to be determined individually for each installation. When Greenstone is installed, default values are given to many parameters. Some, like the directory where the software is kept and the HTTP address of key folders, define the whereabouts of the system: what logical space it occupies. Others control what users can do with the system, such as the languages that the user interface makes available, and switch on various components. You need to know what facilities exist and how to turn them on and off.

Whether logs of user activity are kept and whether Web browser cookies are used to identify users are other important operational aspects. Greenstone has a full logging capability, but it is switched off by default to avoid the creation of large, growing files. In many environments it is essential to record usage logs in order to justify the existence of the system.

Greenstone includes a Web-based administration facility that gives information about the entire system, including all collections it offers. This facility is off by default. Perhaps its most useful function is to allow the system administrator to define users and groups of users who are allowed different privileges, such as access to protected collections and documents within collections. Password protection is just one of many techniques used to bolster security when operating in a network environment, especially if the computer is connected to the Internet. Depending on the level of sensitivity of the digital library content, you may wish to consider using additional measures, such as protecting the computer behind a firewall and sending all communication over HTTPS rather than HTTP.


Configuration files

There are two configuration files: a site configuration file (gsdlsite.cfg), which is found in Greenstone’s cgi-bin folder, and a main configuration file (main.cfg), found in the top-level etc folder. Both are ordinary text files that can be edited to tailor the individual installation. Moreover, they contain extensive comments that describe what the options are and how to use them.

Once the software has been installed and is working, you are unlikely to need to change the site configuration file—unless you move all the files or change HTTP addresses. The options available in the main configuration file are more interesting. We will not go into details here—you can easily look at the actual file itself for this information—but here is a synopsis of what can be done:

• Log all usage (see next subsection).

• Use cookies to identify users in the logs.

• Enable the administration facility (see next subsection).

• Enable the institutional repository component (see Section 11.6).

• Enable the Remote Librarian applet (see Section 11.6).

• Select which languages are enabled in the user interface.

• Determine the encodings to be used for the user interface.

• Set defaults for built-in CGI arguments—for example, the default interface language.

The best way to learn about the configuration options is to experiment with the main.cfg file itself. Changes take effect immediately unless you are using the Windows Local Library version (the default for Windows), in which case the server must be restarted before configuration changes take effect.

Encoding statements specify different types of character encoding that can be selected. The UTF-8 version of Unicode (see Sections 4.1 and 8.2), which has standard ASCII as a subset, is handled internally and should always be enabled. But there are many other possible encodings: for example, traditional Chinese text is often represented in "Big-5." The main configuration file specifies many encodings; most are disabled but can be restored by removing the comment character (#). The main configuration file also contains detailed documentation about the structure of encoding statements.

Logging

Three kinds of log are maintained in the etc folder: a usage log, an error log, and an events log. The first is the most interesting. The error log, which is permanently enabled, contains messages relating to initialization and operational errors: it is of interest only to people maintaining the software. The events log relates to an obsolete subsystem and will not be discussed.

All user activity—every page that each user visits—can be recorded in the usage log (etc\usage.txt), although no personal names are included. Each action is effectively defined by the arguments in the URL ("CGI arguments") that characterize it, and these are what are logged. Disabled by default, logging is enabled by switches: one switch turns logging on and off, and another assigns unique identification codes (cookies) to users, which enables their interactions to be traced through the log file.

Each line in the log file records a single page visit. Entries have a time-stamp, the address of the user’s computer, details about the Web browser used, and the arguments that the CGI mechanism transmits to Greenstone. The main configuration file also includes a switch that sets the format used for the time-stamp: local time in the format "Fri Oct 17 15:57:28 NZDT 2008", Greenwich Mean Time (UTC) in the same format, or an integer representing the number of seconds since 01/01/1970 GMT.
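The relationship between the three time-stamp forms can be sketched in a few lines of Python. This is only an illustration, not Greenstone code: the function name render_stamp, the layout string, and the fixed UTC+13 offset standing in for NZDT are all assumptions made for the example.

```python
from datetime import datetime, timedelta, timezone

# Layout matching the example in the text: "Fri Oct 17 15:57:28 NZDT 2008".
STAMP_LAYOUT = "%a %b %d %H:%M:%S %Z %Y"

# Fixed-offset stand-in for New Zealand Daylight Time (UTC+13).
# An assumption for this example; real local time would come from the server.
NZDT = timezone(timedelta(hours=13), "NZDT")


def render_stamp(epoch_seconds: int, zone: timezone) -> str:
    """Render seconds since 01/01/1970 GMT as a log-style time-stamp."""
    return datetime.fromtimestamp(epoch_seconds, tz=zone).strftime(STAMP_LAYOUT)


if __name__ == "__main__":
    moment = 1224212248                        # integer form
    print(render_stamp(moment, NZDT))          # local-time form
    print(render_stamp(moment, timezone.utc))  # UTC form
```

The integer form is the easiest to sort and compare when analyzing a log, while the two textual forms are easier to read by eye.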

Figure 11.2: Entry in the usage log

Figure 11.2 shows a sample entry, split into these components. On 17 Oct 2008 a user at massey.ac.nz displayed a page (action a=p) that is the home page (page p=home) of the Maori newspaper collection (collection c=niupepa). Many of the other arguments have default values: for example, the language is English (l=en) and 20 search results will be displayed per page (o=20). The user’s browser is Firefox. The last argument, z, is a cookie generated by the user’s browser: it contains the computer’s IP address followed by the time that the user first accessed the digital library. (The z argument appears only if cookies are enabled in Greenstone.)

When logging is enabled, every action by every user is logged—even the Web pages generated to inspect the log files.
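When analyzing a usage log, the CGI arguments of an entry can be pulled apart with standard tools. The following Python sketch assumes the arguments have already been extracted from a log line as an ordinary URL query string joined with "&"; the exact layout of a log line is not reproduced here, and the function name cgi_arguments is purely illustrative.

```python
from urllib.parse import parse_qsl


def cgi_arguments(query: str) -> dict[str, str]:
    """Split a query string such as 'a=p&p=home' into a name-to-value mapping."""
    # keep_blank_values retains arguments that are present but empty.
    return dict(parse_qsl(query, keep_blank_values=True))


if __name__ == "__main__":
    # Argument values taken from the sample entry discussed above.
    args = cgi_arguments("a=p&p=home&c=niupepa&l=en&o=20")
    print("action:", args["a"])
    print("page:", args["p"])
    print("collection:", args["c"])
    print("language:", args["l"])
    print("results per page:", args["o"])
```

The same approach can be used to read the d argument from a document's URL when you need its identifier.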

Administration facility

Greenstone’s administration pages display the installation’s configuration files and allow them to be modified. They let you examine the log that records usage and the log that records internal errors. They are available over the Web, so you can use them anywhere. However, the facilities are rudimentary, because Web forms are used for interaction. If you need to edit or examine these files, it is probably best to log into the computer that is serving Greenstone and work with an ordinary text editor. If the hyperlinked button Administration Page does not appear on the home page (beneath the available collections), then you must edit the main configuration file main.cfg to enable the administration facility—simply locate the appropriate line (search for Administration) and change the status value from disabled to enabled.

The facility is most useful when you need to define user groups with different privileges. For example, it is possible to restrict access to certain documents, or to certain collections, to particular users. Also, the Remote Librarian interface authenticates users before allowing them to alter the structure and content of particular collections, as does the institutional repository facility (both are described in Section 11.6). Of course, the ability to define new users and user groups is restricted to people who have been authorized to act as system administrator.

Authentication

When the Greenstone software is installed, there is the option to create a user called admin and set its password. Because a malicious user anywhere in the world who manages to crack this password could wreak havoc with your digital library, we advise caution in using this feature. For the same reason, the feature is off by default in the installer, so activating it requires a conscious decision by the person installing the software.

In order to investigate the authentication scheme, enable it by editing the main configuration file, go to the Greenstone home page, and refresh the browser window or restart the Local Library server if that is what you are using. A new line appears on the home page, beneath the collections, that refers to the administration facility. Click the button that leads to the Administration page.

On the left of this page are menu items for configuration files, logs, user management, and technical information. User management is the most useful: it allows you to list users, add new users, and change your password. If you attempt any of these, you will need to sign in as the admin user. If not set at installation time, the default password is admin.

Each user can belong to any number of groups. When the software is installed, there are three groups: administrator, all-collections-editor, and demo. Members of the first group can add and remove users and change their groups; the second group is connected with the Remote Librarian facility discussed in Section 11.6; while the third group is mentioned in the next section. Groups are simply text strings, and you can add them at will: just type them into the "groups" box associated with each user. The admin user can also disable users if they misbehave. Information about users, passwords, and groups is recorded in a database in the Greenstone file structure (etc\users.gdb); passwords are encrypted using the well-known crypt utility.

Protecting a collection

It is sometimes necessary to protect digital library collections, or certain parts of them, from the public eye. For example, it may be necessary to restrict PDF files to use only within an organization but allow open access to the extracted HTML, or to keep images private but provide open access to thumbnails. The safest way to do this is to use the authentication facilities of your Web server. Most popular Web servers (e.g., Apache) can be configured to protect parts of the file system so that private files placed there cannot be accessed unless users authenticate themselves first.

Greenstone’s authentication scheme is another way of controlling access to particular documents or collections. While less powerful than the capabilities provided by a Web server like Apache, it does have the advantage that it is much easier to learn. Authentication works in two stages. First determine what to restrict access to; second, if access is to be restricted, determine which users are to have it. Access can be restricted either to the collection as a whole or to individual documents in it. In the latter case, the documents are specified individually.

The Librarian interface does not yet provide this ability; instead, you have to edit the relevant collection’s configuration file manually (collect.cfg in the collection’s etc folder). Authentication is activated by a line that begins with the directive authenticate and is followed by collection or document depending on whether it applies to the full collection or to individual documents (the default is collection). If authentication is by document, you can specify a list of private documents or a list of public documents. In the former case, all other documents are public; in the latter, all others are private. This is done using directives private_documents or public_documents—use one or the other, but not both. The documents themselves are specified using identifiers separated by spaces.

Figure 11.3: authentication lines from a collection configuration file 

Figure 11.4: authentication in Greenstone

Figure 11.3 shows an example of the per-document authentication. It uses Greenstone document identifiers, but other identifiers can be specified when building the collection (see Section 11.4). The easiest way to determine the identifier for a given document is to locate it in the collection and look at the d argument in its Greenstone URL.

The second part of the process uses the auth_groups directive to specify the groups that are permitted access. It is followed by a group name (or a list of group names separated by spaces). The lines from the collection configuration file in Figure 11.3 restrict access to all documents except two to members of the demo group; those two documents are public. As noted above, you can define groups and add members to them from the administration pages.
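The two-stage decision described above can be modeled in a short Python sketch. It is a reading of the rules as stated in the text, not Greenstone's actual implementation: the class AccessRules, the functions needs_login and may_view, and the document identifiers DOC_A, DOC_B, and DOC_C are hypothetical. The treatment of document-level authentication with neither list given is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AccessRules:
    """Hypothetical model of a collection's authentication directives."""
    authenticate: str = "collection"            # "collection" or "document"
    private_documents: frozenset = frozenset()  # use one list or the other
    public_documents: frozenset = frozenset()
    auth_groups: frozenset = frozenset()


def needs_login(rules: AccessRules, doc_id: str) -> bool:
    """Stage 1: is access to this document restricted at all?"""
    if rules.authenticate == "collection":
        return True                             # the whole collection is protected
    if rules.private_documents:
        return doc_id in rules.private_documents      # all others are public
    if rules.public_documents:
        return doc_id not in rules.public_documents   # all others are private
    return False                                # assumption: no list, nothing protected


def may_view(rules: AccessRules, doc_id: str, user_groups: set) -> bool:
    """Stage 2: if restricted, does the user belong to a permitted group?"""
    if not needs_login(rules, doc_id):
        return True
    return bool(rules.auth_groups & user_groups)


if __name__ == "__main__":
    # Mirrors the situation described for Figure 11.3: two public documents,
    # everything else restricted to members of the demo group.
    rules = AccessRules(
        authenticate="document",
        public_documents=frozenset({"DOC_A", "DOC_B"}),
        auth_groups=frozenset({"demo"}),
    )
    print(may_view(rules, "DOC_A", set()))     # public document, anonymous user
    print(may_view(rules, "DOC_C", set()))     # protected document, anonymous user
    print(may_view(rules, "DOC_C", {"demo"}))  # protected document, demo member
```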

When users try to access a protected document, they are asked for a user name and password as shown in Figure 11.4. This screenshot, and Figure 11.3, are taken from the Authentication Demo collection, available at nzdl.org. Specifying access control this way is a clunky feature that is little used in practice. Instead of having to specify an explicit list of document identifiers, it would be better to control authentication using a metadata value, and this facility is planned for a future version of the Librarian interface.
