Inside Greenstone (Digital Library)

Where does the digital library system live on your computer disk? And, why would you want to know? Sometimes you need to find certain configuration files in order to control various options and facilities. Since the software is open source, you can examine any part of the system—including the programs themselves.

Most likely you are not interested in the program code, but you probably will want to know where the collections themselves are kept. Greenstone collections are self-contained nuggets of information. You can give them to others by placing them on a USB flash drive or CD-ROM and having the recipient copy them into the appropriate part of their digital library installation.

But suppose they don’t have the Greenstone software? It’s possible—trivial, in fact—to take any collection or set of collections and write them to CD-ROM or DVD in such a way that users with no Greenstone installation and no Internet connection can access them just as though they were on the Web. These disks contain a mini version of the software that installs on any Windows computer in a matter of moments and includes all the searching, browsing, and multi-lingual interface facilities that the full system offers.

But before we begin, here’s a brief plea: Please keep your Greenstone installation up to date!

Updating the software

In contrast to commercial software, open source software is characterized by frequent releases containing bug fixes and new features. New releases of Greenstone are typically issued every few months. If you can’t wait to get your hands on a new feature, a snapshot of the system is generated automatically every day.


It is very easy to upgrade to a newer version. Before doing so, however, ensure that the computer is not running the Librarian interface or the Greenstone server. Close the Greenstone server by selecting the disk icon in the task bar (to bring the window to the front) and then clicking the exit hotspot, usually marked as a red cross. The re-installation procedure is exactly the same as the original installation. Greenstone installers check for the existence of previous versions and ask you to uninstall them if necessary; on Windows, uninstalling is done either through the Control Panel or by selecting the relevant Greenstone item on the Start menu. At the end of the uninstall procedure you will be asked whether you want all your collections removed: say no if you wish to preserve your work.

Occasionally, problems are encountered if older installations are not fully removed. To clean up your system—having already run the installer—move Greenstone’s collect folder, which contains all your collections, to the desktop (or somewhere else that is convenient for you, such as My Documents). Then check any places where Greenstone has been installed previously and delete them.

Files and folders

Before going any further, you need to learn how to find your way around the software. Figure 11.1 shows the structure of the Greenstone home directory, including one collection, the Demo collection.

First, locate your Greenstone installation. On Windows, by default it is installed in C:\Program Files\Greenstone if the user had full administration rights when it was installed; otherwise C:\Documents and Settings\<username>\Greenstone is used for Windows XP and C:\Users\<username>\Greenstone for Windows Vista. Comparable logic applies to Mac and Linux distributions: /usr/local/Greenstone is used by default when a system administrator installs the software; otherwise /home/<username>/Greenstone is used. We call this the GSDLHOME folder. It contains several sub-folders, illustrated in Figure 11.1. The most important one is called collect, which holds all the collections in your installation. Inside it is a folder called demo, and many other folders, too, if you have worked through the exercises in the previous topic. Each collection is self-contained and corresponds to a single sub-folder of collect. We describe the collect folder soon.
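The default locations above can be checked programmatically. The following is a minimal Python sketch, not part of Greenstone itself: the function names default_gsdlhome_candidates and find_gsdlhome are hypothetical, and the test for "this is a Greenstone home" (the presence of a collect sub-folder) is an assumption based on the description in the text.

```python
from pathlib import Path, PurePosixPath, PureWindowsPath


def default_gsdlhome_candidates(username, platform):
    """Return the default install locations named in the text.

    `platform` is "windows" or anything else (treated as Mac/Linux).
    The administrator-level location is listed first.
    """
    if platform == "windows":
        return [
            str(PureWindowsPath("C:/Program Files/Greenstone")),
            # Windows XP per-user location
            str(PureWindowsPath("C:/Documents and Settings", username, "Greenstone")),
            # Windows Vista per-user location
            str(PureWindowsPath("C:/Users", username, "Greenstone")),
        ]
    return [
        str(PurePosixPath("/usr/local/Greenstone")),
        str(PurePosixPath("/home", username, "Greenstone")),
    ]


def find_gsdlhome(candidates):
    """Return the first candidate that contains a collect sub-folder, else None.

    Assumption: a folder counts as GSDLHOME if it has a collect sub-folder.
    """
    for candidate in candidates:
        path = Path(candidate)
        if (path / "collect").is_dir():
            return path
    return None
```

For example, find_gsdlhome(default_gsdlhome_candidates("alice", "linux")) returns the first of the two Linux defaults that exists on the machine, or None if Greenstone was installed somewhere non-standard.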

Before we do so, check out the remaining folders of Figure 11.1. Some hold program code. For example, bin (short for executable binary) contains the programs that are used in the building process. Binary programs are held in the folder linux for Linux, windows for Microsoft Windows, and darwin for MacOS X (which is the name Apple gave to its version of Unix). Typically, installations contain binaries only for the platform they are running on—except when an installation, say on a USB flash drive, is set up to run on multiple platforms. Figure 11.1 is based on a Linux installation.

The script sub-folder holds the programs used to create, build, and rebuild collections, which are written in the Perl programming language and therefore platform independent. The same is true for the programs in the java folder. The perllib folder near the end contains program modules that are used by the building scripts. Plug-ins and classifiers are placed in the corresponding sub-folders. (Plug-ins are discussed further in Section 11.4.) The cgi-bin folder contains the Greenstone runtime system that works with a Web server. In the Local Library version of Greenstone on Windows, the necessary runtime executable (server.exe) is placed in the top-level GSDLHOME folder.


Figure 11.1: Structure of the Greenstone home directory (abridged)

What about the source code? The runtime system is written in the C++ language. The build-time code is primarily written in Perl but calls upon some external C modules. The source code is distributed across three folders: build-src, runtime-src, and common-src, the last containing code that both parts of Greenstone need. Code specifically written for Greenstone is located within each of these folders in a sub-folder called src, and any third-party software (all distributed under the GNU General Public License and other compatible licenses) is located in a sub-folder named packages.

The main Greenstone source code in the runtime-src folder comprises the "collection server" (colservr in Figure 11.1), the "receptionist" (recpt), and the "protocol" they share (not shown). An example of source code in build-src is hashfile, used to compute document IDs, and in common-src a general purpose library lib is provided that reads Greenstone configuration files and represents and manipulates strings in Unicode format, among other things.

The functions of third-party packages vary widely. The packages folder of build-src contains (among many other things) a program that converts from HTML to XML (html-tidy in Figure 11.1), a Web mirroring program (wget), and a utility that converts PDF documents to HTML (pdf2html). The packages folder of common-src contains Expat, a utility for parsing XML, and GDBM, a standard database manager program. The packages folder of runtime-src contains software related to the Apache Web server (apache-httpd in Figure 11.1) and a package dealing with the Z39.50 protocol (yaz). Each package is stored in a folder of its own, with a readme file that gives more information about the package. Resulting executable programs are placed in bin—in the sub-folder corresponding to their underlying operating system—when the software is compiled.

Full-text indexing is central to Greenstone. There are three alternative indexers, called MG, MGPP, and Lucene; near the end of Section 11.4 we show how to switch between them within the Librarian interface. They are stored in a top-level folder of the common-src folder called indexers (not shown). Within this is a packages folder that contains any third-party software that the indexers use.

The mappings folder contains Unicode translation tables. The etc folder holds configuration files for the system. It also includes initialization and error logs, and the user authorization database. Inside the top-level web folder, images stores images for the user interface, among them icons like those shown in Table 10.1, and style stores the cascading style sheet (CSS) files. The user interface is constructed by small code fragments called macros, and these are placed in the macros folder. Depending on the type of installation and how it has been configured you may also have the following top-level folders: apache-httpd which contains a Web server, gli which contains the Librarian interface, tmp for storing temporary files, docs which contains the documentation for the system, and packages which contains a Java runtime environment.

The Librarian interface maintains a small amount of information that is specific to each particular user. This includes the user’s Preferences (accessed from its File menu) and the cache that is used when downloading (see Section 11.5). You will find this information in C:\Documents and Settings\<username>\Application Data\Greenstone\GLI on Windows XP and C:\Users\<username>\Application Data\Greenstone\GLI on Windows Vista. (This is for Windows; on Mac and Unix systems it is in a folder called .gli within the user’s home directory.)

Collections

Each collection corresponds to a sub-folder of collect. Collections are completely self-contained. For example, if you have created a collection—perhaps one of the examples in topic 10—and you want to give it to someone else who is also running Greenstone, just locate the appropriate collect sub-folder, put it on a USB flash drive, take it to their computer, and transfer it into their collect folder. It will appear right away on their Greenstone home page. (Windows Local Library users will have to restart Greenstone first.)
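Because a collection is just a sub-folder of collect, handing one over amounts to a recursive folder copy. Here is a minimal Python sketch of that step; the function name copy_collection and its arguments are hypothetical, and the refusal to overwrite an existing collection of the same name is a design choice of this sketch rather than Greenstone behavior.

```python
import shutil
from pathlib import Path


def copy_collection(source_collect, name, target_collect):
    """Copy the collection sub-folder `name` from one collect folder to another.

    Returns the path of the new copy. Raises if the source collection is
    missing or if the target already holds a collection with that name.
    """
    src = Path(source_collect) / name
    dst = Path(target_collect) / name
    if not src.is_dir():
        raise FileNotFoundError(f"no collection folder at {src}")
    if dst.exists():
        raise FileExistsError(f"{dst} already exists; not overwriting")
    shutil.copytree(src, dst)  # copies the whole self-contained folder
    return dst
```

The same call works whether the source is a USB flash drive or another installation's collect folder, since both are ordinary directories.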

You may also need to move collections around if the software has previously been installed in a non-standard place. Old collections can be transferred to the new installation by moving them from the collect folder in the old place into the folder GSDLHOME\collect. (We use Windows terminology because this is the most popular platform for Greenstone in practice.)

Figure 11.1 shows the structure of the Demo collection—it’s the same for all collections. The import folder is where the original source material is placed, and the archives folder is where the result of the import process goes (the first stage of building a collection). The building folder is used temporarily during building (the second stage), whereupon its contents are moved into index. The index folder is where the result of the entire building process is placed, and contains all the information that is served to readers. The etc folder contains miscellaneous files, such as configuration and mapping files, and logging information. The metadata sets used by the collection go into a folder of the same name, as do any log files that are written every time the collection is built. Some collections have additional folders: one for images that are used in the collection, another for collection-specific macros, and a third for any special Perl modules that pertain to this collection.

Some of these folders may be absent. Most can be deleted once the collection is built; all the information required to serve the collection is in index and etc (and, if present, images and macros). However, then the collection could not be rebuilt—nor loaded into the Librarian interface. Section 11.3 describes the building process in more detail and also gives further information about the role of the various folders.
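To see which of these folders a given collection actually has, a short check suffices. The sketch below is illustrative only: describe_collection is a hypothetical helper, the "servable" rule (index and etc both present) follows the statement above, and treating the presence of import as a sign that the source material is still available is an assumption of this sketch.

```python
from pathlib import Path

# Folder names taken from the description of the Demo collection.
STANDARD_FOLDERS = ("import", "archives", "building", "index", "etc")


def describe_collection(collection_dir):
    """Report which standard folders exist in one collection folder."""
    root = Path(collection_dir)
    present = {name: (root / name).is_dir() for name in STANDARD_FOLDERS}
    return {
        "present": present,
        # index and etc hold everything needed to serve the collection
        "servable": present["index"] and present["etc"],
        # assumption: without import the original source material is gone
        "has_source": present["import"],
    }
```

A collection that reports servable but not has_source can still be read by users, but, as noted above, it can no longer be rebuilt.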

Just as the top-level etc folder in the Greenstone file structure holds configuration information for the system as a whole, there is an etc folder for each collection. Configuration information for the collection is recorded in a file there called collect.cfg, along with other miscellaneous information. It records the result of the collection design and formatting process—indeed the details displayed in the main panel of the Librarian interface about a collection’s plug-ins, indexes, classifiers, and format statements are exactly what make up this file. It is plain text: choose any collection and take a look at its etc\collect.cfg.
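Since collect.cfg is plain text, it can also be inspected with a few lines of code. The following Python sketch groups the file's lines by their first word. It assumes, without confirmation from the text above, that each non-blank line starts with a keyword followed by its arguments and that lines beginning with # are comments; read_collect_cfg is a hypothetical name, and continuation lines, if the format has them, are not handled.

```python
from collections import defaultdict
from pathlib import Path


def read_collect_cfg(path):
    """Group the lines of a collect.cfg-style file by their first word.

    Returns a dict mapping each keyword to the list of argument strings
    found on lines starting with that keyword, in file order.
    """
    entries = defaultdict(list)
    for raw_line in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):  # skip blanks and assumed comments
            continue
        parts = line.split(None, 1)  # keyword, then the rest of the line
        keyword = parts[0]
        arguments = parts[1] if len(parts) > 1 else ""
        entries[keyword].append(arguments)
    return dict(entries)
```

Calling read_collect_cfg on a collection's etc\collect.cfg gives a quick overview of which kinds of entries the file contains and how many of each there are.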

Greenstone CD-ROM/DVDs

You can give out a Greenstone collection on a USB flash drive, or even as a zipped e-mail attachment containing the etc and index folders if the collection is small. Of course, the recipient must be running the Greenstone software in order to make use of it.
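Packing just the etc and index folders into a zip file can be sketched as follows. This is an illustrative Python example, not a Greenstone tool: zip_collection_for_sharing is a hypothetical name, and storing the files under the collection's own folder name inside the archive is a choice made here so that the recipient can unzip directly into their collect folder.

```python
import zipfile
from pathlib import Path


def zip_collection_for_sharing(collection_dir, zip_path):
    """Write the etc and index folders of a collection into a zip file.

    Archive entries are stored as <collection name>/etc/... and
    <collection name>/index/... using forward slashes.
    """
    collection_dir = Path(collection_dir)
    subfolders = [collection_dir / "etc", collection_dir / "index"]
    for folder in subfolders:  # check first so no partial zip is written
        if not folder.is_dir():
            raise FileNotFoundError(f"missing folder: {folder}")
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for folder in subfolders:
            for file_path in sorted(folder.rglob("*")):
                if file_path.is_file():
                    arcname = file_path.relative_to(collection_dir.parent).as_posix()
                    archive.write(file_path, arcname)
    return Path(zip_path)
```

A collection shared this way can be served but not rebuilt, because the import and archives folders are left out.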

Alternatively, collections can be published as self-installing Windows CD-ROMs or DVDs. These are disks that begin the installation process as soon as they are placed in the drive. They do not install the full Greenstone software, just a mini version that allows users to view existing collections but not build new ones. An installation option lets you choose whether to install all the collection files onto your computer disk, or just the software, in which case the CD-ROM/DVD must be present in the disk drive whenever the collections are used. The former option takes more time to install but responds more quickly to the reader’s requests. Either way, interaction (including all browsing and searching, changing preferences, switching languages, etc.) is just the same as on the Web, except that response times are more consistent.

It is very easy to create a self-installing disk containing your own collections. Enter the Librarian interface and choose File→Write CD/DVD image. Select the collections you wish to export by ticking their check boxes in the window that pops up. If you enter a name for the disk it will appear in the menu when the CD-ROM/DVD is run; otherwise it will be called "Greenstone Collections." You can choose whether the CD-ROM/DVD runs directly from the disk drive or installs some files onto the computer first.

Click Write CD/DVD image to start the export process. This puts files into a temporary folder called exported_xxx (or some such); the interface tells you where it is. The process involves copying many files and may take a few minutes. You need to use your own computer’s software to write the generated files to CD-ROM/DVD. On modern computers this capability is built into the operating system: just insert a blank disk into the drive and drag the contents of exported_xxx into the folder that represents the disk. It is equally simple to plug in your USB flash drive or portable media player and use it in disk-mode.

Collections installed from prepackaged Greenstone disks do not reside in the standard collect folder but in C:\GSDL\collect. To amalgamate them with your main Greenstone installation, move them into GSDLHOME\collect. After you’ve done this, the mini version of the software that runs the prepackaged collections is no longer necessary: you can uninstall it from the Greenstone section of the Windows Start menu.
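The amalgamation step can be scripted when there are many collections to move. The Python sketch below is a hedged illustration: merge_collections is a hypothetical helper, and skipping any collection whose name already exists in the destination (rather than overwriting it) is a precaution chosen for this sketch.

```python
import shutil
from pathlib import Path


def merge_collections(source_collect, target_collect):
    """Move every collection folder from one collect folder into another.

    Returns (moved, skipped): the names that were moved, and the names
    left in place because the target already has a folder of that name.
    """
    source = Path(source_collect)
    target = Path(target_collect)
    target.mkdir(parents=True, exist_ok=True)
    moved, skipped = [], []
    for entry in sorted(source.iterdir()):
        if not entry.is_dir():
            continue  # only sub-folders are collections
        destination = target / entry.name
        if destination.exists():
            skipped.append(entry.name)
            continue
        shutil.move(str(entry), str(destination))
        moved.append(entry.name)
    return moved, skipped
```

On Windows the call would look like merge_collections(r"C:\GSDL\collect", gsdlhome / "collect"), where gsdlhome is whatever path your own installation uses.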
