Command-Line Operation (Digital Library)

Collections can be built without involving the Librarian interface at all; the interface simply provides a user-friendly way of working. The Gather panel copies documents into the appropriate import folder. The Enrich panel edits metadata files (metadata.xml, described in Section 10.5). The Design and Format panels edit the collection configuration file collect.cfg. Most importantly, the Create panel invokes standalone programs that do the actual work.

Walking through the operations involved in building a collection without using the Librarian interface at all will show you what is going on behind the scenes. We use the Small HTML collection that we built in Section 10.3 as an example, and on the way we take the opportunity to explain some general features and design principles, as well as the specific steps involved.

Building a collection is the process of taking a set of documents and metadata and creating all the indexes and data structures that support whatever searching, browsing, and viewing operations the collection offers. It breaks down into four phases. First, a skeleton framework is made for the collection. Then documents are imported into a standard representation from whatever format they are provided in. Next the required indexes are built, and finally the collection is installed, so that it becomes operational.

We refer to these operations as make, import, build, and install, respectively, and each one is performed by a simple computer command. The terminology is potentially confusing: we say make because the English language does not have a specific verb for creating a skeleton framework, and, worse still, build is used in two different senses, one that encompasses the whole process and another that refers to the particular sub-process of building indexes. Provided the distinction is kept in mind, it should be clear from the context whether build refers to the general process of building-in-the-large or the specific one of building-in-the-small.


Getting started

We recommend that you follow this walk-through on your own computer. We use Windows terminology, but the process for Mac and Linux is nearly identical. Some operations may seem unnecessary, but their role becomes clear later. Remember that our purpose is not to provide a streamlined way of building collections—for that, use the Librarian interface—but to explain the collection-building process. Table 11.1 summarizes the procedure, for reference, and also mentions any differences between Windows and Mac/Linux.

First locate the command prompt, which is where you type commands. This differs from one system to another, but on Windows XP look in the Start menu. Invoke the Run entry and type cmd in the dialog box. Change into the directory in which the software was installed by typing

cd "…"

where … is the actual installation folder, which we have been calling GSDLHOME. (The quotation marks are there to protect any spaces in the folder name, which is necessary on some systems.) Next, type

setup.bat

This batch file (which is quite short; read it if you like) tells the system where to look for programs and other parts of the digital library file structure by setting the system variable GSDLHOME to the Greenstone home directory. To return to this place later, type

[cd command shown as an image in the original; not preserved]

(again, the quotation marks are there to protect spaces in the file name). If you close the command window and open another one, you must invoke setup.bat again.

Table 11.1: The collection-building process

Step | Command or action | Function
1. | cd "…" | Assumes that Greenstone is installed in the default location.
2. | setup.bat | This makes Greenstone programs available. On Mac or Linux, use source ./setup.bash instead.
3. | [not preserved] | Create a skeleton framework of initial files and directories.
4. | [not preserved; path beginning C:] | Populate the collection with sample documents. On Windows, select the files and drag them. On Mac or Linux use the cp command, and if you are copying files from a CD-ROM, you may have to use the mount command first.
5. | [not preserved; path beginning C:] | Customize the collection by editing the collection-level metadata in the configuration file. Alter collectionname, collectionextra, and collectionicon.
6. | perl -S import.pl -removeold mydemo | Convert the source documents and metadata specifications to the Greenstone standard form.
7. | perl -S buildcol.pl -removeold mydemo | Build the indexes and data structures that make the collection work.
8. | Replace the contents of the collection’s index directory with that of the building directory | On Windows, select the files and drag them. On Mac or Linux, use the mv command.
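The role of GSDLHOME can be illustrated with a minimal Python sketch. It is not part of Greenstone: the function names are hypothetical, and the layout GSDLHOME/collect/NAME for a collection's folder is an assumption made for illustration.

```python
import os
from pathlib import Path


def greenstone_home(env=None):
    """Return the Greenstone home folder named by GSDLHOME.

    Raises RuntimeError when the variable is unset, which is the situation
    in a new command window where the setup script has not been run.
    """
    env = os.environ if env is None else env
    home = env.get("GSDLHOME")
    if not home:
        raise RuntimeError("GSDLHOME is not set; run the setup script first")
    return Path(home)


def collection_dir(name, env=None):
    """Return a collection's folder (assumed layout: GSDLHOME/collect/NAME)."""
    return greenstone_home(env) / "collect" / name
```

Passing a dictionary as env makes it possible to try the functions without touching the real environment.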

Making a framework

Now you are in a position to make, import, build, and install collections. The first operation, accomplished by the Perl program mkcol.pl—the name stands for "make a collection"—creates an empty framework. Program names have cryptic abbreviations because Greenstone has traditionally run on ancient versions of Windows that impose an eight-character limit on file and folder names.

Run the program by typing perl -S mkcol.pl. (If your computer is set up to associate the Perl interpreter with files ending in .pl, drop the preamble and simply type mkcol.pl.) All Greenstone programs take at least one argument, the name of the collection being operated on, and running them without arguments prints a helpful message on the screen. This convention has the added bonus of providing up-to-date documentation for an evolving system.

As this message explains, mkcol requires you to specify the collection name. As with the other programs, there is an extensive list of options, each preceded by a minus sign (-), but all have default values, so only a minimum of information needs to be given explicitly.

Use mkcol.pl to create a framework of initial files and folders for the new collection. Assign the collection the name mydemo by typing

[mkcol.pl command naming the collection mydemo; shown as an image in the original and not preserved]

To examine the new file structure, move to the newly created collection directory by typing:

[cd command; shown as an image in the original and not preserved]

List the directory’s contents by typing dir. The mkcol.pl program has created six folders: etc (which contains the default collection configuration file), images (for any collection-specific images), import (ready for the collection’s source material), macros (containing a default collection-specific macro file), script (for any Web browser JavaScript enhancements), and style (for any collection-specific CSS files). The other files depicted in Figure 11.1 for the Demo collection are created automatically later, when they are needed.
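As a quick way to confirm that the skeleton is complete, the following Python sketch checks a collection folder for the six folders named above. The function name is hypothetical, and the check is an illustration rather than anything Greenstone itself provides.

```python
from pathlib import Path

# The six folders the text says mkcol.pl creates.
SKELETON_FOLDERS = ("etc", "images", "import", "macros", "script", "style")


def missing_skeleton_folders(collection_dir):
    """Return the names of skeleton folders absent from collection_dir."""
    root = Path(collection_dir)
    return [name for name in SKELETON_FOLDERS if not (root / name).is_dir()]
```

An empty list means all six folders are present.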

In the new collection’s etc directory is a collection configuration file called collect.cfg, shown in Figure 11.5. The collection name appears in one of the collectionmeta lines, which give metadata concerning the collection as a whole. The file shows the same selection of plug-ins that the Librarian interface includes—not surprisingly, because behind the scenes the Librarian invokes exactly the same program.
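To make the idea of collectionmeta lines concrete, here is a small Python sketch that gathers them into a dictionary. It assumes each line has the shape collectionmeta NAME "VALUE"; that shape, the sample fragment, and the function name are assumptions for illustration, and the real file in Figure 11.5 may contain lines this sketch ignores.

```python
import shlex

# Hypothetical fragment, not the file shown in Figure 11.5.
SAMPLE = '''
# comment lines and other directives are skipped
collectionmeta collectionname "mydemo"
collectionmeta collectionextra ""
'''


def read_collection_meta(config_text):
    """Collect collectionmeta entries from the text of a configuration file.

    Assumed line shape: collectionmeta NAME "VALUE".
    Lines with any other shape are ignored.
    """
    meta = {}
    for line in config_text.splitlines():
        line = line.strip()
        if not line.startswith("collectionmeta"):
            continue
        # shlex.split honours the double quotes around the value.
        parts = shlex.split(line)
        if len(parts) == 3 and parts[0] == "collectionmeta":
            meta[parts[1]] = parts[2]
    return meta
```

Calling read_collection_meta(SAMPLE) yields one entry per collectionmeta line, keyed by the metadata name.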

Importing documents

The next step is to populate the collection with documents. In our case, the source material resides in the simple_html folder in the sample files that you downloaded when working through Chapter 10. All the source material should be placed in the new collection’s import folder. Just copy the simple_html folder (or its contents; it doesn’t matter which) and paste it into the mydemo collection’s import folder. This is precisely what the Librarian interface does in the Gather panel.
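The copy step can also be scripted. The sketch below uses Python's standard shutil module to copy a source folder into a collection's import folder; the function name gather is hypothetical, and the paths are whatever you supply.

```python
import shutil
from pathlib import Path


def gather(source_folder, collection_dir):
    """Copy source_folder into the collection's import folder.

    Returns the path of the copy.
    """
    source = Path(source_folder)
    target = Path(collection_dir) / "import" / source.name
    # dirs_exist_ok (Python 3.8+) allows the copy to be repeated
    # over a folder that already exists.
    shutil.copytree(source, target, dirs_exist_ok=True)
    return target
```

Copying the folder itself, rather than its contents, keeps the source material grouped under one name inside import.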

Now you are ready to perform the import process. This brings the documents into the system, standardizes their format, and extracts metadata from them and from any metadata.xml files that are present. It invokes plug-ins to process the files and extract metadata.

Type perl -S import.pl at the prompt to get a long list of options for the import program, with a brief explanation of each. In fact, if you switch the Librarian interface to Expert mode and go to the Create panel, you will find a list of import options that includes all those implemented by the Perl program—and the tool-tips contain the same explanatory text too. (A few that do not make sense from within the Librarian interface are omitted.) When the build button is pressed, the Librarian interface initiates import.pl (followed by buildcol.pl, described next) to do the work, with the specified set of options.

Next type

perl -S import.pl -removeold mydemo


Figure 11.5: Collection configuration file created by mkcol.pl

Text scrolls past, reporting the progress of the import operation, just as it does when you use the Librarian interface. You do not have to be in any particular folder when the import command is issued because the software works out where everything is from the Greenstone home folder and the collection’s name. The -removeold option forces the collection to be built from scratch. Greenstone also supports incremental building (described below). For now, however, we rebuild collections in their entirety to keep things simple—in fact, this is common practice.

Building indexes

The next step is to build the indexes and data structures that make the collection work. This is building-in-the-small (as opposed to building-in-the-large, which refers to the whole process of making, importing, building, and installing). With a small nod toward the ambiguity of the term build, the relevant program is called buildcol.pl.

This is the stage at which you would most likely customize the new collection by editing its configuration file, as you have done in the Librarian interface’s Design and Format panels. However, just as in the first exercise of Section 10.3, we’ll go straight ahead and "build" the collection.

First type perl -S buildcol.pl at the command prompt for a list of collection-building options (again, a superset of those available to Expert users of the Librarian interface). Then, sticking to the defaults, type

perl -S buildcol.pl -removeold mydemo

Progress-report text scrolls past again, which under normal conditions can be ignored. (Any serious problem will cause the program to be terminated immediately, with an error message.) The -removeold option causes the indexes to be built from scratch.
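The import and build steps can be chained from a script. The following Python sketch assembles the two commands as argument lists and runs them in order. It assumes that perl and the Greenstone programs are reachable from the command prompt (that is, the setup script has been run) and that options precede the collection name; the function names are hypothetical.

```python
import subprocess


def build_commands(collection, remove_old=True):
    """Return the import and build commands, in order, as argument lists."""
    options = ["-removeold"] if remove_old else []
    return [
        ["perl", "-S", program, *options, collection]
        for program in ("import.pl", "buildcol.pl")
    ]


def run_build(collection, remove_old=True):
    """Run import.pl and then buildcol.pl, stopping at the first failure."""
    for argv in build_commands(collection, remove_old):
        # check=True raises CalledProcessError on a non-zero exit status.
        subprocess.run(argv, check=True)
```

Keeping command construction separate from execution means the commands can be inspected or logged before anything is run.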

Installing the collection

Although it has been built, the collection is not yet live—you cannot see it in your digital library. When the buildcol.pl command is used, the files that are generated are located in a special area, and the result must be moved to the proper place before the collection can be used. This is because (once you scale up your digital library operations) some collections may take hours—even days—to build, and during that period the existing version of the collection continues to serve users. Building is done in the building folder, whereas collections are served from the index folder.

To make the collection operational, select the contents of the mydemo collection’s building folder and drag them into the index folder. If index already contains some files, remove them first.
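The drag-and-drop step amounts to emptying index and moving the contents of building into it. A minimal Python sketch of that operation follows; the function name is hypothetical, and it assumes the building and index folders sit directly inside the collection folder.

```python
import shutil
from pathlib import Path


def install_collection(collection_dir):
    """Replace the contents of index with the contents of building."""
    root = Path(collection_dir)
    building, index = root / "building", root / "index"
    index.mkdir(exist_ok=True)
    # Remove whatever an earlier build left in index.
    for old in list(index.iterdir()):
        if old.is_dir() and not old.is_symlink():
            shutil.rmtree(old)
        else:
            old.unlink()
    # Move each freshly built item across.
    for new in list(building.iterdir()):
        shutil.move(str(new), str(index / new.name))
```

After the call, building is empty and index holds the new build.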

The newly built collection can now be invoked from the digital library home page. If you are using the Local Library version, you will have to restart the library program. Otherwise you need only reload the home page (although caching may conceal the change, in which case a forced reload should be sufficient; alternatively you could close and restart the browser). To view the new collection, click on its icon. If it doesn’t appear on the home page, you’ve probably forgotten to move the contents of the building folder into index.

With the Web Library version, nothing needs to be restarted. What happens if a reader is actually using the previous version of the collection at the very instant the collection is moved from building to index? Basically, nothing. If she has just done a search, and then repeats it, the results list may change. The worst that can happen is that she clicks on a document in the search results that is absent from the new version of the collection. Then she will see a blank Greenstone page and, after she re-executes the search, the document will have disappeared from the search results.
