Under the Hood (Digital Library)

Now that you know how to invoke the underlying programs that build collections, you are in a position to learn a little more about what happens under the hood. This information could prove useful if you need to dig deeper into the structure of the archives folder, or to evaluate the pros and cons of different ways of choosing document identifiers, or to learn a little more about plug-ins and what they do, or to understand the many options that are available for searching.

Importing and building

The two main components of the collection-building process are importing (import.pl) and building in the narrow sense (buildcol.pl). The former brings documents and metadata into the system in a standardized XML format that is used internally. The latter creates the indexes and data structures needed to make the collection operational. Both components have many options, which can be viewed from the command line or in the Create panel in Expert mode.

The import process takes the information in the import folder (including metadata.xml files, if any), converts it to a standardized XML format (see next section), and puts the result into the archives folder. If desired, the original material can then be safely deleted, because the collection can be rebuilt from the archive files. If it is deleted, new material can be added to the collection by placing it in import and re-executing the process, this time using -keepold instead of -removeold: the new material will find its way into archives along with what is already there. If you do remove source documents from the import folder once they have been processed, do not delete the archives folder as well; otherwise the collection can no longer be augmented and rebuilt later.
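
For concreteness, here is a minimal command-line sketch of this cycle, assuming a collection called mycol (an invented name) and a shell in which Greenstone's setup script has already been run:

    # first-time build: convert import/ into archives/, then index it
    import.pl -removeold mycol
    buildcol.pl -removeold mycol

    # later, after copying new material into import/
    import.pl -keepold mycol     # adds the new documents to archives/
    buildcol.pl mycol            # re-index everything now in archives/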


Each document’s archive file occupies its own folder in the archives structure, along with any files associated with the document—such as image files included in it. The folder name is based on the document’s object identifier. The structure is rather arcane, because it is designed to work on primitive computers (e.g., early versions of Windows) that restrict the length of file names, the number of files in a folder, and the maximum nesting depth of folders.

You will also notice two files at the top level of the archives folder: archiveinf-doc.gdb and archiveinf-src.gdb. These are database files that support incremental building. The former stores information about where each document is located within archives (and what files comprise the document); the latter records where these files came from in the import folder.
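
A typical archives folder therefore looks something like the sketch below (the folder and file names are illustrative; as noted above, the real layout can be more deeply nested):

    archives/
       archiveinf-doc.gdb     where each document lives and which files it comprises
       archiveinf-src.gdb     where those files came from in the import folder
       HASH0158.dir/          one folder per document, named from its identifier
          doc.xml             the document in Greenstone Archive format
          cover.jpg           any associated files, such as images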

Incremental building

Documents can be added, removed, or changed incrementally using the incremental option of the import and build processes. In the Librarian interface, this is enabled by clicking the Minimal Rebuild check-box in the Create panel. Although the import process can always work incrementally, whether or not the building process can do so depends on the indexer being used. The indexers MG and MGPP are non-incremental, but the Lucene indexer supports true incremental building, and both building and importing are performed incrementally if the Librarian interface’s Minimal Rebuild is enabled.

When building incrementally with Lucene, the files generated must not be placed in a separate building area and installed afterwards—the indexer needs to be able to find the current indexes and change them in place. In other words, there is no "install" step to perform. This can be accomplished by adding the option

[option shown as an image in the original; not reproduced here—see the sketch below]

to the buildcol.pl command (as well as -incremental), substituting the actual name of the collection for <collectionname>. As this is rather tedious to type, Greenstone provides two extra commands: incremental-import.pl and incremental-buildcol.pl, which set these options for you. A complementary pair, full-import.pl and full-buildcol.pl, are provided for rebuilding from scratch.
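
As a sketch only—the exact option is not reproduced above, and the paths below are assumptions that depend on where Greenstone is installed—the incremental route looks roughly like this for a collection called mycol:

    # build directly into the live index area rather than a separate building area;
    # the -builddir path given here is an assumption
    import.pl -incremental mycol
    buildcol.pl -incremental -builddir /path/to/greenstone/collect/mycol/index mycol

    # or, equivalently, use the convenience scripts that set these options for you
    incremental-import.pl mycol
    incremental-buildcol.pl mycol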

A long-standing historical deficiency of Greenstone, now rectified, has been that the build process needed to be re-run from scratch when documents are added, modified, or deleted. This limitation arose because the original MG and MGPP indexers are optimized for maximum compression, which requires non-incremental operation to ensure that the word statistics on which compression is based reflect the collection as a whole. However, Lucene, which Greenstone now incorporates as an option, is capable of operating incrementally.

Scheduled rebuilding

Greenstone incorporates a user-oriented module for scheduled maintenance of collections. This automates the construction of any existing collection and schedules rebuilding to occur periodically. At the command-line level, daily rebuilding of a collection called pics is accomplished by invoking

[command shown as an image in the original; not reproduced here]

for MG- and MGPP-based collections, and

[command shown as an image in the original; not reproduced here]

for a Lucene-based collection.

This generates a script for rebuilding according to the specified options. It also inserts a record into a configuration file for time-based scheduling that calls the generated script using an operating system service called cron (for chronograph), set to be executed daily. (Cron is standard on Unix systems; Greenstone includes a port to Windows to make scheduled rebuilding work there too.) It is important that each scheduled build completes in its entirety without interference from another. To ensure this, the Perl script first checks for a lock file, which indicates that a build is already underway, so that multiple builds do not run concurrently.
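
For illustration, the scheduling record takes the same form as an ordinary crontab entry. A daily entry for the pics collection might look roughly like the following (the script name and path are hypothetical—Greenstone generates them for you):

    # run the generated rebuild script every day at midnight
    0 0 * * * /path/to/greenstone/collect/pics/gsrebuild.sh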

Scheduled rebuilding can be specified within the Librarian interface. In Expert mode, in addition to Import options and Build options, the Create panel contains a Schedule Options tab where scheduling parameters can be specified. Collections can be built hourly (on the hour), daily (at midnight), or weekly (at midnight on Sundays), and the user can request e-mail notification of each build. The output of the script is kept in a log file. Before setting up an automatic scheduled rebuild, the user can manually build and configure the collection as many times as necessary to confirm that the correct sequence is being performed.

Archive formats

When you import documents and Greenstone puts them in the archives folder, there are two standard forms in which they can be represented:

• Greenstone Archive format

• METS (see Section 6.4), using the Greenstone METS profile.

An option to the import process (saveas), which is only visible in the Librarian interface’s Expert mode, dictates which one is used; the former is the default.

The building process uses plug-ins to process documents just as the import process does, and for this to work, the list of plug-ins must include one that processes whatever archive format is being used. This is why the Greenstone Archive format plug-in (called GreenstoneXMLPlugin) is specified at the top of the list, and it cannot be deleted by Librarian-level users. However, Expert-level users can specify METS as the save format for the import process and replace GreenstoneXMLPlugin in the plug-in list with Greenstone's METS plug-in.

Using METS is a worthwhile experiment. Open any collection in the Librarian interface and switch to Expert mode. In the Create panel, change the import process's saveas option to Greenstone METS, and in the Design panel delete GreenstoneXMLPlugin and replace it with GreenstoneMETSPlugin. Build the collection, and locate its archives folder in the Windows file browser (in Greenstone → collect → <collection name> → archives). Two files are generated for each document: docmets.xml, the core METS description, and doctxt.xml, a supporting file. (Depending on how you view doctxt.xml, you may need to be connected to the Internet, because it refers to a remote resource.) Depending on the source documents, there may be additional files, such as images used in a Web page. One of the many features of METS is the ability to reference information in external XML files. This is used to tie the content of the document, which is stored in the file doctxt.xml, to its hierarchical structure, which is described in the core METS file docmets.xml.

The remainder of this section describes the Greenstone Archive format, which is used by default. (Readers may first wish to reacquaint themselves with Section 4.3, which reviews HTML and XML.) Documents are divided into sections and metadata is stored at the document or section level. One design requirement is the ability to represent any previously marked-up document that uses HTML tags, even if the markup is sloppy. Another is that archive documents must be parsed rapidly. The archive format is an XML-compliant syntax that contains explicit top-level markup for sectioning and metadata and can also embed HTML-style markup that is not interpreted at the top level.

In XML, tags are enclosed in angle brackets for markup, just like HTML tags. The archive format encodes documents that are already in HTML by escaping any embedded left or right angle bracket (<, >) or quote (") characters within the original text using the standard codes &lt;, &gt;, and &quot;. A <Section> tag signals the start of each section of the document, and the corresponding closing tag marks its end. Sections begin with a block that defines pertinent metadata. Metadata specifications give the metadata name and its value. In addition to regular metadata, the file that contains the original document is specified as gsdlsourcefilename, and files that are associated with the document, such as image files, are specified as gsdlassocfile.

Figure 11.6a shows the XML document type definition for the Greenstone Archive format. Documents are divided into sections, which can be nested. Each section has a description part that comprises zero or more metadata items, and a content part that holds the document’s contents. (The content may be null, e.g., for image files, audio files, or the dummy documents that are created by exploding metadata-only files, as described in Section 10.5). A name attribute and some textual data are associated with each metadata element. In XML, PCDATA stands for "parsed character data"— Unicode text in this case.

Figure 11.6b shows a simple document comprising a short book with two associated images. The book has two sections, called Preface and First and only chapter; the latter has two subsections. Chapters are simply top-level sections. Metadata is stored at the beginning of each section.
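
Figure 11.6b itself is not reproduced here, but a minimal sketch of such an archive document, consistent with the description above (file names and metadata values are invented, and the surrounding XML declarations are omitted), looks something like this:

    <Section>
      <Description>
        <Metadata name="gsdlsourcefilename">import/simplebook.htm</Metadata>
        <Metadata name="gsdlassocfile">cover.jpg:image/jpeg:</Metadata>
        <Metadata name="gsdlassocfile">map.jpg:image/jpeg:</Metadata>
        <Metadata name="Identifier">HASH0158f56086efffe592636058</Metadata>
        <Metadata name="Title">A simple book</Metadata>
      </Description>
      <Content></Content>
      <Section>
        <Description><Metadata name="Title">Preface</Metadata></Description>
        <Content>Escaped HTML text of the preface ...</Content>
      </Section>
      <Section>
        <Description><Metadata name="Title">First and only chapter</Metadata></Description>
        <Content></Content>
        <Section>
          <Description><Metadata name="Title">First subsection</Metadata></Description>
          <Content>...</Content>
        </Section>
        <Section>
          <Description><Metadata name="Title">Second subsection</Metadata></Description>
          <Content>...</Content>
        </Section>
      </Section>
    </Section>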

This structure serves two purposes. First, it allows readers to browse around inside documents once they have been located. When you open a book, the table of contents shows the section hierarchy. Figure 3.1 (in Chapter 3) illustrates browsing within a book that has a hierarchical table of contents showing chapters, sections, and subsections. In some collections, documents are split into pages instead, and Figure 3.2 shows (at the top right) a page selector for such a document. Chapters, sections, subsections, and pages are all "sections."

The second use of document structure is for searchable indexes. There are two levels of index: document and section, and most collections use them both. The first relates to complete documents—you use it to find all documents that contain a particular set of words. When a section index is created, each portion of text that is indexed stretches from one Section tag to the next—thus a chapter that immediately begins with a new section will produce an empty document in the index (such documents are hidden from the user). Sections and subsections are treated alike: the hierarchical structure is flattened for the purposes of creating searchable indexes. The MG indexer (but not MGPP or Lucene) provides a third level, paragraph, which treats each paragraph as a separate document and is useful for more focused searches.

Document identifiers

The import process assigns object identifiers (OIDs) to documents, which are then stored as a metadata element in the document’s archive file (Figure 11.6b). If the import process is re-executed, documents should receive the same identifier. The method of obtaining identifiers can be specified as an option to the import process (this option is available to Librarian-level users). There are four possibilities:

• hash the contents of the file

• use the value of a particular metadata item

• use a simple document count

• use the name of the parent folder.


Figure 11.6: Greenstone archive format: (a) document type definition (DTD); (b) example document

The first, which is the default method, calculates a pseudo-random number based on the content of the document—called hashing. This ensures that identifiers will be the same every time a given document is imported: if the content changes, so does the identifier; if not, the identifier remains the same. Identical copies of a document will be assigned the same identifier and thereafter treated by the system as one. The same document can appear in two different collections: if so, searching both collections will return just one copy of the document. These identifiers are character strings starting with the letters HASH: for example, HASH0158f56086efffe592636058. They are long enough that the chance of different documents receiving the same one is vanishingly small, and this possibility is ignored in the system.

For the second method of assigning OIDs, the user specifies a metadata element that holds the document’s unique identifier (havoc will ensue if it is not unique). If that metadata value is unspecified for a particular document, the hash value is used instead. (If the identifier is purely numeric, it is preceded by D to prevent confusion with the document numbers used internally.) The third method is significantly faster than hashing, but does not necessarily assign the same identifier to the same document when the collection is rebuilt from scratch with additional documents. The fourth is intended for situations where there is only one document per folder, and folder names are unique (the folder name is preceded by J).
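
At the command line these choices correspond to the import process's OIDtype option. The value names below are given as a guide only and should be checked against import.pl's own help output; treat them as assumptions:

    import.pl -OIDtype hash mycol                               # the default: hash the content
    import.pl -OIDtype assigned -OIDmetadata Identifier mycol   # use a metadata element
    import.pl -OIDtype incremental mycol                        # simple document count
    import.pl -OIDtype dirname mycol                            # name of the parent folder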

Identifiers are extended to individual sections of a document using integers separated by periods. For example, since the OID of the document in Figure 11.6b is HASH015...058, the OID of the first subsection of the First and only chapter is HASH015...058.2.1—because that chapter is the second section at the top level, and the relevant subsection is the first within it. Section-level identifiers are not stored explicitly but are used internally to represent individual document sections that are returned as the result of a search. They do not necessarily coincide with the logical numbering of chapters and sections—documents often include unnumbered parts, such as a Preface—but are only for internal use.
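
Putting this together with the example of Figure 11.6b, that document's section identifiers are laid out as follows:

    HASH015...058        the document as a whole
    HASH015...058.1      Preface (the first top-level section)
    HASH015...058.2      First and only chapter (the second top-level section)
    HASH015...058.2.1    its first subsection
    HASH015...058.2.2    its second subsection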

Plug-ins

Most of the import process’s work is accomplished by plug-ins. These operate in the order in which they appear in the collection’s configuration file. Each input file is passed to the plug-ins in turn until one is found that can process it—thus earlier plug-ins take priority over later ones. Document formats are usually determined by filename extensions—for example, foo.txt is processed as a text file, foo.html as HTML, and foo.doc as a Word file. It is also possible for a plug-in to open a file and inspect its contents before deciding whether to process it. This is the norm for XML files with the generic extension .xml, whose type can only be determined by opening them up and examining the root element. If there is no plug-in that can process the file, a warning is printed and attention passes to the next file.

Plug-ins can inherit the functionality of other plug-ins to perform common tasks, such as converting images and extracting key phrases, dates, and e-mail addresses; these are visible as separate parts of the plug-in configuration panel in the Librarian interface. Also, the Word and PDF plug-ins work by converting source documents to an intermediate HTML form and passing this to the HTML plug-in.

Plug-ins are used for both importing and building, and both processes use the same plug-in list. Importing generates archive files, which during building are processed by a special plug-in (GreenstoneXMLPlugin) that recognizes the Greenstone Archive format. Such files do not occur in the import folder, but they certainly do occur in the archives folder. ArchivesInfPlugin is used only during building: it processes the document OIDs that were produced during importing and stored in the archiveinf-doc.gdb file mentioned earlier.

DirectoryPlugin is another special plug-in that is included in every collection. It traverses the folder structure in the import directory. You place the whole file structure containing the source material into import, and DirectoryPlugin recurses through this structure. It only processes folders, and it operates by creating a list of all the files they contain (including sub-folders) and passing the name of each back through the plug-in list. The effect is to expand all directories in the hierarchy. DirectoryPlugin is the last member of the list in all collection configuration files. In the Librarian interface, MetadataXMLPlugin, DirectoryPlugin, and ArchivesInfPlugin are all invisible unless you are in Expert mode, and even then they cannot be removed. Librarian-level users are prevented from removing GreenstoneXMLPlugin, because they do not see the saveas option to the import process.
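
In the collection's configuration file this all appears as a simple ordered list of plugin lines. The sketch below is illustrative only—the exact selection and options depend on the collection:

    plugin GreenstoneXMLPlugin     # first: recognizes archive files during building
    plugin HTMLPlugin
    plugin TextPlugin
    plugin WORDPlugin
    plugin PDFPlugin
    plugin MetadataXMLPlugin       # reads metadata.xml files
    plugin ArchivesInfPlugin       # building only: supplies the stored OIDs
    plugin DirectoryPlugin         # last: expands folders into their files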

Greenstone has over 30 plug-ins—and the number is growing. It is impossible to cover them all here. A brief description of any plug-in, together with a list of all its options, can be obtained using the program pluginfo.pl—just specify the plug-in name as the argument. The most useful options can be seen in the Librarian interface, with tool-tips that give the same information as the pluginfo program (including translations into different languages).
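
For example, to see what HTMLPlugin can do, run the following from a shell in which Greenstone's setup script has been sourced:

    pluginfo.pl HTMLPlugin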

Here is a summary of the plug-ins for widely used document formats that are included in the default collection configuration file, along with the file names processed by each.

HTMLPlugin (.htm, .html; also .shtml, .shm, .asp, .php, .cgi)

HTMLPlugin processes HTML files. It extracts title metadata based on the <title> tag; other metadata expressed using HTML meta-tags can be extracted too. This plug-in has many options, some of which are discussed in Chapter 10.

WORDPlugin (.doc) and RTFPlugin (.rtf)

WORDPlugin and RTFPlugin import Microsoft Word and Rich Text Format documents, respectively. There are many different variants on the Word format—and even Microsoft programs sometimes make conversion errors. As mentioned in Chapter 10, independent programs are used to convert Word files to HTML. For some older Word formats, the system resorts to a simple extraction algorithm that finds all text strings in the input file.

PDFPlugin (.pdf)

PDFPlugin imports documents in PDF, Adobe’s Portable Document Format. Like the Word plug-in, it uses an independent program to convert PDF files to HTML. Note that encrypted PDF files cannot be processed.

PostScriptPlugin (.ps)

PostScriptPlugin imports documents in PostScript. It uses an independent program called Ghostscript, which is included as an option when installing Greenstone.

ImagePlugin (.jpg, .jpeg, .gif, .png, .bmp, .xbm, .tif, .tiff)

ImagePlugin handles images of many different kinds, using an external program called ImageMagick. It (optionally) computes thumbnail and screen-size versions of the image and extracts metadata giving its type, size, height, and width.

TextPlugin (.txt, .text)

TextPlugin interprets a plain text file as a simple document. It adds title metadata based on the file’s first line.

EmailPlugin (.email)

EmailPlugin imports files containing e-mail and deals with open e-mail formats, such as those used by the Thunderbird, Eudora, and Unix mail readers. Each source document is examined to see if it contains an e-mail, or several e-mails joined together in one file, and if so, its contents are processed. The plug-in extracts Subject, To, From, and Date metadata. However, it does not handle MIME-encoded e-mails properly—although legible, they often look rather strange.

ZIPPlugin (.gz, .z, .tgz, .taz, .bz, .zip, .jar, .tar)

ZIPPlugin handles compressed and/or archived document formats: gzip (.gz, .z, .tgz, .taz), bzip (.bz), zip (.zip, .jar), and tar (.tar).

NulPlugin (.nul)

NulPlugin handles the dummy files generated by the metadata database exploding process explained in Section 10.5.

ISISPlugin (.mst)

ISISPlugin handles metadata in CDS/ISIS format (popular in developing countries but almost unknown elsewhere). Along with the master file (.mst), CDS/ISIS databases also have field definition files (.fdt) and cross-reference files (.xrf). The Greenstone wiki contains a comprehensive document that explains how to deal with CDS/ISIS in Greenstone.

Search indexes

There are three indexers that can be used for full-text searching: MG, MGPP, and Lucene. The MG search engine, implemented in the C programming language, is described in the classic book Managing Gigabytes (mentioned in Section 3.7). It can produce separate indexes that operate at the document, section, or paragraph level—meaning that when several search terms are specified, the combination is sought in the entire document, or in an individual section, or in an individual paragraph. Searches can be either Boolean or ranked (but not both at once). A separate physical index is created for each index specified in the collection. For phrase search, MG uses a post-retrieval scan, which is potentially slow (Section 3.4). It is otherwise very fast and has been extensively tested on large collections.

MGPP, which is the default indexer for new collections, is a reimplementation of MG in the C++ programming language, with some enhancements. Behind the scenes, it operates at the word level, which allows fielded, phrase, and proximity searching to be handled efficiently. Boolean searches can be ranked. Different document/section levels, and text and metadata fields, are all handled in a single data structure, which reduces the collection size compared with MG for collections with many indexes. Phrase searching is much faster than for MG, but ordinary searching may be slightly slower due to the word-level, rather than section-level, indexing.

Lucene was developed by the Apache Software Foundation and is written in Java. It handles field and proximity searching, but only at a single level—which means that documents and sections require separate indexes. Its range of functionality is similar to that of MGPP, with the addition of single-character wildcards, range searching, and sorting of search results by metadata fields. It also has a "fuzzy search" option that allows approximate searching. It was added to facilitate incremental collection building, which MG and MGPP do not provide.

Adding and configuring indexes

Indexes are added in the Search Indexes section of the Librarian interface’s Design panel. A configuration panel pops up allowing you to determine whether the index should include the full text of documents, along with any selection of metadata fields. (See the discussion of building the Word and PDF collection in Section 10.3.) As well as allowing selection of any set of metadata fields, MGPP and Lucene provide an Add All button as an easy way of adding all metadata and text sources as individual indexes. Also, a special index is available for MGPP and Lucene (called allfields) that searches over all specified indexes without having to specify a separate index that contains all sources. To add this, check the Add combined searching over all assigned indexes box on the index configuration panel and click Add Index.

The top right of the Search Indexes panel shows which indexer will be used when the collection is built. When the indexer is changed, a window pops up with a list of three options: MG, MGPP, and Lucene. Changing indexers affects how the indexes are built and may affect search functionality. The default structure creates three indexes: full text, titles, and file names (the last is rarely useful). Indexes can be at document, section, or (for MG only) paragraph level.

With MGPP and Lucene, the levels are document, section, or both. Levels are determined globally for all indexes, and the two possibilities are shown as check-boxes on the main Search Indexes panel. With MG, each index has its own level—document, section, or paragraph—which is chosen by the Indexing level selector on its configuration panel.
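
Behind the scenes, these choices end up in the collection's configuration file (etc/collect.cfg). The lines below are a rough sketch from memory rather than a verbatim extract, so treat the exact syntax as an assumption:

    # MGPP or Lucene: levels are set once, for all indexes
    buildtype lucene
    indexes text ex.Title ex.Source
    levels document section

    # MG: each index carries its own level
    buildtype mg
    indexes document:text document:ex.Title section:text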

For MGPP, stemming, case-folding, and accent-folding are determined by check-boxes on the main Search Indexes panel. For MG, only stemming and case-folding can be selected. For Lucene there is no choice. There is also a check-box for the Chinese-Japanese-Korean (CJK) languages, which separates ideographs by spaces for searching—it is a place-holder for more sophisticated schemes for word segmentation, described in Section 8.4.

Partitioning indexes

Indexes are built on particular text or metadata sources. As you learned when building the large HTML collection in Section 10.3, the search space can be further controlled by partitioning the indexes. In Section 10.3, you designed a filter based on a metadata value, but indexes can also be partitioned based on the language of the documents.

Note that for MG collections, the total number of partitions generated is a combination of all indexes, sub-collection filters, and languages chosen. Two indexes with two subcollection filters in two languages would yield eight index partitions, which would result in eight separate index data structures (that is, eight separate physical indexes) when the collection was built. For MGPP, all logical indexes are created in one physical index, so there would only be four separate index partitions. For Lucene, the number of physical indexes is determined by the number of levels assigned to the collection, one index per level. In the above situation, if one level were assigned, there would be four physical indexes, while with two levels there would be eight.

Experimenting with MGPP and Lucene

To experiment with the search options for the MGPP indexer, start a new collection (File → New) and base it on the Greenstone Demo collection. In the Gather panel, drag all the folders in sample_files → demo → HTML into the new collection. In the Search Indexes section of the Design panel, note that the MGPP indexer is being used. This is because the original Demo collection on which this collection is based uses MGPP. Note the three options at the bottom of the panel—stem, case-fold, and accent-fold. Enabled options appear on the collection’s Preferences page. Under Indexing Levels, select section as well as document. Build and preview the collection.

To use the Lucene indexer, click the Change button at the top right corner of the panel and select Lucene in the window that pops up. Build the collection and experiment with searching. Lucene provides range searching and single- and multi-letter wildcards. The full query syntax is quite complex (see http://lucene.apache.org/java/docs). To get a brief taste, here is how to use wildcards. The character * stands for any sequence of letters—a multi-letter wildcard—and can be appended to any query term. For example, econom* searches for words like econometrics, economist, economical, and economy. The character ? is a single-letter wildcard. For example, searching for economi?? will match words like economist, economics, and economies that have just two more letters. Note that Lucene uses stopwords by default, so searching for words like the returns no matches (an appropriate message appears on the search results page).

MGPP does not support wildcards, but unlike Lucene it can optionally stem, case-fold, and accent-fold, as noted above. Both are enabled by default during the building process (but on the collection’s Preferences panel stemming is enabled and case-folding disabled by default). Change the indexer back to MGPP and rebuild the collection. Note that searching for econom returns no documents, while searching for fao and FAO gives the same result—78 words; nine matched documents. Go to the Preferences page and change "whole word must match" to "ignore word endings." When you return to the search page and seek econom again, several documents are found. To avoid confusion later on, revert to the original Preferences setting. To differentiate upper-case from lower-case, set the "ignore case differences" option appropriately and search for fao and FAO again. This time the results are different. Again, revert to the original setting.

Behind the scenes, these effects are achieved by appending modifiers to the query. Appending #s to a query explicitly enables the "ignore word endings" option—if you search for econom#s, you will find several matches, even though the mode on the Preferences page has been restored to "whole word must match." These modifiers override the Preferences settings. Likewise, appending #u makes the current query explicitly match whole words, irrespective of the Preferences setting. Modifiers #i and #c control case sensitivity. Appending #i to a query term ignores case differences, while appending #c takes them into account. For example, searching for fao#c returns no documents. Modifiers can be used in combination. Appending #uc to a query term will match the whole term (without stemming) in its exact form (differentiating upper- from lower-case).
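
To summarize the modifiers (all taken from the behaviour described above):

    econom#s     stem the term (ignore word endings), whatever the Preferences say
    econom#u     match the whole word only, without stemming
    fao#i        ignore case differences
    fao#c        respect case (so fao#c finds nothing if the text contains only FAO)
    econom#uc    whole word, exact case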

MG also supports stemming and case-folding, but in a more primitive way, and these options cannot be chosen for individual query terms as they can with MGPP.
