contentextraction
Description: Indexing binary content
Namespace: contentextraction="http://exist-db.org/xquery/contentextraction"
Type/default status: Java; enabled in $EXIST_HOME/conf.xml; enabled in $EXIST_HOME/extensions/build.properties
Class: org.exist.contentextraction.xquery.ContentExtractionModule
The contentextraction module is an absolute wonder! Built on the Apache Tika
toolkit, it allows you to extract text content from binary resources such as Word
or PDF files. Anyone who has ever had to cope with this problem knows how
difficult, frustrating, and time-consuming it can be.
The Tika toolkit supports many formats. For instance, you can feed it an HTML page
and it will output well-formed and valid XHTML. Feed it a text file (in any character
encoding) and it will neatly create paragraphs from it. Other formats supported
include Microsoft Office, Open Document, PDF, ePub, RTF, mbox, and even the
textual/metadata parts of audio, image, and video files (see the Tika website for details).
To use the contentextraction module, insert the following import module
statement in the prolog of your XQuery script:
import module namespace content = "http://exist-db.org/xquery/contentextraction"
    at "java:org.exist.contentextraction.xquery.ContentExtractionModule";
The module has three functions (use the XQuery Function Documentation app from
the dashboard to inspect them). The following code will return the metadata and
contents of a (recognized) binary file:
let $file := 'some/path/to/a/binary/file'
return
    content:get-metadata-and-content(util:binary-doc($file))
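To get a feel for what comes back, the following is a minimal sketch that summarizes the result instead of returning it whole. It assumes the function returns an XHTML-style document with meta elements in the head and the extracted text in the body, which is how Tika normally structures its output. The path /db/apps/docs/report.pdf and the summary element names are hypothetical, chosen only for illustration.

```xquery
xquery version "3.0";

import module namespace content = "http://exist-db.org/xquery/contentextraction"
    at "java:org.exist.contentextraction.xquery.ContentExtractionModule";

(: Hypothetical path; replace it with a binary resource stored in your database :)
let $file := '/db/apps/docs/report.pdf'
let $extracted := content:get-metadata-and-content(util:binary-doc($file))
(: The *: wildcard avoids assuming a particular namespace on the result :)
let $text := normalize-space(string-join($extracted//*:body//text(), ' '))
return
    <summary>
        <metadata-fields>{ count($extracted//*:meta) }</metadata-fields>
        <text-length>{ string-length($text) }</text-length>
        <text-start>{ substring($text, 1, 200) }</text-start>
    </summary>
```

Inspecting the raw result once for a representative file is still worthwhile, because the metadata fields that Tika reports differ from format to format.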
An interesting use case of contentextraction is, of course, indexing using the
full-text index capabilities of eXist and allowing the user to search binary documents
stored in the database (read more about this in “Manual Full-Text Indexing” on page
301). For indexing large documents the third, somewhat complicated, function,
content:stream-content, comes in handy. There is an interesting content extraction
example in the eXist-db demo apps (available through the dashboard) that uses this.
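The indexing use case can be sketched roughly as follows for a small document. This is an illustration under stated assumptions, not the approach used in the demo apps: it assumes that the Lucene module is enabled with its usual ft prefix and that your eXist version provides ft:index taking a resource path and a doc/field description. The path and the field name "content" are hypothetical. For large documents, content:stream-content would be the better fit, as noted above.

```xquery
xquery version "3.0";

import module namespace content = "http://exist-db.org/xquery/contentextraction"
    at "java:org.exist.contentextraction.xquery.ContentExtractionModule";

(: Assumption: the ft prefix (Lucene module) is predeclared, like util :)

(: Hypothetical path to a binary resource stored in the database :)
let $file := '/db/apps/docs/report.pdf'
let $extracted := content:get-metadata-and-content(util:binary-doc($file))
(: Collapse the extracted body text into one string for indexing :)
let $text := normalize-space(string-join($extracted//*:body//text(), ' '))
(: The field name "content" is an arbitrary choice for this sketch :)
let $index-doc :=
    <doc>
        <field name="content" store="yes">{ $text }</field>
    </doc>
return
    (: Attach the extracted text to the binary resource in the Lucene index :)
    ft:index($file, $index-doc)
```

Check the function documentation of your installation before relying on this, since the manual indexing functions have changed between eXist releases.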