contentextraction
Description: Indexing binary content
Namespace: contentextraction="http://exist-db.org/xquery/contentextraction"
Type/default status: Java; enabled in $EXIST_HOME/conf.xml; enabled in $EXIST_HOME/extensions/build.properties
Class: org.exist.contentextraction.xquery.ContentExtractionModule
The contentextraction module is an absolute wonder! It allows you to extract text
content from binary resources like Word or PDF files. It is based on the Apache Tika
toolkit, which spares you a task that can be time-consuming to do by hand.
The Tika toolkit supports many formats. For instance, you can feed it an HTML page
and it will output well-formed and valid XHTML. Feed it a text file (in any character
encoding) and it will neatly create paragraphs from it. Other formats supported
include Microsoft Office, Open Document, PDF, ePub, RTF, mbox, and even the
textual/metadata parts of audio, image, and video files (see the Tika website for details).
To use the contentextraction module, insert the following import module statement
in the prolog of your XQuery script:
    import module namespace content = "http://exist-db.org/xquery/contentextraction"
        at "java:org.exist.contentextraction.xquery.ContentExtractionModule";
The module has three functions (use the XQuery Function Documentation app from
the dashboard to inspect them). The following code will return the metadata and
contents of a (recognized) binary file:
    let $file := 'some/path/to/a/binary/file'
    return
        content:get-metadata-and-content(util:binary-doc($file))
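The call above returns the extraction result as a single document. As a minimal sketch, assuming that result is XHTML with metadata in head/meta elements and the extracted text in body (which is how Tika typically reports it), the following query summarizes what was found. The path /db/docs/report.pdf and the summary element are hypothetical and exist only for illustration.

```xquery
xquery version "3.0";

import module namespace content = "http://exist-db.org/xquery/contentextraction"
    at "java:org.exist.contentextraction.xquery.ContentExtractionModule";

(: Assumption: the extraction result is in the XHTML namespace. :)
declare namespace xhtml = "http://www.w3.org/1999/xhtml";

(: Hypothetical path; point this at a binary resource in your own database. :)
let $file := '/db/docs/report.pdf'
let $extracted := content:get-metadata-and-content(util:binary-doc($file))
return
    <summary>
        <title>{ string(($extracted//xhtml:title)[1]) }</title>
        {
            (: Copy each metadata entry into a simpler name/value form. :)
            for $meta in $extracted//xhtml:meta
            return <meta name="{ $meta/@name }">{ string($meta/@content) }</meta>
        }
        <text-length>{
            string-length(string-join($extracted//xhtml:body//text(), ' '))
        }</text-length>
    </summary>
```

Inspecting the raw result for one of your own files first is the safest way to confirm which metadata names Tika reports for that format.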
An interesting use case of contentextraction is, of course, indexing using the
full-text index capabilities of eXist and allowing the user to search binary documents
stored in the database (read more about this in “Manual Full-Text Indexing” on page
301). For indexing large documents the third, somewhat complicated, function,
content:stream-content, comes in handy. There is an interesting content extraction
example in the eXist-db demo apps (available through the dashboard) that uses this.