Biomedical Engineering Reference
In-Depth Information
update to the original document needs to be indexed again (e.g. an
update is supplied for a document from the external suppliers), then
all annotations on the document would be lost without it being stored
and re-applied from the database;
rich document (e.g. Word, PDF) handling. This is provided by
integrating SOLR with the Apache Tika system [2], and is useful when
indexing a diverse set of documents in a file system. The text extracted
can also then be passed through other systems for further processing
using a single text-based pipeline;
geospatial search. Basic geospatial search has been added to our
index, and documents could also be marked with the
author's location (e.g. country, postcode, longitude, latitude). This
means that clustering of work can be investigated. In addition, it may
be possible to utilise this capability to look for elements that are similar
to each other.
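The rich document handling and geospatial search items above can be sketched as two request URLs. This is only an illustration, not the configuration used for the index described here: the core URL, the document id, and the author_location field name are assumptions, and the field would need to be defined as a spatial type in the schema. The /update/extract handler and the geofilt filter with its sfield, pt and d parameters are standard SOLR features.

```python
from urllib.parse import urlencode

# Hypothetical core URL; the real host and core name are not given in the text.
SOLR_BASE = "http://localhost:8983/solr/publications"


def extract_url(doc_id: str, commit: bool = True) -> str:
    """URL for posting a Word/PDF file to the Tika-backed extracting handler."""
    params = {"literal.id": doc_id, "commit": str(commit).lower()}
    return f"{SOLR_BASE}/update/extract?{urlencode(params)}"


def geo_query_url(lat: float, lon: float, radius_km: float,
                  field: str = "author_location") -> str:
    """URL for a query restricted to documents within radius_km of a point.

    The field name is an assumption made for this sketch.
    """
    params = {
        "q": "*:*",
        "fq": "{!geofilt}",      # spatial filter
        "sfield": field,         # field holding "lat,lon"
        "pt": f"{lat},{lon}",    # centre point
        "d": radius_km,          # distance in kilometres
        "wt": "json",
    }
    return f"{SOLR_BASE}/select?{urlencode(params)}"


if __name__ == "__main__":
    print(extract_url("doc-1"))
    print(geo_query_url(51.5, -0.12, 50))
```

The file itself would be sent as the body of a POST to the first URL. The second URL can be fetched with any HTTP client.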
SOLR is highly scalable, providing distributed search and index replication.
At the time of writing, our index provides sub-second searching
on over 50 million documents. It is written in Java (utilising the Lucene Java
search library) and runs as a server within a servlet container such as Jetty
or Tomcat. It has REST-like HTTP/XML and JSON APIs that have made it
easy for us to integrate it with a number of applications. In addition, we have
experimented with extending it through Java plug-ins, which would allow us to
implement specialist searches, should this be required.
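As a rough illustration of why the HTTP/JSON API is easy to integrate, the sketch below builds a query URL for the standard /select handler and reads the hit count and documents from a JSON response. The base URL, the query, and the sample response body are made up for the example. Only the overall response shape (a response object containing numFound and docs) follows SOLR's JSON format.

```python
import json
from urllib.parse import urlencode


def select_url(base: str, query: str, rows: int = 10, start: int = 0) -> str:
    """Build a URL for the /select handler that asks for a JSON response."""
    params = {"q": query, "rows": rows, "start": start, "wt": "json"}
    return f"{base}/select?{urlencode(params)}"


def parse_hits(body: str):
    """Return (total number of matches, list of returned documents)."""
    response = json.loads(body)["response"]
    return response["numFound"], response["docs"]


# Invented sample body, shaped like a SOLR JSON response.
SAMPLE = """
{"responseHeader": {"status": 0, "QTime": 3},
 "response": {"numFound": 2, "start": 0,
              "docs": [{"id": "doc-1", "title": "Example A"},
                       {"id": "doc-2", "title": "Example B"}]}}
"""

if __name__ == "__main__":
    print(select_url("http://localhost:8983/solr/publications", "title:kinase", rows=5))
    total, docs = parse_hits(SAMPLE)
    print(total, [d["id"] for d in docs])
```

Because the client side only needs URL encoding and JSON parsing, the same pattern carries over to most languages and tools.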
The SOLR index has been found to be reliable, scalable, and quick to
install and configure. Performance-wise, it has proved to be comparable
to similar commercial systems (better in some cases). In addition, the
faceting functionality provided has proved invaluable. The API has
proved easy to integrate into a number of applications, ranging from
bespoke applications to Excel spreadsheets. However, the main
disadvantages found have been the lack of tools to configure, manage
and monitor the running system, and the lack of support.
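The faceting functionality mentioned above can be sketched as follows. The field names (journal, year) and the sample counts are hypothetical. The facet, facet.field and facet.limit parameters are standard, and by default SOLR's JSON response returns each facet as a flat list alternating between value and count, which the helper below turns into a dictionary.

```python
from urllib.parse import urlencode


def facet_params(query: str, fields, limit: int = 10) -> str:
    """Query string requesting facet counts only (rows=0 returns no documents)."""
    params = [("q", query), ("rows", 0), ("wt", "json"),
              ("facet", "true"), ("facet.limit", limit)]
    # facet.field may be repeated, once per field to facet on.
    params += [("facet.field", f) for f in fields]
    return urlencode(params)


def facet_counts(payload: dict, field: str) -> dict:
    """Convert the flat [value, count, value, count, ...] list into a dict."""
    flat = payload["facet_counts"]["facet_fields"][field]
    return dict(zip(flat[0::2], flat[1::2]))


if __name__ == "__main__":
    print(facet_params("cancer", ["journal", "year"]))
    # Invented payload in the shape of SOLR's default JSON facet output.
    sample = {"facet_counts": {"facet_fields": {"journal": ["Nature", 12, "Cell", 7]}}}
    print(facet_counts(sample, "journal"))
```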
14.4 Creating the foundation layer
The AstraZeneca SOLR publication management system was first
developed to provide a data source to the KOL Miner system (more
information given later). This was to be a system that would
automatically identify KOLs, which is especially important for groups whose
effectiveness depends on knowing which external partners to work with.
 