Developing scientific business applications using open source search and visualisation technologies - Open Source Software in Life Science Research


update to the original document needs to be indexed again (e.g. an update is supplied for a document from the external suppliers), then all annotations on the document would be lost unless they were stored in, and re-applied from, the database;

■ rich document (e.g. Word, PDF) handling. This is provided by integrating SOLR with the Apache Tika system [2], and is useful when indexing a diverse set of documents in a file system. The extracted text can then be passed through other systems for further processing using a single text-based pipeline;

■ geospatial search. Basic geospatial search has been added to our index, but there is scope to go further by marking documents with the author's location (e.g. country, postcode, longitude, latitude). This would allow the geographical clustering of work to be investigated. In addition, it may be possible to utilise this capability to look for elements that are similar to each other.
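
As an illustration of the geospatial item above, the following minimal sketch builds the request parameters for a radius search using SOLR's `geofilt` spatial filter. The field name `author_location`, the query text and the coordinates are hypothetical, since the chapter does not describe the index schema; the field is assumed to be a spatial (latitude, longitude) field.

```python
from urllib.parse import urlencode, parse_qs


def geofilt_params(query, field, lat, lon, radius_km):
    """Build SOLR parameters that restrict hits to a radius around a point."""
    return {
        "q": query,
        "fq": "{!geofilt}",    # SOLR's spatial radius filter
        "sfield": field,       # spatial field to filter on (hypothetical name)
        "pt": f"{lat},{lon}",  # centre point as "lat,lon"
        "d": str(radius_km),   # radius in kilometres
        "wt": "json",          # ask for a JSON response
    }


# Hypothetical example: documents whose author is within 50 km of a point.
params = geofilt_params("kinase inhibitor", "author_location", 53.26, -2.13, 50)
query_string = urlencode(params)
print(query_string)
```

The resulting query string would be appended to the index's select URL; nothing here contacts a server.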

SOLR is highly scalable, providing distributed search and index replication; at the time of writing, our index provides sub-second searching over more than 50 million documents. It is written in Java (utilising the Lucene Java search library) and runs as a server within a servlet container such as Jetty or Tomcat. It has REST-like HTTP/XML and JSON APIs that have made it easy for us to integrate it with a number of applications. In addition, we have experimented with extending it through Java plug-ins, which would allow us to implement specialist searches, should this be required.
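
To give a flavour of the REST-like HTTP and JSON interface described above, here is a minimal sketch that builds a search URL and reads titles from a JSON response body. The base URL, the `title` field and the sample response are assumptions for illustration only and are not taken from the system described in this chapter; no server is contacted.

```python
import json
from urllib.parse import urlencode

# Assumed address of a local SOLR instance (hypothetical).
SOLR_BASE = "http://localhost:8983/solr"


def select_url(query, rows=10, start=0):
    """Build a URL for the select handler, asking for JSON output."""
    params = {"q": query, "rows": rows, "start": start, "wt": "json"}
    return f"{SOLR_BASE}/select?{urlencode(params)}"


def titles(response_text):
    """Return the title of each hit in a SOLR JSON response body."""
    body = json.loads(response_text)
    return [doc.get("title") for doc in body["response"]["docs"]]


# Hand-written sample in the shape of a SOLR JSON response.
sample = json.dumps({
    "responseHeader": {"status": 0, "QTime": 3},
    "response": {"numFound": 2, "start": 0, "docs": [
        {"id": "doc-1", "title": "First paper"},
        {"id": "doc-2", "title": "Second paper"},
    ]},
})

url = select_url("title:asthma", rows=5)
print(url)
print(titles(sample))
```

Because a query is just a URL and the response is plain JSON, any client that can issue an HTTP GET can be integrated in much the same way.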

The SOLR index has been found to be reliable, scalable, and quick to install and configure. Performance-wise, it has proved to be comparable to similar commercial systems (better in some cases). In addition, the faceting functionality provided has proved invaluable. The API has proved easy to integrate into a number of applications, ranging from bespoke applications to Excel spreadsheets. However, the main disadvantages found have been the lack of tools to configure, manage and monitor the running system, and the lack of support.
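
The faceting mentioned above can be sketched as follows: the request asks SOLR to count hits per value of a field, and the response is turned into a simple value-to-count mapping. The field name `journal` and the sample counts are hypothetical. The parsing assumes SOLR's default JSON layout for field facets, a flat list that alternates value and count.

```python
from urllib.parse import urlencode


def facet_params(query, field, limit=10):
    """Parameters for a facet-only request (rows=0 returns no documents)."""
    return {
        "q": query,
        "rows": 0,
        "facet": "true",
        "facet.field": field,   # field to count values of (hypothetical name)
        "facet.limit": limit,   # maximum number of facet values returned
        "facet.mincount": 1,    # skip values with no matching documents
        "wt": "json",
    }


def facet_counts(body, field):
    """Convert the flat [value, count, value, count, ...] list to a dict."""
    flat = body["facet_counts"]["facet_fields"][field]
    return dict(zip(flat[0::2], flat[1::2]))


# Hand-written sample in the shape of the facet section of a response.
sample_body = {
    "facet_counts": {
        "facet_fields": {"journal": ["Nature", 12, "Thorax", 7]}
    }
}

facet_query = urlencode(facet_params("asthma", "journal"))
counts = facet_counts(sample_body, "journal")
print(facet_query)
print(counts)
```

A mapping of this kind is what a client application would typically use to draw filter lists or summary charts alongside the search results.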

14.4 Creating the foundation layer

The AstraZeneca SOLR publication management system was first developed to provide a data source for the KOL Miner system (more information is given later). This was to be a system that would automatically identify KOLs, which is especially important for groups whose effectiveness depends on knowing which external partners to work with.
