EMBOSS – A sequence analysis package - Essays in Bioinformatics


given researcher. For example, if a program for carrying out a protein motif search were

desired, then the keyword “motif” might be given to the wossname application. The

resulting output would look like this:

SEARCH FOR 'MOTIF'
helixturnhelix  Report nucleic acid binding motifs
meme            Motif detection
patmatdb        Search a protein sequence with a motif
patmatmotifs    Search a PROSITE motif database with a protein sequence
prosextract     Builds the PROSITE motif database for patmatmotifs to search
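The keyword lookup shown above can be pictured as a case-insensitive substring search over one-line program descriptions. The following is a minimal Python sketch of that idea only, not the wossname implementation. The `PROGRAMS` table and the `search_programs` function are hypothetical names; the first five entries are copied from the output above, and the `seqret` description is paraphrased purely to provide a non-matching entry.

```python
# Illustrative table of program names and one-line descriptions.
# The first five entries come from the wossname output quoted in the text.
PROGRAMS = {
    "helixturnhelix": "Report nucleic acid binding motifs",
    "meme": "Motif detection",
    "patmatdb": "Search a protein sequence with a motif",
    "patmatmotifs": "Search a PROSITE motif database with a protein sequence",
    "prosextract": "Builds the PROSITE motif database for patmatmotifs to search",
    "seqret": "Reads and writes sequences",  # paraphrased, non-matching entry
}


def search_programs(keyword, programs=PROGRAMS):
    """Return sorted (name, description) pairs whose description contains keyword."""
    needle = keyword.lower()
    return sorted(
        (name, desc) for name, desc in programs.items() if needle in desc.lower()
    )


if __name__ == "__main__":
    print("SEARCH FOR 'MOTIF'")
    for name, desc in search_programs("motif"):
        print(f"{name:<15} {desc}")
```

A real installation would answer the same question directly from the command line; the sketch is only meant to make the matching rule explicit.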

As each EMBOSS program is typically designed to perform a single analytical step,

it is often the case that several programs are required to achieve any given user objective.

For example, a multiple sequence file can be generated with the seqret program and read

directly into emma (the multiple sequence alignment program clustalw customised for

inclusion into the EMBOSS package). The output from emma is a series of sequences

containing gap characters to represent insertions or deletions throughout the alignment.

This can be manipulated to appeal more to the human eye by reading it into an alignment

viewer such as prettyplot.
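The three-step chain described above (seqret, then emma, then prettyplot) can be sketched as a small Python driver. This is an illustration under stated assumptions rather than a recommended workflow: the file names and the helper functions `build_pipeline` and `run_pipeline` are hypothetical, and the qualifier names (`-sequence`, `-outseq`, `-dendoutfile`, `-sequences`, `-graph`) are given as recalled from the EMBOSS documentation and should be checked against each program's `-help` output for the installed version.

```python
import shutil
import subprocess


def build_pipeline(input_seqs, prefix="demo"):
    """Return the three command lines, in order, as argument lists."""
    multi_fasta = f"{prefix}.fasta"   # hypothetical intermediate file names
    alignment = f"{prefix}.aln"
    guide_tree = f"{prefix}.dnd"
    return [
        # 1. Collect the input sequences into one multiple sequence file.
        ["seqret", "-sequence", input_seqs, "-outseq", multi_fasta],
        # 2. Align them with emma (the EMBOSS interface to clustalw).
        ["emma", "-sequence", multi_fasta, "-outseq", alignment,
         "-dendoutfile", guide_tree],
        # 3. Render the gapped alignment for viewing (assumes PNG support).
        ["prettyplot", "-sequences", alignment, "-graph", "png"],
    ]


def run_pipeline(commands):
    """Run each command in turn, stopping at the first failure."""
    for cmd in commands:
        if shutil.which(cmd[0]) is None:
            raise FileNotFoundError(f"{cmd[0]} not found on PATH")
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Only print the commands; call run_pipeline() on a machine with EMBOSS.
    for cmd in build_pipeline("my_sequences.fasta"):
        print(" ".join(cmd))
```

Each step writes a file that the next step reads, which mirrors the one-program-per-analytical-step design described in the text.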

There is also a more sophisticated multiple sequence alignment viewer and editor

called the Jemboss Alignment Editor, which can be invoked from the EMBOSS command

line or the Jemboss GUI.

2.2 Input and Output formats

Many previous bioinformatics packages and databases have defined their own sequence

formats, which have become standards. EMBOSS has been created to recognise all these

standard sequence format types. Thus, the input to EMBOSS applications is not restricted

to sequences stored in a particular way (in contrast to the GCG package). Files generated by

using GCG may be read into EMBOSS applications. The default output sequence format in

EMBOSS is fasta, but almost any other supported format may be specified (in total, EMBOSS

supports 42 different sequence formats).

fasta format is a common, simple format for sequences, and can be recognised by a

sequence description line followed immediately by the sequence itself. A “greater than”

sign (>) starts the description line, thus identifying it. Multiple sequences can be stored in the

same file, with the description line separating the individual sequences from each other.
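The layout just described (a description line starting with ">", followed by the sequence, repeated for each record) is simple enough to read with a few lines of code. The following Python sketch is illustrative only: `parse_fasta` and the `EXAMPLE` records are hypothetical, and the parser ignores refinements that real tools handle, such as comments or format auto-detection.

```python
def parse_fasta(text):
    """Return a list of (description, sequence) tuples from FASTA text."""
    records = []
    description = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new description line closes the previous record, if any.
            if description is not None:
                records.append((description, "".join(chunks)))
            description = line[1:].strip()
            chunks = []
        elif description is not None:
            # Sequence lines are concatenated until the next ">" line.
            chunks.append(line)
    if description is not None:
        records.append((description, "".join(chunks)))
    return records


# Two made-up records, used only to exercise the parser.
EXAMPLE = """>seq1 first example sequence
ACGTACGT
ACGT
>seq2 second example sequence
TTGACA
"""

if __name__ == "__main__":
    for desc, seq in parse_fasta(EXAMPLE):
        print(desc, len(seq))
```

Because each ">" line marks where one record ends and the next begins, several sequences can share a file, which is exactly what a format without description lines cannot offer.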

Other sequence formats are also common. raw, or plain, format consists, as its name

suggests, simply of the sequence on its own. Certain applications will only accept

this type of format, but it does not hold an ID or sequence name and cannot be used in a

multiple sequence file, as there is no indication of where each sequence starts or ends. GCG

format was devised for the GCG package, and has several lines of description, together

with sequence numbering. One of the features of a GCG format is a “checksum”. This is a

number relating directly to the sequence, and was implemented in the days when file

transfer was less reliable than today. The intention was to allow researchers to know that

they transferred an intact and correct sequence. In the current phase of more reliable

networking, this has proved a hindrance in many cases, as the sequence cannot be manually

edited (the checksum would no longer match) before being input into another GCG or other

software application. Standard formats are also used for feature and alignment output: for

example, gff format for protein features and markx format for alignments. The versatility

of EMBOSS means that any sequence format

can be read into an appropriate application, or may be changed to an alternative format.
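To see why a checksum gets in the way of manual editing, it helps to compute one. The Python sketch below uses a position-weighted sum of character codes, reduced modulo 10000, which is the scheme commonly described for GCG files; treat that scheme as an assumption here rather than a specification. The function name `gcg_style_checksum` and the example sequences are hypothetical.

```python
def gcg_style_checksum(sequence):
    """Position-weighted checksum (weights cycle 1..57), modulo 10000.

    Assumed scheme: each residue's upper-case character code is multiplied
    by a weight that cycles from 1 to 57 along the sequence.
    """
    total = 0
    for index, residue in enumerate(sequence):
        weight = (index % 57) + 1
        total += weight * ord(residue.upper())
    return total % 10000


if __name__ == "__main__":
    original = "ACGTACGTAC"
    edited = "ACGTTCGTAC"  # one residue changed by hand
    print(gcg_style_checksum(original), gcg_style_checksum(edited))
```

Changing a single residue changes the computed value, so a hand-edited file whose stored checksum was not updated no longer agrees with its own sequence.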

In addition to allowing access to sequences in many formats stored in local files and

sequences stored in locally managed databases, EMBOSS is able to access sequences stored
