given researcher. For example, if a program for carrying out a protein motif search were
desired, then the keyword “motif” might be given to the wossname application. The
resulting output would look like this:
SEARCH FOR 'MOTIF'
helixturnhelix  Report nucleic acid binding motifs
meme            Motif detection
patmatdb        Search a protein sequence with a motif
patmatmotifs    Search a PROSITE motif database with a protein sequence
prosextract     Builds the PROSITE motif database for patmatmotifs to search
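The keyword lookup can be pictured as a case-insensitive match against each program's name and one-line description. The following Python sketch is an illustration only, built from the five entries listed above; it does not reproduce how wossname itself is implemented, and the names CATALOGUE and search_programs are hypothetical.

```python
# Hypothetical catalogue: the five programs and descriptions from the listing above.
CATALOGUE = {
    "helixturnhelix": "Report nucleic acid binding motifs",
    "meme": "Motif detection",
    "patmatdb": "Search a protein sequence with a motif",
    "patmatmotifs": "Search a PROSITE motif database with a protein sequence",
    "prosextract": "Builds the PROSITE motif database for patmatmotifs to search",
}


def search_programs(keyword, catalogue=CATALOGUE):
    """Return (name, description) pairs whose name or description contains keyword.

    Assumption: a simple case-insensitive substring match, which may differ
    from the matching rules of the real application.
    """
    needle = keyword.lower()
    return [
        (name, description)
        for name, description in sorted(catalogue.items())
        if needle in name.lower() or needle in description.lower()
    ]


if __name__ == "__main__":
    keyword = "motif"
    print(f"SEARCH FOR '{keyword.upper()}'")
    for name, description in search_programs(keyword):
        print(f"{name:<15} {description}")
```

Searching this small catalogue for "prosite" would return only patmatmotifs and prosextract, since only their descriptions mention that database.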
Because each EMBOSS program is typically designed to perform a single analytical step,
several programs are often required to achieve a given user objective.
For example, a multiple sequence file can be generated with the seqret program and read
directly into emma (the multiple sequence alignment program clustalw customised for
inclusion into the EMBOSS package). The output from emma is a series of sequences
containing gap characters to represent insertions or deletions throughout the alignment.
This output can be made easier for a human to read by loading it into an alignment
viewer such as prettyplot.
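The chain of programs described above (seqret, then emma, then an alignment viewer) can be sketched as three command lines in which each output file becomes the next input. The Python sketch below only assembles and, optionally, runs those commands. The file names, the function names and the choice of PNG graphics are assumptions made for illustration, and the qualifiers shown should be checked against each program's -help output on the local installation.

```python
import shutil
import subprocess


def build_pipeline(input_usa, prefix="example"):
    """Return the three command lines as argument lists; nothing is executed.

    input_usa is whatever sequence specification seqret should read
    (hypothetical here); prefix names the intermediate files.
    """
    multi = f"{prefix}.fasta"   # multiple sequence file written by seqret
    aligned = f"{prefix}.aln"   # gapped alignment written by emma
    tree = f"{prefix}.dnd"      # guide tree written by emma
    return [
        ["seqret", "-sequence", input_usa, "-outseq", multi, "-auto"],
        ["emma", "-sequence", multi, "-outseq", aligned,
         "-dendoutfile", tree, "-auto"],
        ["prettyplot", "-sequences", aligned, "-graph", "png", "-auto"],
    ]


def run_pipeline(commands):
    """Run each command in order, stopping at the first failure."""
    for command in commands:
        if shutil.which(command[0]) is None:
            raise FileNotFoundError(f"{command[0]} was not found on PATH")
        subprocess.run(command, check=True)


if __name__ == "__main__":
    # Print the commands rather than running them, so that the sketch
    # works without an EMBOSS installation.
    for command in build_pipeline("sequences.fasta"):
        print(" ".join(command))
```

Keeping the command construction separate from execution makes it possible to inspect the whole chain before anything is run.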
There is also a more sophisticated multiple sequence alignment viewer and editor,
the Jemboss Alignment Editor, which can be invoked from the EMBOSS command
line or the Jemboss GUI.
2.2 Input and Output formats
Many previous bioinformatics packages and databases have defined their own sequence
formats, which have become standards. EMBOSS has been created to recognise all these
standard sequence format types. Thus, the input to EMBOSS applications is not restricted
to sequences stored in a particular way (in contrast to the GCG package). Files generated by
GCG can be read into EMBOSS applications. The default sequence output format in
EMBOSS is fasta, but almost any other supported format may be specified (in total, EMBOSS
supports 42 different sequence formats).
fasta format is a common, simple format for sequences, recognisable by a
sequence description line followed immediately by the sequence itself. The description
line starts with a “greater than” sign (>), which identifies it. Multiple sequences can be
stored in the same file, with each description line separating one sequence from the next.
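The structure just described is simple enough to read with a few lines of code. The following Python sketch is a minimal illustration, not the parser used by EMBOSS: it treats every line beginning with a "greater than" sign as a description line and joins the lines that follow into one sequence. The function name parse_fasta and the example sequences are made up for illustration.

```python
def parse_fasta(text):
    """Return a list of (description, sequence) pairs from fasta-formatted text."""
    records = []
    description = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new description line closes the previous record, if any.
            if description is not None:
                records.append((description, "".join(chunks)))
            description = line[1:].strip()
            chunks = []
        elif description is not None:
            chunks.append(line)  # sequence may span several lines
    if description is not None:
        records.append((description, "".join(chunks)))
    return records


# Made-up example data: two short sequences in one file.
EXAMPLE = """>seq1 first made-up example
ACGTACGT
ACGT
>seq2 second made-up example
TTGCA
"""

if __name__ == "__main__":
    for description, sequence in parse_fasta(EXAMPLE):
        print(description, len(sequence))
```

Because the description line marks where each record begins, the same loop handles a file holding one sequence or many.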
Other sequence formats are also common. raw, or plain, format consists, as
its name suggests, simply of the sequence on its own. Certain applications will only accept
this type of format, but it does not hold an ID or sequence name, and cannot be used in a
multiple sequence file as there is no indication of where the sequence starts or ends. GCG
format was devised for the GCG package, and has several lines of description, together
with sequence numbering. One of the features of a GCG format is a “checksum”. This is a
number relating directly to the sequence, and was implemented in the days when file
transfer was less reliable than today. The intention was to allow researchers to know that
they transferred an intact and correct sequence. With today's more reliable
networking, this has proved a hindrance in many cases, as the sequence cannot be manually
edited, without invalidating the stored checksum, before being input into another GCG
program or other software application.
Standard report formats are used for feature and alignment output: for example, gff format
for protein features and markx format for alignments. The versatility of EMBOSS means that
any sequence format can be read into an appropriate application, or may be changed to an
alternative format.
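To see why a checksum gets in the way of manual editing, it helps to compute one. The Python sketch below uses the weighting scheme commonly described for the GCG checksum: each character code is multiplied by a position weight that cycles from 1 to 57, and the total is reduced modulo 10000. This scheme is stated from general knowledge rather than from the text above, so it should be verified against GCG or EMBOSS output before being relied upon; the function name gcg_checksum is hypothetical.

```python
def gcg_checksum(sequence):
    """Return a GCG-style checksum for a sequence string.

    Assumed scheme: weight cycles 1..57 by position, characters are
    upper-cased, and the weighted sum is taken modulo 10000.
    """
    total = 0
    for position, residue in enumerate(sequence):
        weight = position % 57 + 1
        total += weight * ord(residue.upper())
    return total % 10000


if __name__ == "__main__":
    original = "ACGT"
    edited = "ACGA"  # a single manual edit at the last position
    print(gcg_checksum(original), gcg_checksum(edited))
```

Changing even one residue changes the weighted sum, so a file edited by hand carries a checksum that no longer matches its sequence unless the checksum is recalculated.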
In addition to allowing access to sequences in many formats stored in local files and
sequences stored in locally managed databases, EMBOSS is able to access sequences stored