PROMALS3D: Multiple Protein Sequence Alignment Enhanced with Evolutionary and Three-Dimensional Structural Information - Multiple Sequence Alignment Methods


2.1 Initial Clustering and Reducing Sequence Redundancy

For an input set of N0 sequences, PROMALS3D first rapidly clusters the sequences using the program CD-HIT [26] with a sequence identity cutoff of 95% (-c option) and an alignment coverage of 0.95 for the longer sequence (-aL option). This initial step results in N1 clusters (N1 ≤ N0) of highly similar sequences. Clusters with more than one sequence are each aligned quickly by MAFFT (with the --auto option) [27]. This step can significantly reduce computation for datasets with a large number of near-identical sequences. One target sequence is selected from each cluster. The N1 target sequences that remain after this initial filtering of highly similar sequences are subject to the further alignment steps described below.
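The redundancy filter can be sketched as two external commands plus a small bookkeeping step. The sketch below only builds the argument lists and does not run anything. The file names, the function names, and the rule of taking the first member of each cluster as its target are assumptions made for illustration; the text does not say how the target sequence is chosen.

```python
def cd_hit_command(fasta_in, out_prefix, identity=0.95, coverage_longer=0.95):
    """Build a CD-HIT call with the cutoffs quoted in the text.

    -c  : sequence identity cutoff
    -aL : alignment coverage for the longer sequence
    """
    return ["cd-hit", "-i", fasta_in, "-o", out_prefix,
            "-c", str(identity), "-aL", str(coverage_longer)]


def mafft_command(cluster_fasta):
    """Build a MAFFT call that lets MAFFT pick its own strategy."""
    return ["mafft", "--auto", cluster_fasta]


def split_clusters(clusters):
    """Pick one target per cluster and list the clusters that need aligning.

    clusters maps a cluster id to a list of sequence ids.
    Assumption: the first member is used as the target sequence.
    """
    targets = [members[0] for members in clusters.values()]
    to_align = {cid: m for cid, m in clusters.items() if len(m) > 1}
    return targets, to_align


if __name__ == "__main__":
    print(" ".join(cd_hit_command("input.fasta", "nr95")))
    print(" ".join(mafft_command("cluster_0.fasta")))
    print(split_clusters({"c0": ["s1", "s2"], "c1": ["s3"]}))
```

Each argument list could be passed to subprocess.run if the two programs are installed. Only clusters with more than one member are sent to MAFFT, which is where the savings for near-identical inputs come from.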

2.2 Dividing Target Sequences to Groups and Obtaining Pre-aligned Groups

1. PROMALS3D divides the N1 nonredundant target sequences into a set of N2 groups (N2 ≤ N1) and aligns each group without information from sequence and structure databases. Two methods are used to obtain the groups. If N1 is no more than 200, PROMALS3D uses the UPGMA method to build a tree based on a crude measure of distances (k-mer counting) [12] among the sequences. Given a distance cutoff (-id_thr option, default: 0.6), the tree is divided into a set of subtrees, and the sequences in each subtree form a group [28]. If the number of groups formed exceeds the maximum number of groups set by PROMALS3D (-max_group_number option, default: 60), PROMALS3D automatically adjusts the distance cutoff so that the number of groups formed equals the maximum allowed.
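The UPGMA grouping with a cutoff and a cap on the number of groups can be illustrated with a minimal pure-Python sketch. It is not the PROMALS3D implementation. The k-mer distance below (one minus the fraction of shared 3-mers) is a stand-in for the k-mer counting measure of [12], the cutoff is applied directly to the average distance between groups, and all function and parameter names are hypothetical. Merging past the cutoff until the cap is met plays the role of raising the cutoff automatically.

```python
from itertools import combinations


def kmer_distance(a, b, k=3):
    """Crude distance: 1 - fraction of shared k-mers (illustrative stand-in)."""
    ka = {a[i:i + k] for i in range(len(a) - k + 1)}
    kb = {b[i:i + k] for i in range(len(b) - k + 1)}
    if not ka or not kb:
        return 1.0
    return 1.0 - len(ka & kb) / min(len(ka), len(kb))


def upgma_groups(seqs, cutoff=0.6, max_groups=60):
    """Average-linkage (UPGMA) merging, stopped at a distance cutoff.

    Merging continues past the cutoff while there are more than
    max_groups groups, which is equivalent to relaxing the cutoff.
    Returns a list of groups, each a list of indices into seqs.
    """
    n = len(seqs)
    dist = {(i, j): kmer_distance(seqs[i], seqs[j])
            for i, j in combinations(range(n), 2)}
    groups = [[i] for i in range(n)]

    def average(g, h):
        total = sum(dist[min(i, j), max(i, j)] for i in g for j in h)
        return total / (len(g) * len(h))

    while len(groups) > 1:
        # Find the closest pair of groups (a < b by construction).
        d, a, b = min((average(groups[a], groups[b]), a, b)
                      for a, b in combinations(range(len(groups)), 2))
        if d > cutoff and len(groups) <= max_groups:
            break
        groups[a] = groups[a] + groups[b]
        del groups[b]
    return groups


if __name__ == "__main__":
    seqs = ["ACDEFGHIK", "ACDEFGHIL", "WWWWYYYYPP", "WWWWYYYYPQ"]
    print(upgma_groups(seqs))                # two groups of similar sequences
    print(upgma_groups(seqs, max_groups=1))  # cap forces a single group
```

The sketch recomputes average distances at every step for clarity, which is acceptable for the at most 200 sequences this branch handles but is not how an efficient implementation would do it.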

2. We observed that the UPGMA method for deducing groups can produce one or more very large groups when the input dataset is large (e.g., thousands of sequences). These large groups may not be properly aligned without additional information. Thus, when the number of target sequences is more than 200, we use a different method based on K-center clustering instead of UPGMA to divide the target sequences into groups. Our K-center approach does not allow any group to have more than 200 sequences. The method begins by randomly selecting K target sequences as the centers of K groups. It then iterates the following two steps. Step (1) assigns each target sequence to the group whose center is closest to it among all group centers. Step (2) updates the center of each group by selecting the target sequence with the minimum sum of distances to the other target sequences in the same group. Our modification of the K-center method to control the maximum group size is that any group with 200 target sequences does not accept new members during Step (1).
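The size-capped K-center procedure can be sketched as follows. This is an illustration under stated assumptions, not the PROMALS3D code: the text does not specify how K is chosen, how many iterations are run, when iteration stops, or in what order sequences are assigned, so the parameters k, iterations, and seed, the index-order assignment, and the stop-on-unchanged-centers rule are all hypothetical choices.

```python
import random


def k_center_groups(dist, n, k, max_size=200, iterations=10, seed=0):
    """Size-capped K-center clustering over items 0..n-1.

    dist(i, j) returns the distance between items i and j.
    Returns a list of k groups, each a list of item indices.
    """
    if k * max_size < n:
        raise ValueError("k * max_size must be at least n")
    rng = random.Random(seed)
    centers = rng.sample(range(n), k)  # random initial centers
    groups = []
    for _ in range(iterations):
        groups = [[c] for c in centers]  # each center seeds its own group
        center_set = set(centers)
        for i in range(n):
            if i in center_set:
                continue
            # Step (1): nearest center, skipping groups that are already full.
            open_groups = [g for g in range(k) if len(groups[g]) < max_size]
            best = min(open_groups, key=lambda g: dist(i, centers[g]))
            groups[best].append(i)
        # Step (2): new center = member with the minimum sum of distances
        # to the other members of its group.
        new_centers = [min(g, key=lambda m: sum(dist(m, o) for o in g))
                       for g in groups]
        if new_centers == centers:
            break  # assumed stopping rule: centers no longer change
        centers = new_centers
    return groups


if __name__ == "__main__":
    coords = [0, 1, 2, 3, 100, 101, 102, 103]
    groups = k_center_groups(lambda i, j: abs(coords[i] - coords[j]),
                             n=len(coords), k=4, max_size=2)
    print(groups)
```

Because a full group is skipped in Step (1), a sequence may end up in a group whose center is not its nearest one. That is the price of the size cap, and it is why the total capacity k * max_size has to cover all n sequences.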

3. After dividing the target sequences into N2 groups, each group is aligned, resulting in N2 pre-aligned groups. We have previously
