2.1 Initial Clustering and Reducing Sequence Redundancy
For an input set of N0 sequences, PROMALS3D first rapidly
clusters the sequences with the program CD-HIT [26], using a
sequence identity cutoff of 95% (-c option) and an alignment
coverage of 0.95 for the longer sequence (-aL option). This
initial step yields N1 clusters (N1 ≤ N0) of highly similar
sequences. Each cluster with more than one sequence is aligned
quickly by MAFFT (with the --auto option) [27]. This step can
significantly reduce computation for datasets with a large number
of near-identical sequences. One target sequence is selected from
each cluster. The N1 target sequences that remain after this
filtering of highly similar sequences are subject to the further
alignment steps described below.
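The redundancy-reduction step can be sketched in Python as follows. This is a minimal illustration, not PROMALS3D's own code: it only builds the two command lines from the options quoted above and parses the text of a CD-HIT .clstr file. The file names are hypothetical, -i and -o are CD-HIT's usual input and output options, and taking CD-HIT's representative (the line marked with *) as the target sequence is an assumption, since the text does not say how the target is chosen.

```python
import re


def cdhit_command(fasta_in, out_prefix, identity=0.95, coverage=0.95):
    """Build a CD-HIT argument list with the two cutoffs quoted in the text."""
    return ["cd-hit", "-i", fasta_in, "-o", out_prefix,
            "-c", str(identity), "-aL", str(coverage)]


def mafft_command(cluster_fasta):
    """Build a MAFFT argument list; MAFFT writes the alignment to stdout."""
    return ["mafft", "--auto", cluster_fasta]


def parse_clstr(text):
    """Parse CD-HIT .clstr text into a list of (representative, members)."""
    clusters = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">Cluster"):
            clusters.append([None, []])
            continue
        match = re.search(r">(\S+?)\.\.\.", line)
        if match is None or not clusters:
            continue
        name = match.group(1)
        clusters[-1][1].append(name)
        if line.endswith("*"):  # CD-HIT marks the representative with '*'
            clusters[-1][0] = name
    return [(rep, members) for rep, members in clusters]


# Hypothetical .clstr content: N0 = 3 input sequences, N1 = 2 clusters.
sample = (
    ">Cluster 0\n"
    "0\t120aa, >seqA... *\n"
    "1\t119aa, >seqB... at 97.50%\n"
    ">Cluster 1\n"
    "0\t98aa, >seqC... *\n"
)
clusters = parse_clstr(sample)
# Assumption: the CD-HIT representative serves as the target sequence.
targets = [rep for rep, _ in clusters]
# Only clusters with more than one member need a MAFFT alignment.
to_align = [members for _, members in clusters if len(members) > 1]
```

Here the three input sequences collapse to two targets, and only the first cluster would be passed to MAFFT.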
2.2 Dividing Target Sequences to Groups and Obtaining Pre-aligned Groups
1. PROMALS3D divides the N1 nonredundant target sequences
into a set of N2 groups (N2 ≤ N1) and aligns each group
without information from sequence and structure databases.
Two methods are used to obtain the groups. If N1 is no more
than 200, PROMALS3D uses the UPGMA method to build a
tree based on a crude measure of distances (k-mer counting)
[12] among the sequences. Given a distance cutoff (-id_thr
option, default: 0.6), the tree is divided into a set of subtrees,
and the sequences in each subtree form a group [28]. If the
number of formed groups is larger than the maximum number
of groups set by PROMALS3D (-max_group_number option,
default: 60), PROMALS3D automatically adjusts the distance
cutoff so that the number of formed groups is the same as the
maximum number of groups allowed.
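The UPGMA-based grouping above can be sketched in Python as follows. This is an illustration under stated assumptions, not PROMALS3D's implementation: the k-mer distance (1 minus the Jaccard similarity of the k-mer sets) is a stand-in for the unspecified "crude measure", the cutoff is applied directly to the average-linkage distance, and how -id_thr maps onto that distance is not stated in the text. Raising the cutoff until exactly the maximum number of groups remains is modelled by continuing to merge the closest groups.

```python
from itertools import combinations


def kmer_distance(a, b, k=3):
    """Crude distance: 1 - Jaccard similarity of k-mer sets (an assumption)."""
    ka = {a[i:i + k] for i in range(len(a) - k + 1)}
    kb = {b[i:i + k] for i in range(len(b) - k + 1)}
    if not ka or not kb:
        return 1.0
    return 1.0 - len(ka & kb) / len(ka | kb)


def upgma_groups(seqs, cutoff, max_groups, k=3):
    """Average-linkage (UPGMA) merging, cut at `cutoff`.

    Merging stops once the closest pair of groups is farther apart than
    `cutoff`, unless more than `max_groups` groups remain; in that case
    merging continues, which mimics raising the cutoff.
    """
    n = len(seqs)
    dist = {frozenset((i, j)): kmer_distance(seqs[i], seqs[j], k)
            for i, j in combinations(range(n), 2)}
    groups = [[i] for i in range(n)]

    def avg(g, h):
        # UPGMA distance: mean over all cross-group sequence pairs.
        total = sum(dist[frozenset((i, j))] for i in g for j in h)
        return total / (len(g) * len(h))

    while len(groups) > 1:
        x, y = min(combinations(range(len(groups)), 2),
                   key=lambda p: avg(groups[p[0]], groups[p[1]]))
        if avg(groups[x], groups[y]) > cutoff and len(groups) <= max_groups:
            break
        merged = groups[x] + groups[y]
        groups = [g for idx, g in enumerate(groups) if idx not in (x, y)]
        groups.append(merged)
    return sorted(sorted(g) for g in groups)


# Toy input (hypothetical sequences), returned as groups of indices.
toy = ["AAAAAA", "AAAAAT", "CCCCCC", "CCCCCG", "GTGTGT"]
by_cutoff = upgma_groups(toy, cutoff=0.6, max_groups=60)
capped = upgma_groups(toy, cutoff=0.6, max_groups=2)
```

The sketch recomputes average distances on every pass, which is acceptable for a toy input but far slower than a real UPGMA implementation.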
2. We observed that the UPGMA method for deducing groups
can produce one or more very large groups when the input
dataset is large (e.g., thousands of sequences). These large
groups may not be properly aligned without additional
information. Thus, when the number of target sequences is more
than 200, we use a method based on K-center clustering instead
of UPGMA to divide the target sequences into groups. Our
K-center approach does not allow any group to have more than
200 sequences. The method begins by randomly selecting K
target sequences as the centers of K groups. It then iterates
the following two steps. Step (1) assigns each target sequence
to the group whose center is closest to it among all the group
centers. Step (2) updates the center of each group by selecting
the target sequence with the minimum sum of distances to the
other target sequences in the same group. To control the
maximum size of any group, we modify the K-center method so
that any group with 200 target sequences does not accept new
members during Step (1).
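The size-capped K-center procedure can be sketched in Python as follows. This is a minimal sketch, not the authors' code: the function name, the precomputed distance matrix, the order in which sequences are assigned, the fixed seed, and the iteration limit are all assumptions, since the text specifies only the two steps and the rule that a full group accepts no new members.

```python
import random


def k_center_groups(dist, k, max_size=200, n_iter=10, seed=0):
    """Size-capped K-center clustering on a square distance matrix."""
    n = len(dist)
    if k * max_size < n:
        raise ValueError("k groups of max_size cannot hold all sequences")
    rng = random.Random(seed)
    centers = rng.sample(range(n), k)  # random initial centers
    groups = []
    for _ in range(n_iter):
        # Step (1): assign each sequence to its nearest center,
        # skipping groups that already hold max_size members.
        groups = [[] for _ in range(k)]
        for i in range(n):
            open_groups = [g for g in range(k) if len(groups[g]) < max_size]
            best = min(open_groups, key=lambda g: dist[i][centers[g]])
            groups[best].append(i)
        # Step (2): the new center of a group is the member with the
        # smallest sum of distances to the other members.
        new_centers = [
            min(members, key=lambda c: sum(dist[c][j] for j in members))
            if members else centers[g]
            for g, members in enumerate(groups)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return centers, groups


# Toy example: six points on a line, at most three members per group.
points = [0, 1, 2, 10, 11, 12]
dist = [[abs(a - b) for b in points] for a in points]
centers, groups = k_center_groups(dist, k=2, max_size=3)
```

Because sequences are assigned in input order, a sequence that arrives after its nearest group is full is placed in the next-nearest open group; a different processing order could give a different partition.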
3. After the target sequences are divided into N2 groups, each
group is aligned, resulting in N2 pre-aligned groups. We have previously