the Baum-Welch algorithm capable of recovering the HMM parameters for (a) short
sequences? (b) long sequences?
Exercise 9.17. Repeat Exercise 9.16, this time using the Baum-Welch algorithm for
the CpG Islands application in CpG Educate (a template CpG_HMM.csv to save and
upload the HMM parameters from a file is available for download from the volume's
website).
9.4.4 Post-Processing
The decoding methods described in this section are purely mathematical, and they
may not produce completely accurate results when applied to CGIs. Neither the Viterbi
nor the posterior decoding method imposes restrictions on the length of the identified
islands or checks whether biologically important conditions, such as high % C+G
content or a high O/E CpG ratio (see Section 2), are met. When HMMs are used
for CGI identification, these properties are considered during the post-processing
stage. At this stage we turn back to the genomic properties of the CGIs
that have not been modeled by the HMM. This stage usually includes performing one
or more of the following refinements:
- Combine CGIs separated by short gaps: Neighboring CGIs that are separated by
small gaps of non-island regions are merged into a single larger island. A minimal
distance threshold between islands is set in advance, and neighboring islands
closer than this threshold value are merged. The threshold values reported in the
literature vary from about 15-20 [27] up to 100 [8].
- Check for minimal % C+G content and O/E CpG ratio: Check whether
the islands identified by the decoding methods meet the biologically relevant
thresholds for % C+G content and O/E CpG ratio described in Section 2. If an
identified CGI does not meet those threshold requirements, its states are
relabeled as non-island.
- Check for minimal length: As discussed in Section 2, short sequences labeled as
CGIs are not of biological interest. Different length thresholds are used in
the literature, but they are usually in the range 140-500 bp [8, 27]. If the length
of a predicted CpG island is less than the threshold value, its states are
relabeled as non-island.
Post-processing is then applied to filter out the regions that do not meet the biological
criteria for CGIs under the chosen cutoffs.
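The three refinements listed in this section can be sketched in Python as follows. This is
a minimal illustration under stated assumptions, not the procedure used in [8] or [27]:
islands are assumed to be half-open (start, end) index pairs into the sequence, the O/E CpG
ratio is assumed to follow the commonly used definition (number of CpG x length) /
(number of C x number of G), and the function names, the order of the steps, and the
default cutoffs (max_gap, min_len, min_gc, min_oe) are illustrative choices rather than
values fixed by the text.

```python
def merge_close(islands, max_gap):
    """Merge islands whose separating gap is shorter than max_gap bases."""
    merged = []
    for start, end in sorted(islands):
        if merged and start - merged[-1][1] < max_gap:
            # Gap to the previous island is below the threshold: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def gc_content(seq):
    """Fraction of C and G bases in seq (0.0 for an empty sequence)."""
    if not seq:
        return 0.0
    return (seq.count("C") + seq.count("G")) / len(seq)


def obs_exp_cpg(seq):
    """Observed/expected CpG ratio, assuming the definition in the lead-in."""
    c, g = seq.count("C"), seq.count("G")
    if c == 0 or g == 0:
        return 0.0
    return seq.count("CG") * len(seq) / (c * g)


def post_process(seq, islands, max_gap=100, min_len=200,
                 min_gc=0.5, min_oe=0.6):
    """Merge nearby islands, then keep only those passing all cutoffs.

    The default cutoffs are hypothetical placeholders; islands that fail
    any check are dropped, i.e. treated as non-island.
    """
    seq = seq.upper()
    kept = []
    for start, end in merge_close(islands, max_gap):
        region = seq[start:end]
        if (end - start >= min_len
                and gc_content(region) >= min_gc
                and obs_exp_cpg(region) >= min_oe):
            kept.append((start, end))
    return kept
```

For example, given the island intervals read off a Viterbi or posterior decoding, a call
such as post_process(seq, islands, max_gap=20, min_len=140) would apply the smaller
cutoffs from the ranges quoted above. Merging is done first in this sketch so that the
composition and length checks are applied to the combined islands.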
Example 9.10.
In [27] the authors use the general HMM with states Q = {A+, A−, C+, C−, T+, T−, G+, G−}
and emission symbols M = {A, C, T, G} that we considered earlier, with no restrictions
on the transition or the emission matrices, training it on a set of 1000 sequences
from the database embl173hum of all
 