Contig mapping and analysis (Bioinformatics)

The creation of ordered sets of overlapping clones, or “contigs”, has historically been the goal of chromosome walks in gene hunting and, more recently, of efforts to provide tiling paths of clones for whole genome sequencing. Various methods have been used to establish clone overlaps, including simple cross hybridization. In radiation hybrid mapping, the overlaps between the propagated large genomic fragments are detected through shared Sequence Tagged Sites (STSs) identified by PCR amplification (Cox et al., 1990). Currently, the most common way to establish clone overlaps is through shared restriction fragments.

Restriction enzyme fingerprint mapping has undergone a number of iterations in its development. However, the basic concept remains the same. A set of clone fragments is derived by cutting the clone with a restriction endonuclease that recognizes a very short (4 or 6 bp) nucleotide sequence. The fragments from each clone are then sized using an electrophoresis technique. Clones that overlap will share a number of similarly sized fragments. The earliest methods used radioactive labeling and acrylamide gels (Nathans, 1979). Using an essentially similar method, Coulson et al. (1986) generated a preliminary version of the Caenorhabditis elegans physical map by using cosmid clone libraries. The contiguity of this map was later substantially improved through the addition of YAC clones to the contigs (Coulson et al., 1988). The larger insert size of YAC clones allowed them to bridge gaps in the coverage of the cosmid clone map. The relatively small size of cosmids, approximately 30-40 kb, remained a limitation in the ability of such fingerprinting techniques to map large plant and mammalian genomes until the development of Bacterial Artificial Chromosomes (BACs) with a much larger insert size of between 180 and 200 kb (see Article 3, Hierarchical, ordered mapped large insert clone shotgun sequencing, Volume 3). Other improvements, such as the use of agarose instead of acrylamide gels, also made the approach amenable to high-throughput data generation, and it has been used to create physical maps for a number of mammalian genomes including human (McPherson et al., 2001), mouse (Gregory et al., 2002), and rat (Krzywinski et al., 2004). In this chapter, we will discuss the methods and approaches used in contiguating clones on the basis of restriction fragment fingerprint information.

A genomic library consists of a set of overlapping clones representing coverage of the whole genome. By identifying common overlap regions between clones, clones that have adjacent positions in the complete genome may be found and joined together in a common contig. The level of redundancy in the genome coverage of a library is often quoted as “X coverage”. For example, in a 4X genomic library, any section of DNA is represented on average in four different clones – see Figure 1. The average coverage can be calculated by simply multiplying the average insert size of the library by the number of clones in the library and dividing by an estimate of the genome size. Since short clone overlaps are difficult to reliably detect, the joining of the clones is more likely to be reliable with a high redundancy of clone coverage, which will also reduce the number of gaps in clone coverage.
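The coverage calculation described above can be sketched in a few lines of Python. The function name and the library figures are illustrative assumptions, not values from any particular project.

```python
def library_coverage(num_clones: int, mean_insert_bp: float, genome_size_bp: float) -> float:
    """Average fold coverage ("X coverage") of a clone library.

    coverage = (number of clones * average insert size) / estimated genome size
    """
    if genome_size_bp <= 0:
        raise ValueError("genome size must be positive")
    return num_clones * mean_insert_bp / genome_size_bp


# Hypothetical library: 60,000 clones averaging 200 kb over a 3 Gb genome.
coverage = library_coverage(60_000, 200_000, 3_000_000_000)
print(f"{coverage:.1f}X")  # prints "4.0X"
```

Because the genome size is itself an estimate, the result is best read as an approximate average; local coverage varies from region to region.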

Figure 1 A representation of clones in a genomic library with 4X coverage. Each colored line represents a different copy of the original genomic DNA. Each fragment in each line represents an individual clone. Overlaps between clones can be used to identify the two clones as neighbors

To generate a restriction enzyme fingerprint map, the clone insert is cut by a restriction endonuclease that recognizes a short, specific nucleotide sequence. This breaks the clone into a series of fragments, the largest of which are around 30 kb in length. By separating the clone fragments using an electrophoresis gel, it is possible to size each of the fragments. This record of the sizes of fragments represents the fingerprint of the clone. Owing to limitations of gel electrophoresis, fragments that are less than 600 bases in length are not recovered. By fingerprinting all of the clones using the same restriction enzyme, clones with similar fingerprints, that is, those containing similarly sized fragments, can be identified, and these are likely to overlap. The probability that two clones overlap increases as a function of the number of bands their fingerprints have in common. The Sulston score (Sulston et al., 1988) defines the probability that M or more bands matching between two fingerprints do so by chance in the following way:

$$S = \sum_{m=M}^{\mathit{nbandl}} \binom{\mathit{nbandl}}{m} (1-p)^{m} \, p^{\,\mathit{nbandl}-m} \qquad (1)$$

where nbandl is the number of bands in the clone with the fewer bands, nbandh is the number of bands in the clone with more bands and

$$p = (1-b)^{\mathit{nbandh}} \qquad (2)$$

The tolerance is a measure of the experimental error in determining the position of the band (and hence the size of the fragment). The equation assumes that all bands are equally likely. Hence, the probability of a single band in the gel with fewer bands matching one band in the gel with more bands is given by

$$b = \frac{2 \times \mathit{tolerance}}{\mathit{gellen}}$$

where gellen is the length of the gel measured in the same units as the tolerance. This leads to equation (2), which is the probability that a single band in the fingerprint with fewer bands matches no bands in the other fingerprint. Since the number of matches between the fingerprints will follow a binomial distribution, the extension of equation (2) for all bands leads to the Sulston score, equation (1).
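As a rough illustration of the derivation, the sketch below computes a Sulston-style score in Python. It assumes the commonly quoted form of the score: a single-band match probability b = 2 × tolerance / gel length, a no-match probability p = (1 − b)^nbandh, and a binomial tail summed from the observed number of matches up to nbandl. The function name, parameter names, and example values are hypothetical.

```python
from math import comb


def sulston_score(n_low: int, n_high: int, min_matches: int,
                  tolerance: float, gel_length: float) -> float:
    """Probability that at least `min_matches` bands match by chance.

    n_low  -- number of bands in the fingerprint with fewer bands (nbandl)
    n_high -- number of bands in the fingerprint with more bands (nbandh)
    tolerance and gel_length must be in the same units (assumption).
    """
    if not 0 <= min_matches <= n_low <= n_high:
        raise ValueError("expected 0 <= min_matches <= n_low <= n_high")
    b = 2.0 * tolerance / gel_length  # chance that one band matches one given band
    if not 0.0 <= b <= 1.0:
        raise ValueError("2 * tolerance must not exceed gel_length")
    p = (1.0 - b) ** n_high           # chance that one band matches no band at all
    # Binomial tail: m of the n_low bands find a match, the rest do not.
    return sum(comb(n_low, m) * (1.0 - p) ** m * p ** (n_low - m)
               for m in range(min_matches, n_low + 1))


# Illustrative values only: 20 and 25 bands, 15 shared, tolerance 7 on a gel of length 3300.
print(sulston_score(20, 25, 15, tolerance=7, gel_length=3300))
```

A smaller score means the observed matches are less likely to be coincidental, so a pair of clones is treated as overlapping when the score falls below a chosen cutoff.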

The Sulston score forms the basis for the FPC (FingerPrinted Contigs) software (Soderlund et al., 1997). Figure 2 shows a screenshot from the FPC software. This software is used to manage the process of creation of clone contigs. FPC performs an automated fast assembly of the clones producing a set of ordered and “buried” clones. A buried clone is one that contains no identifiable unique fragments and whose entire sequence may be wholly represented by another clone. In this case, the clone is “buried” in the clone that contains it. Clones that contain at least one unique fragment cannot be buried and are referred to as canonical clones (Figure 3).

The FPC algorithm works by first calculating the Sulston score for each pair of clones, burying clones that are an exact or a close match to another clone, and then building a consensus band map (CB map). The CB map is initially built using a hybrid greedy/stochastic algorithm to identify adjacent clones, which is then greedily extended.
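The burying step can be pictured with a simplified check: a clone is a candidate for burying if every one of its bands has a partner, within the tolerance, in another clone. The Python sketch below is an illustration under that assumption only; it is not FPC's actual implementation, and the function names and band values are hypothetical.

```python
def count_shared_bands(bands_a, bands_b, tolerance):
    """Greedy one-to-one pairing of band positions that agree within `tolerance`."""
    a, b = sorted(bands_a), sorted(bands_b)
    i = j = shared = 0
    while i < len(a) and j < len(b):
        diff = a[i] - b[j]
        if abs(diff) <= tolerance:
            shared += 1       # pair these two bands and move past both
            i += 1
            j += 1
        elif diff < 0:
            i += 1            # a[i] is too small to match any remaining band in b
        else:
            j += 1            # b[j] is too small to match any remaining band in a
    return shared


def is_buried(candidate, parent, tolerance):
    """True if every band of `candidate` is matched by a band of `parent`."""
    return count_shared_bands(candidate, parent, tolerance) == len(candidate)


parent = [100, 250, 400, 620, 900]
print(is_buried([102, 398, 905], parent, tolerance=7))  # True
print(is_buried([102, 500], parent, tolerance=7))       # False
```

A production tool would typically tolerate a small fraction of unmatched bands to allow for band-calling errors; the strict all-bands rule here is chosen only to keep the idea visible.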

Unfortunately, both the experimental and the computational methods for identifying and sizing clone fragments are subject to error. Therefore, it is not possible to create a perfectly ordered CB map, and the CB map created by FPC must be refined by manual methods. A common problem arises when a band is missed (undercalled) or a band is identified when none is present (overcalled). This problem is particularly evident when bands are so close together that they cannot be resolved as separate bands on the gel, but it may also be due to faint bands. A human map analyst can use FPC to review gel images and reorder clones by correcting for overcalled or undercalled bands to create the most likely clone order. This is a manually intensive task and is the bottleneck in the production of high-quality maps. Correcting for these artifacts in an automated manner is an active area of research and requires either improving the data quality (Fuhrmann et al., 2003) or improving the ordering and burying algorithm (Flibotte et al., 2004). For the former, BandLeader analyses the pixel values of a scanned gel image to identify bands. By assuming a Gaussian signal for a band, it is able to improve the resolution of the image using standard signal processing techniques. For the latter, CORAL uses a machine-learning algorithm to improve the burying and ordering of an FPC map.

Figure 2 A screenshot from the FPC software. The clone names (e.g., H005D09) run across the page near the top of the figure and the fragment mobilities (related to fragment size) are on the left-hand side of the figure. The scanned gel images for each clone run down the page. On either side of the scanned image are horizontal bars that show where the band-calling software has identified a band. The figure shows examples of faint bands that have not been called by the software (undercall) and thick bands where the number of fragments composing the band is unclear and may result in too many bands being identified (overcall)

Figure 3 The different clones in Figure 1 have now been joined together to form a map contig. Dotted clones, whose sequence is fully represented by another clone, are termed buried clones. Clones that cannot be buried (solid line and alternating dash/dot) are called canonical clones. From the canonical clones, a minimal set of tiling path clones has been selected (solid line) that represents all of the genomic DNA covered by this map contig with little or no redundancy. Note that there are alternative clones that could have been selected when choosing clones for the tiling path

Another common problem is the creation of chimeric clusters that contain clones from different parts of the genome. These clones have been placed together because they have high Sulston scores by chance. FPC provides a framework and the necessary software tools for the map analyst to manipulate the clone order, change buried clones, and tease apart chimeric clusters.

Further difficulties are caused by genomes that contain a large number of repeat regions, such as plant genomes. Repeat regions tend to be compressed in fingerprint-based physical maps: if the repeat contains the enzyme site, the repeat region causes the overrepresentation of a band in the fingerprint, whereas if the enzyme site is absent, the region yields a large fragment that is not cut by the enzyme (Chen et al., 2002). There have been some attempts to improve the ordering and burying provided by the raw Sulston score. Soderlund and coworkers incorporated marker data into the original FPC software (Soderlund et al., 2000). The markers (for example, based on STS data) were used in conjunction with the fingerprint data to enhance the score. If two clones share markers, the cutoff of the Sulston score for the clones to be related is lowered by an order of magnitude for each marker they share. A further improvement to FPC has been the creation of a parallel version that distributes the calculation of the Sulston score for all pairs of clones across the nodes of a cluster (Ness et al., 2002). This reduces the time needed to create an FPC map, making it feasible to optimize build parameters by running multiple FPC builds.

Software available for viewing an FPC map includes WebFPC (available from http://www.genome.arizona.edu/software/fpc) and Internet Contig Explorer (Fjell et al., 2003). Further tools allow virtual fingerprints to be created for sequenced clones for addition to the map and the extraction of a minimal tiling set from the map (Engler et al., 2003).

As described above, the final set of clones will consist of a number of buried clones and an ordered set of overlapping, canonical clones. Depending on the depth of sampling of the genomic library, several clones may represent any one region of the genome. By identifying a minimum tiling path for the clones (see Figure 3), a set of canonical clones that cover the entire genome with minimal overlap may be extracted. The minimum tiling path may be used as a starting point for sequencing the genome, providing a set of clones that have the same coverage of the genome, but with a lower degree of redundancy than the original library, thereby reducing the sequencing cost. Clone tiling paths are also useful in the generation of comparative genomic hybridization (CGH) arrays (see Article 23, Comparative genomic hybridization, Volume 1) and for fluorescent in-situ hybridization (FISH) experiments (see Article 22, FISH, Volume 1).
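One way to picture tiling path selection is as an interval-covering problem. The Python sketch below assumes each canonical clone has already been assigned start and end coordinates within a single contig (in arbitrary map units) and applies a standard greedy rule: among the clones that begin at or before the point covered so far, pick the one that reaches furthest. The function name, clone names, and coordinates are hypothetical, and real tools work from fingerprint overlaps rather than exact coordinates.

```python
def minimal_tiling_path(clones):
    """Greedy minimum cover of a contig.

    clones -- dict mapping clone name to (start, end) in map units.
    Returns clone names in map order; raises ValueError if coverage has a gap.
    """
    ordered = sorted(clones.items(), key=lambda item: item[1][0])
    if not ordered:
        return []
    contig_end = max(end for _, (_, end) in ordered)
    covered_to = ordered[0][1][0]      # leftmost start of the contig
    path, i = [], 0
    while covered_to < contig_end:
        best_name, best_end = None, covered_to
        # Consider every clone that starts within the region already covered.
        while i < len(ordered) and ordered[i][1][0] <= covered_to:
            name, (_, end) = ordered[i]
            if end > best_end:
                best_name, best_end = name, end
            i += 1
        if best_name is None:
            raise ValueError("gap in clone coverage")
        path.append(best_name)
        covered_to = best_end
    return path


clones = {"A": (0, 100), "B": (20, 90), "C": (60, 180),
          "D": (90, 170), "E": (150, 260), "F": (175, 250)}
print(minimal_tiling_path(clones))  # ['A', 'C', 'E']
```

In practice a minimum overlap between consecutive clones is usually required so that the joins can be confirmed, which means a real tiling path may retain slightly more redundancy than this strict minimum.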

More data can be provided by end sequencing of the clones. In this process, a sequence read is generated from each end of the clone. These reads may be used to help align the physical map with an assembled genome sequence. By aligning the end reads against the sequence, for example by using the BLAST sequence alignment program, the clone’s position in the genome sequence may be defined (Engler et al., 2003). By tying the map assembly to the sequence assembly, it is possible to uncover misassemblies in both the sequence and the map and to identify clone resources that can be used to fill sequence gaps.

This verification can be performed at several levels. Large-scale rearrangements will cause the clone order determined from the fingerprints to differ from that specified by the sequence assembly. The insert size of the clone (i.e., the length of DNA sequence that was inserted into the BAC or YAC) can be determined by summing the sizes of fragments in the fingerprint. This insert size will be an underestimate since small fragments (less than 600 bp) cannot be measured.

The insert size can be calculated for each clone individually or alternatively an average insert size can be calculated for the library (insert sizes for high-quality genomic libraries follow a pseudonormal distribution). Since reads from each end of the clone were taken, the sequence assembly can be used to calculate the insert size and this may be compared to the known insert size or average for discrepancies. The Arachne software is capable of incorporating this information (Batzoglou et al., 2002).
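The comparison between the fingerprint-derived insert size and the size implied by the sequence assembly can be sketched as follows. This is a minimal Python illustration; the function names, the 10% threshold, and the example numbers are assumptions chosen for the example, and vector fragments are ignored.

```python
def insert_size_from_fingerprint(fragment_sizes_bp):
    """Sum of sized fragments; an underestimate because small fragments are lost."""
    return sum(fragment_sizes_bp)


def insert_size_from_end_reads(left_read_start, right_read_end):
    """Span between the outer ends of the two end reads placed on the assembly."""
    return right_read_end - left_read_start + 1


def sizes_disagree(fingerprint_bp, assembly_bp, max_relative_diff=0.10):
    """Flag a clone whose two size estimates differ by more than the threshold."""
    return abs(fingerprint_bp - assembly_bp) > max_relative_diff * assembly_bp


# Hypothetical clone: fragment sizes in bp and end-read placements on the assembly.
fragments = [30_000, 24_500, 21_000, 18_200, 15_800, 12_400,
             11_000, 9_600, 8_300, 7_100, 5_900, 4_200]
fp_size = insert_size_from_fingerprint(fragments)        # 168,000 bp
asm_size = insert_size_from_end_reads(1_250_001, 1_420_000)  # 170,000 bp
print(fp_size, asm_size, sizes_disagree(fp_size, asm_size))
```

A clone flagged by such a check is only a candidate for closer inspection, since either the map or the sequence assembly could be at fault.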

Resolution at a higher level may be achieved by digesting the assembled sequence in silico to create virtual fingerprints (i.e., finding cut sites in the assembled sequence that match the recognition site of the enzyme used to create the real fingerprints). The in silico fingerprints from the sequence assembly may be compared to the real fingerprints to identify discrepancies. The resolution of this type of analysis may be extended even further by creating additional fingerprints of the clones cut with different enzymes. See Article 8, Genome maps and their use in sequence assembly, Volume 7 for more details on how map data may be used to improve sequence assemblies.
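An in silico digest can be sketched in Python as a search for every occurrence of the recognition site, followed by measuring the distances between cut positions. The example below uses the HindIII site (AAGCTT, cut after the first A) purely as an illustration; the enzyme choice, the function name, and the treatment of the sequence as a linear molecule without vector are assumptions.

```python
def virtual_fingerprint(sequence, site="AAGCTT", cut_offset=1, min_fragment_bp=600):
    """Fragment sizes from a simulated digest of a linear sequence.

    site            -- recognition sequence (palindromic, so one strand suffices)
    cut_offset      -- cut position within the site, counted from its first base
    min_fragment_bp -- fragments shorter than this are dropped, mimicking the
                       loss of small fragments on a gel
    """
    seq = sequence.upper()
    cuts = []
    pos = seq.find(site)
    while pos != -1:
        cuts.append(pos + cut_offset)
        pos = seq.find(site, pos + 1)
    boundaries = [0] + cuts + [len(seq)]
    sizes = [right - left for left, right in zip(boundaries, boundaries[1:])]
    return sorted(size for size in sizes if size >= min_fragment_bp)


# Toy sequence with two sites; the size filter is relaxed so the result is visible.
toy = "CC" + "AAGCTT" + "GGGG" + "AAGCTT" + "T"
print(virtual_fingerprint(toy, min_fragment_bp=1))  # [3, 6, 10]
```

The resulting size list can then be compared with the experimentally measured fingerprint of the same clone, allowing for the sizing tolerance of the gel.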

There are a number of ongoing efforts to apply fluorescent labeling in conjunction with capillary electrophoresis, using a DNA analyzer to size fragments automatically. The latest methods extend this technique by performing multiple digests using different enzymes and either labeling the ends of the fragments to identify fragments cut by the same enzyme (Luo et al., 2003) or sequencing 1-4 bases at the 5′ end of the fragment (Ding et al., 2001). Although a DNA analyzer presents an upper limit on the size of fragments that can be measured (about 600 bp), the measurements are much more accurate than those obtained by gel electrophoresis. Hence, the value of the tolerance used in equation (2) is smaller, resulting in a stricter Sulston score probability. This factor and the additional information that these methods provide about each fragment greatly improve the ability to identify related clones. These methods represent the next generation in fingerprinting technology and will allow for substantial improvements in automated methods of contig mapping and analysis.

With the improvement in whole genome shotgun assembly techniques, the generation of a physical map is no longer a necessary precursor to genome sequencing. However, the physical map remains a valuable resource for aiding the assembly of a genome. The correct assembly of a genome sequence from sequence data alone remains a difficult task and the physical map can act as a framework for the assembly. Furthermore, physical maps can be used to generate a set of clones that cover the entire genome with less redundancy than the original genomic library (known as a set of tiling path clones). Since the generation of a clone fingerprint remains relatively cheap, reducing the redundancy of the clones that are to be sequenced can reduce the cost of the sequencing operation.
