Tandem mass spectrometry database searching (Proteomics)

1. Introduction

Mass spectrometry (MS) has become a principal technology in proteomics analysis largely due to improvements in sequence database availability and development of software tools to interpret mass spectra. Software tools of the modern proteomic era were first published around 1990 (Johnson and Biemann, 1989; Bartels, 1990; Yates et al., 1991). These programs facilitate the de novo sequencing process that became possible after collision-induced dissociation (CID) was shown to induce peptide fragmentation primarily along the amide backbone. In 1993, the era of sequence database searching began using protein fingerprinting or peptide mass mapping, with the development of software to query sequence databases with protein digestion products. This method remains in use today as it is fast and relatively accurate, but typically requires purified proteins as input substrate such as those excised from a gel spot or band. Despite these advances, the rapid adoption of MS as a proteomics platform is directly correlated with the automation of computational tools for the analysis of spectra from peptides analyzed by tandem mass spectrometry. As a peptide-centric identification strategy, MS/MS database searching or peptide fingerprinting allows for the analysis of mixtures of unseparated proteins such as those from protein complexes or whole-cell lysates. This review will touch on computational methods associated with peptide identification using MS/MS sequence database searching.

One of the earliest attempts to combine MS/MS analysis with database search strategies was published by Mann and Wilm (1994). Their method queried sequence databases with “peptide sequence tags”, which are short amino acid sequences obtained by partial de novo sequence interpretation of tandem mass spectra combined with the corresponding mass constraints of the start and end mass of the sequence interpretation within the tandem mass spectrum. Sequence tag searches can be performed quickly because string matches can be coded efficiently and the database search space is typically limited to the analysis of peptides generated from an in silico enzymatic digestion of the database entries. Additionally, by relaxing search constraints such as the prescribed peptide mass, the sequence tag approach can identify interesting posttranslational modifications and homologous sequences (Shevchenko et al., 1997). The major drawback to the sequence tag approach is the tag generation process as it has historically been performed manually. This is a rate-limiting step particularly when one considers the extraordinary rate at which modern mass spectrometers can acquire tandem mass spectra. Sequence tags can be generated in an automated fashion; however, automated tag-generation success rates are typically limited to good-quality spectra in the tag interpretation region and, in the case of ions generated by electrospray ionization, to the tags that are generated from a specific precursor ion charge state (Tabb et al., 2003).

The year 1994 also saw the first published method of querying uninterpreted tandem mass spectra of peptides against sequence databases (Eng et al., 1994). The embodiment of the described program, SEQUEST, uses a two-stage score function. The first stage generates a preliminary score for all peptides in a sequence database on the basis of summed ion intensities of a processed spectrum, with preliminary score adjustment factors for ion continuity, immonium ion contributions, and the percentage of fragment ions matched. The second stage of the score function, applied to the 500 best preliminary scoring peptides, generates a cross-correlation score between each theoretical and acquired tandem mass spectra. The cross-correlation score used in SEQUEST is effectively the correlation coefficient or vector dot product computed between the two spectra with a correction factor. Putative peptide identifications are ranked and reported on the basis of the cross-correlation score. Information from the preliminary score, such as the preliminary score rank and percentage of matched fragment ions, are also useful indicators for assessing positive identifications and are reported as well. Subsequent publications describe applications of the algorithm to identify posttranslational modifications (Yates et al., 1995a), query of nucleic acid sequence databases (Yates et al., 1995b), application to spectra generated by different instruments (Griffin etal., 1995; Yates et al., 1996), implementation to run on parallel computer clusters (Sadygov et al., 2002), and a modification to normalize the cross-correlation score for length dependence (MacCoss, 2002). SEQUEST is among the most widely used search algorithms.

In 1999, Perkins etal. published a search tool named Mascot that queries uninterpreted tandem mass spectra among one of its search options (Perkins et al., 1999). The Mascot score function is based on the MOWSE scoring algorithm (Pappin etal., 1993) previously used in peptide mass fingerprinting but modified to be probability-based. Although the direct details of the implementation of the score function are unpublished, it is likely that statistics on fragment ion frequencies, the direct extension of the MOWSE peptide mass statistics, are the major component of the score function. It has been noted that for each candidate peptide, Mascot iteratively searches a spectrum’s peak intensities to find the set of most intense peaks that yields the maximum score. Using this method, the optimal number of peaks in a spectrum are independently determined and scored for each peptide in a database. The database search scores are interpreted as p-values and significant peptide identifications are based on thep-value scores. Owing to its independent development, fast searches, and probability-based score function, Mascot has garnered popularity as it facilitates the analysis of data acquired from most of the major instrument manufacturers.

Bafna and Edwards (2001) implement a probabilistic model for MS/MS database searching in a program named SCOPE. The score function in SCOPE incorporates fragment ion probabilities, spectral noise, and instrument measurement error. The authors employ a two-stage stochastic model for the observed MS/MS spectrum, given a peptide. The first step generates fragment ions from peptides, given a probability distribution estimated from training data. For this step, sufficient training data were not available to learn the values empirically so the authors defined the probabilities on the basis of consultation with “experienced mass spectrometry operators”. The second step generates a spectrum using these fragment ions according to a distribution based on known instrument measurement error. Peptides are scored by the probability they would give rise to the observed spectrum, as opposed to all alternative spectra. Dynamic programming is used to efficiently sum up contributions from each possible pattern of peptide ion fragmentation. Since the scores for each peptide with respect to an observed spectrum are computed independently of all other peptides under consideration, it is not impossible for several peptides to receive equally high scores. The authors thus compute an additional p-value aimed at estimating the likelihood that each peptide score arose by chance. Although only 28 MALDI-TOF-TOF test spectra are used to evaluate the scoring scheme, the prototype implementation demonstrates the effectiveness of the model as 26 of the 28 spectra are identified correctly with scores significantly separated from the incorrect peptides. Additionally, the two incorrect identifications exhibited scores not well separated from the other top-scoring peptides in each particular search. This indicates that the significance test can differentiate correct from incorrect identifications, at least in this small set.

Havilio et al. (2003) describe a computational method using a statistical scoring approach based on the probability of detecting fragment ions in the experimental spectrum that incorporates a measurement of ion intensity. A set of score functions is developed using parameters incorporating both experimental observations and theoretical knowledge of peptide fragmentation. Examples of parameters used in the score functions include terms for random or noise peaks, the relative intensity of a given fragment ion peak, and the relative frequency of unmatched fragment ion peaks. The authors generate their own test set of highly probable peptides to evaluate fragment ion statistics. To do so, a large dataset collected from 46 LC-MS/MS runs of an ion trap mass spectrometer is analyzed. The score function used in the de novo sequencing program described in Dancik et al. (1999) is implemented and the criteria for highly probable peptides, those that pass a p-value of .01 cutoff, exhibit over 50% matched fragment ions, and are greater than 10 amino acid residues in length, are used to generate statistics for Havilio’s model. These include, for each fragment type, the probability of detecting an ion of a given intensity across a range of fragment mass to peptide mass ratios, the probability of failing to detect a fragment ion across the fragment mass to peptide mass range, and the probability of a random ion of a given mass to have a particular intensity. The authors implement two score functions, intensity-based and non-intensity-based, using these statistics, with specific details described in the manuscript. Presumably tested on the same data used to generate the ion statistics, the manuscript shows that the intensity-based statistical scorer exhibits significant improvements over the non-intensity-based scorer, the initial score function of Dancik et al., and an implementation of the cross-correlation score function as presented in Eng et al.

An approach to identify peptides through sequence database searching using Bayesian statistics is implemented by ProbID (Zhang et al., 2002). This Bayesian approach attempts to quantify the likelihood that the query spectrum is generated from a given candidate peptide sequence in the database. Background information used in computing the probability includes the protease used to digest the input sample (which is used in the score function and not as a peptide filter in the database search itself), immonium ion presence, matched and unmatched ions, and consecutive or complementary matched fragment ions referred to as a pattern match. Relatively simple models are used to compute contributions for each of the background information types. For example, a normal distribution around each theoretical ion is used to weigh matched ions on the basis of their distance from the theoretical peak. The standard deviation of the normal distribution is defined by the user-specified mass accuracy of the input data. Similarly, the total number of unmatched ions is counted and used as an exponent for a term involving just the mass difference between the highest and lowest peak in the experimental spectrum. The authors compare ProbID against SEQUEST on two reference datasets and show that both software tools similarly identify a majority of the peptides correctly. However, a still significant ~10% of the identifications are detected by only one of the search engines. In summary, this manuscript shows that a relatively simple Bayesian probabilistic score function can perform as well as an industry standard tool including identifying spectra not identified by SEQUEST.

Hernandez etal. introduce a new method for protein identification using MS/ MS data, named Popitam, that uses a nondeterministic heuristic approach to confront the combinatorial problems associated with analyzing modified peptides (Hernandez etal., 2003). This problem causes extraordinary increase in search times as sequences that have multiple potential modification sites require an independent search for each combination of modifications. Popitam’s algorithm selects a given number of peaks in a spectrum, transforms the peaks into a spectrum graph, and compares them against a sequence database to generate a list of scored candidate peptides. (A spectrum graph is a mathematical representation in which peaks in an MS/MS spectrum are represented as nodes in the graph. Linking the nodes with edges that differ by amino acid residue masses generates a peptide sequence.) The initial implementation of Popitam, referred to as the “Full Path algorithm”, works by looking for complete paths through the MS/MS spectrum graph where the sequence database is used to direct the search for matched sequences. If the spectrum graph contains gaps of two or more fragmentation positions, the initial algorithm struggles to identify peptides of these spectra because a full path through the graph is required. In Popitam’s “Tag algorithm”, short tags are generated where the sequence database is used to direct and emphasize relevant sections in the graph from which tags can be generated. As the tag discovery process can be combinatorially complex, a heuristic search using Ant Colony Optimization (ACO) is performed. In this implementation, ACO (inspired by the pheromone trail laying and following behavior of real ants) is a method of exploring different solutions to find good scoring paths in the spectrum graphs. The resulting ant score is a combination of terms such as the coverage measure between database peptide sequence and the generated sequence based on parsing the spectrum graph, and a regression score representing the quality of the correlation between the experimental masses included in the path through the spectrum graph and the corresponding theoretical masses computed from the database peptide sequence. With the tag algorithm, combining unconnected relevant sections of the graph in order to maximally cover the theoretical peptides suggests that Popitam will be able to identify mutated or modified peptides without any prior knowledge of the modification.

A fragment ion frequency-based model by Sadygov et al. uses a hypergeometric distribution and is named PEP_PROBE (Sadygov and Yates, 2003). The theory involves formulating a model based on the frequency of fragment ion matches. The hypergeometric distribution is related to the binomial distribution but handles cases of sampling without replacement, in this case, the sampling of matched fragment ions. Four variables are used to calculate the hypergeometric probability. The first term is the total number of theoretical fragment ions in a sequence database for peptides of a given mass. The second term is the number of these theoretical fragment ions that match a peak in the input spectrum. The third term is the number of possible fragment ions for the peptide sequence being scored. The last term is the number of matched fragment ions for the peptide sequence being scored. To assess the significance of peptide matches, the authors implement a p-value that is calculated from the cumulative hypergeometric distribution. An implementation of the hypergeometric algorithm considering only b- and y-ions is tested against a set of approximately 59 000 ion-trap MS/MS spectra that contains 5000 true identifications. An analysis against this characterized dataset shows that false-positive error rate of 50% to less than 5% can be achieved depending on how high the score threshold cutoff is set. However, there is no mention of the corresponding sensitivity rates for each displayed error rate.

The MS/MS database search methods presented above are only a selection of the published tools in this growing field of computational proteomics. MS-Tag (Clauser et al., 1999), which implements the MOWSE score function, mutant tolerant identifications with PENDATA (Pevzner et al., 2001), vector dot product analysis with Sonar (Field et al., 2002), and signal detection theory and extended matches in OLAV (Colinge et al., 2003) are other noteworthy database search tools not covered in detail here. However, research and development in MS/MS database searching has gone beyond simply finding the optimal score functions used in MS/MS database searching. A new class of software tools have recently been developed that do not currently perform MS/MS database searches but rather assist in distinguishing between correct and incorrect database search assignments, as produced by the search tools described above. Three models have been described: one statistically based model, one model centered on machine learning using full datasets, and one intensity-based machine-learning model. A fourth introduces the concept of ion mobility and how peptide composition and charge affect MS/MS fragmentation and thus database search analysis.

Moore et al. have developed an analysis tool called Qscore to evaluate the quality of SEQUEST search results at the protein level that the authors suggest may decrease false-positive assignments and manual curation time (Moore et al., 2002). This algorithm is pseudo-probability based as it incorporates a quality score of an individual peptide match into a calculation of the probability that a protein identification has occurred by chance. The probabilistic portion of the calculation is based on the number of peptides identified from a particular protein, the number of tryptic peptides from the particular protein in the database, and the total number of identified peptides. The quality score portion of the score is calculated in a fashion similar to SEQUEST where the spectrum is divided into a number of small m/z bins. The total ion current of each bin in a normalized measured spectrum is multiplied by the total ion current in the same range in a normalized predicted spectrum and all products are summed to a single value. A perfect match in this algorithm gives a quality score of 1. The quality score is converted to a probability by taking the inverse of 1 minus the quality score. The composite Qscore incorporates this quality score into the probability calculation. Testing by the authors produced good correlation with hand curation with reduced false-positive results.

Keller etal. introduced a method based on the use of the expectation maximization (EM) algorithm to derive a mixture model of correct and incorrect peptide identifications from the data (Keller etal., 2002). The software tool, PeptideProphet, learns to distinguish between correct and incorrect peptide identifications based on observed information, primarily database search scores and other peptide properties, from each peptide in a dataset. Peptide properties that are used to discriminate between correct and incorrect identifications include the difference between the observed versus calculated peptide masses, the number of missed enzymatic cleavage sites, and the number of peptide termini consistent with the expected enzymatic cleavage. A peptide discriminant score optimally combines the individual search scores, such as those produced by SEQUEST, into a single discriminant score. The coefficients of the discriminant function are developed for each type of MS/MS database search program and can be optimized for each type of mass spectrometer. On the basis of the distribution of scores mapped through the discriminant function, PeptideProphet applies EM to find mixture model distributions, Gaussian for the positive population and gamma for the negative population, that best fit the observed data. Other parameters of the data, such as the distribution of correct enzymatic termini exhibited by the positive and negative populations, are also automatically learned. Using the learned distributions of the positive and negative populations, PeptideProphet assigns computed probabilities (as opposed to p-values) that, in an implementation using ion trap data with the SEQUEST database search program, are shown to accurately reflect the confidence of peptide identifications being correct. Such accurate computed probabilities furthermore enable the prediction of false-positive error rates. The algorithm learns the correct and incorrect peptide assignments from the information in each individual dataset, rather than relying on a single set of universal parameters, which makes it tolerant of variations in factors such as data quality and protease digestion efficiency. Additionally, because PeptideProphet can be extended to analyze data generated from various mass spectrometers and run through different search engines, it can serve to standardize publication criteria and facilitate comparisons of large-scale datasets, particularly if data of a constant error rate are reported. It should be noted that the authors extend this peptide analysis to identifying proteins with the tool ProteinProphet (Nesvizhskii et al., 2003).

Elias et al. describe an intensity-based model developed using a machine-learning algorithm on 27 000 high-quality 2+ peptide-spectrum matches (Elias et al., 2004). This strategy uses a decision tree to model the probability of seeing a fragment ion based on 63 peptide fragment attributes. The decision tree was also applied to a set of mismatched peptides (the second best peptide from the high-quality matches). The authors calculate the likelihood of observing each fragment ion from a potential peptide match using both trees and calculate an odds ratio for that fragment ion. The individual odds ratios for each peptide are summed to generate a cumulative odds ratio (LOD score) for that peptide match. The authors tested their LOD score against other commonly used single scoring tools, such as SEQUEST’s XCorr and dCn, and were able to demonstrate better sensitivity (proportion of correct identifications noted as correct) and specificity (proportion of incorrect identifications noted as being incorrect) than any of the commonly used strategies. In the authors’ analysis, the LOD score performed approximately as well as the discriminant score used in PeptideProphet and when combined with the Xcorr and dCn had a slight advantage in specificity for a given sensitivity; comparisons against the full PeptideProphet analysis were not performed. The authors tested a publicly available dataset previously evaluated using a standardized scoring system and report finding 24% more correct identifications at the same false-positive rate using a LOD-based analysis.

Kapp et al. mined a database of 5500 tandem mass spectra of unique peptides obtained by tryptic digestion in order to determine which factors affect tandem MS fragmentation under low-energy CID conditions (Kapp et al., 2003). A residue’s preference for cleavage (N- or C-terminus) was determined on the basis of two independent methods. First, a cleavage intensity ratio (CIR) value is calculated to quantify the extent of fragmentation occurring both N- and C-terminal to each amino acid residue within a peptide. The CIR is calculated by dividing the summed intensity for each amide bond cleavage by the average intensity of all cleavage sites within a peptide. CIR values greater than 1 indicate enhanced cleavage and below 1 indicate reduced cleavage. Second, a linear model that takes into account positional factors as well as peptide length was used to determine whether certain residues showed a propensity to cleave N- and/or C-terminal. Linear regression was performed to estimate effects of variables, such as the relative position of the cleavage along the peptide backbone and the specific effect of the adjacent amino acid on either side of the cleavage site, in order to select and retain only those variables that have a real effect on the fragmentation process. On the basis of using both CIR values and the linear model, the authors have classified peptides as nonmobile if the number of available protons is less than or equal to the number of arginine residues. For the remaining peptides, if the number of available protons was greater than the number of arginine residues but less than or equal to the total number of basic residues (arginine, lysine, and histidine), these were classified as partially mobile, otherwise, they were classified as mobile (more protons than basic residues). This has been termed the relative proton mobility scale. Analysis of SEQUEST and Mascot against the 5500 unique manually validated peptide MS/MS spectra shows that the search score varies considerably depending on the charge state and proton mobility classification. In relation to MS/MS database search, two conclusions can be drawn from this analysis. The first is that search algorithm cutoff filters or score thresholds perform poorly for certain classes of spectra (e.g., nonmobile peptide), so score cutoffs should be reevaluated by taking into account the relative proton mobility scale described. Second, score functions can be improved by taking into account fragment ion abundances that are based on sequence, charge state, and proton mobility classification.

2. Conclusion

Database search software has made extraordinary strides in the last 10 years. Numerous programs now exist to allow rapid assignment of peptides of unin-terpreted spectra. Many such programs are quite good and easily implemented for both large and small laboratories. Comparisons show that a number of the best search programs are fairly similar in their performance, each finding roughly similar numbers of peptides and assigning identifications to a few spectra “invisible” to other algorithms. Novel algorithms show promise of higher sensitivity at a given error rate compared to existing tools but need to be validated for real-world applicability. The future will almost certainly bring advances in allowing assignment to modified peptides as well as a refinement of search criteria as large datasets become available for testing. Further, the era of manual curation is coming to a conclusion, with the development of software that can assign a quality or probability score to a peptide assignment on the basis of a variety of spectral features or search scores. These developments will likely culminate with tools that allow rapid optimization for a set of experimental conditions such as MS platform and enzymatic treatment, ultimately streamlining sampling processing in both high- and low-throughput environments.

Next post:

Previous post: