Tutorial on tandem mass spectrometry database searching (Proteomics)

1. Introduction

With recent advances in search software and available databases, a novice can now easily perform a tandem mass spectrometry (MS/MS) sequence database search. Such a search requires an input spectrum, a sequence database to query, and a set of search parameters to guide the search. This tutorial covers the mechanics of performing an MS/MS search, using the Mascot search engine (Perkins et al., 1999) for demonstration purposes.

Figure 1 illustrates the general process of sequence database searching. An acquired tandem mass spectrum of a peptide is compared against theoretical tandem mass spectra of peptides generated in silico from the specified sequence database. The methods by which they are compared, that is, the mathematical scoring algorithms, vary widely between database search engines such as Mascot, MS-Tag (Clauser et al., 1999), and SEQUEST (Eng et al., 1994; see also Article 3, Tandem mass spectrometry database searching, Volume 5). Nevertheless, the different search engines typically produce quite similar results, and whatever the mathematics "behind the scenes", many features of running a database search are common to most search engines.
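As a rough illustration of the comparison step, the sketch below generates singly charged b- and y-ion masses for a candidate peptide and counts how many of them fall within a tolerance of the observed peaks. The function names and the simple shared-peak count are assumptions made for this example only; Mascot, MS-Tag, and SEQUEST each use their own, more elaborate scoring schemes.

```python
# Monoisotopic residue masses in Da (standard values).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056
PROTON = 1.00728


def fragment_ions(peptide):
    """Return (b_ions, y_ions) as lists of singly charged m/z values."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    b_ions, y_ions = [], []
    for i in range(1, len(masses)):
        # b ion: N-terminal fragment of i residues plus one proton.
        b_ions.append(sum(masses[:i]) + PROTON)
        # y ion: C-terminal fragment plus water plus one proton.
        y_ions.append(sum(masses[i:]) + WATER + PROTON)
    return b_ions, y_ions


def shared_peak_count(observed_mz, peptide, tol=0.8):
    """Count theoretical b/y ions that match any observed peak within tol.

    This is a toy similarity measure, not the score of any real engine.
    """
    b_ions, y_ions = fragment_ions(peptide)
    return sum(
        1
        for theoretical in b_ions + y_ions
        if any(abs(theoretical - mz) <= tol for mz in observed_mz)
    )
```

A search engine applies its own scoring function to every candidate peptide drawn from the database and ranks the candidates by the result.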

2. Example spectrum to be interpreted

Figure 2 shows a raw, uninterpreted tandem mass spectrum of a peptide that will be used as the input for this exercise. The information derived from the input spectrum consists of the mass and intensity pairs that represent each peak in the fragmentation spectrum, the mass (1157 a.m.u.) of the precursor ion selected for MS/MS, and the charge state of the precursor ion, if known. This spectrum was acquired on an ion trap mass spectrometer. The mass spectrometer used for acquisition determines the errors associated with the mass measurements and thus the mass tolerances to be used in the database search. The instrument also determines the predominant peptide fragmentation products, thus guiding the selection of fragment ion types to consider in the search parameters.
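The three pieces of input described above (peak list, precursor mass, charge state) are commonly exchanged as a text peak list. The sketch below reads a minimal Mascot generic format (MGF) block, assuming one "m/z intensity" pair per peak line. The parser name and all values in the example are invented for illustration; they are not the peak list of the spectrum in Figure 2.

```python
def parse_mgf(text):
    """Parse MGF text into a list of {'params': dict, 'peaks': list} spectra.

    Assumes every peak line holds at least an m/z value and an intensity.
    """
    spectra = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if line == "BEGIN IONS":
            current = {"params": {}, "peaks": []}
        elif line == "END IONS":
            if current is not None:
                spectra.append(current)
            current = None
        elif current is not None:
            if "=" in line and not line[0].isdigit():
                # Header line such as PEPMASS=... or CHARGE=...
                key, value = line.split("=", 1)
                current["params"][key.strip().upper()] = value.strip()
            else:
                # Peak line: m/z followed by intensity.
                mz, intensity = line.split()[:2]
                current["peaks"].append((float(mz), float(intensity)))
    return spectra


# Invented example values, for illustration only.
EXAMPLE_MGF = """BEGIN IONS
TITLE=hypothetical example spectrum
PEPMASS=1157.0
CHARGE=1+
175.1 120.0
272.2 45.0
982.5 300.0
END IONS
"""

if __name__ == "__main__":
    for spectrum in parse_mgf(EXAMPLE_MGF):
        print(spectrum["params"], len(spectrum["peaks"]), "peaks")
```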


Figure 1 A schematic showing the concept of MS/MS database searching

Figure 2 Example uninterpreted tandem mass spectrum. The displayed peak list, precursor ion charge state, and precursor or peptide mass are the input for an MS/MS database search

Any additional knowledge available about the input sample can improve the chances of a successful database search. Specifically, knowing the organism from which the sample derives permits querying against a species-specific sequence database. Also, because the fragmentation spectra are from peptides, the protease used to digest the intact protein(s) is valuable information that can be used either as a search parameter, and thus as a constraint on the search, or as part of the validation of the search results. Finally, any process that changes the nature of amino acids in the sample, such as metabolic labeling or chemical side chain modification (for example, cysteine alkylation), must be incorporated into the search parameters (see below).

3. Search parameters

In our current example, there is no suspicion that the peptide is posttranslationally modified; thus, no modifications are specified in the initial query. If processing steps modify the sample, such as cysteine alkylation or metabolic labeling that incorporates an isotopically heavy amino acid, then these modifications should be specified in the search parameters. Modifications can typically be applied to any amino acid and/or the amino or carboxy terminus of a peptide.

Modifications typically are of two types. The first is the static modification that modifies all occurrences of a residue as would be expected with a covalent reaction that goes to completion or a metabolic labeling in which all residues are expected to be replaced. A second type of modification is the variable modification. This type of modification forces the search engine to look for two different forms of an amino acid. A common example is the search for phosphorylation where only a few percent of serine, threonine, or tyrosine residues are actually phosphorylated. In this case, the search is conducted allowing these residues to be either modified (+80 a.m.u.) or unmodified. Note that specifying this type of modification in the search parameters can significantly prolong search times.
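The cost of a variable modification can be seen by enumerating the forms the search engine has to consider. A static modification simply changes the mass of a residue, so each peptide still has one form; a variable modification gives each target residue two states, so a peptide with n candidate sites has 2^n forms. The sketch below is a minimal illustration under that assumption; the function name and the "*" annotation are invented for this example.

```python
from itertools import product

# Monoisotopic mass shift of phosphorylation (HPO3) in Da, nominally +80.
PHOSPHO = 79.96633


def modified_forms(peptide, targets="STY", delta=PHOSPHO):
    """Yield (annotated_sequence, total_mass_shift) for every combination
    of modified and unmodified target residues in the peptide."""
    sites = [i for i, aa in enumerate(peptide) if aa in targets]
    for states in product((False, True), repeat=len(sites)):
        chosen = {i for i, on in zip(sites, states) if on}
        annotated = "".join(
            aa + ("*" if i in chosen else "") for i, aa in enumerate(peptide)
        )
        yield annotated, delta * len(chosen)


if __name__ == "__main__":
    # A peptide with one serine and one threonine has 2**2 = 4 forms.
    for form, shift in modified_forms("ADSVGKLLTVR"):
        print(form, round(shift, 3))
```

Because every form must be scored against the spectrum, the number of comparisons grows quickly for peptides with many serine, threonine, or tyrosine residues.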

The peptide associated with the spectrum in Figure 2 is from Bos taurus (cow), so this spectrum should ideally be searched against a database composed of B. taurus protein sequences. However, if the sample's species is not known or if the genome of that organism is poorly covered, then a more comprehensive sequence database composed of all species can be substituted, with the hope of finding an identical peptide in a related species. On-line search engines, such as Mascot or MS-Fit, supply a fixed set of sequence databases that can be queried. With a local search engine installation, both publicly available and proprietary sequence databases can be used for the database search.

Figure 3 displays a screen capture of the search parameters defined for this database search. As no cow-specific sequence database is available on the public Mascot search engine, a mammalian subset of the SwissProt (Bairoch and Apweiler, 1997) sequence database is selected. Trypsin is specified in the Enzyme field since trypsin was used to digest the original sample. This restricts the comparison to tryptic peptides of the precursor mass (1157 a.m.u.) rather than all peptides of that mass in the database, which dramatically shortens the time required to complete a search. Neither static nor variable modifications are specified. ESI-TRAP is specified in the Instrument field; the selection of the instrument defines the fragment products to consider in the analysis. Mass tolerances of 2.0 a.m.u. and 0.8 a.m.u. are used for the peptide mass and fragment ion masses, respectively. The performance of the mass spectrometer defines the mass tolerances used for the search.
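The combined effect of the Enzyme and peptide mass tolerance settings can be sketched as a filter over an in silico digest. The example below uses the commonly applied trypsin rule (cleave after K or R unless the next residue is P). The toy protein string is hypothetical, and treating 1157 as a neutral monoisotopic peptide mass is an assumption made only to keep the example short.

```python
# Monoisotopic residue masses in Da (standard values).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056


def peptide_mass(peptide):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def cleavage_sites(protein):
    """Positions where trypsin is expected to cut, plus both sequence ends."""
    sites = [0]
    for i, aa in enumerate(protein[:-1]):
        if aa in "KR" and protein[i + 1] != "P":
            sites.append(i + 1)
    sites.append(len(protein))
    return sites


def tryptic_peptides(protein, missed_cleavages=1):
    """Fully tryptic peptides with up to the given number of missed cleavages."""
    sites = cleavage_sites(protein)
    peptides = []
    for i in range(len(sites) - 1):
        last = min(i + missed_cleavages + 2, len(sites))
        for j in range(i + 1, last):
            peptides.append(protein[sites[i]:sites[j]])
    return peptides


def tryptic_candidates(protein, target_mass, tol=2.0, missed_cleavages=1):
    """Tryptic peptides whose mass lies within tol of the precursor mass."""
    return [
        p for p in tryptic_peptides(protein, missed_cleavages)
        if abs(peptide_mass(p) - target_mass) <= tol
    ]


if __name__ == "__main__":
    toy_protein = "MKADSVGKLLTVRGGK"  # hypothetical sequence for illustration
    print(tryptic_candidates(toy_protein, 1157.0))
```

Only the peptides that survive this filter need to be scored against the spectrum, which is why the enzyme constraint shortens the search.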

Figure 3 Example Mascot search parameters page. Primary parameter settings are the selection of the sequence database to query, modifications (if any), and acquisition instrument selection

4. Search results

The results of the search are shown in Figure 4. A description of the scoring algorithm and significance determination is beyond the scope of this tutorial. The Mascot search results show no positive identifications: the top ranked peptide, ADSVGKLLTVR, received a probability-based MOWSE score of only 24, which is below the suggested homology and identity significance thresholds of 32 and 41, respectively, for this particular input spectrum. In addition to having a nonsignificant score, the top ranked peptide is from a Mus musculus protein (MCM7_MOUSE), whereas the input sample was from B. taurus.

There are many potential explanations and remedies for incorrect identifications from unsuccessful database searches. The correct sequence may not be in the database that was queried; searching a larger or different sequence database might resolve the problem. The peptide could be posttranslationally modified, in which case common modifications might be added in a subsequent search. It is also possible that the mass tolerances have been set too narrowly, with the effect of screening out the correct peptide. In this case, however, the database queried, the lack of specified modifications, and the mass tolerances were not the culprits. The spectrum failed to be identified correctly because of the tryptic enzyme constraint. Even though the original sample was processed with trypsin to generate peptides, nontryptic peptides can arise from incomplete digestion, contaminating proteolysis, or in-source fragmentation. When the enzyme constraint was removed, Mascot identified the peptide YQEPVLGPVR, a fragment that is tryptic only at its C-terminus, from bovine beta-casein (CASB_BOVIN). The peptide was identified with a score of 72, compared to the suggested homology and identity thresholds of 49 and 61, respectively, as shown in Figure 5.
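Removing the enzyme constraint amounts to considering every contiguous stretch of every database sequence whose mass falls within the peptide tolerance. The sketch below shows this for a single sequence. The sequence is a short illustrative string, not a database entry, and treating 1157 as a neutral monoisotopic mass is again an assumption of the example.

```python
# Monoisotopic residue masses in Da (standard values).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056


def peptide_mass(peptide):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def unconstrained_candidates(protein, target_mass, tol=2.0):
    """Every contiguous substring whose mass is within tol of target_mass."""
    hits = []
    for start in range(len(protein)):
        mass = WATER
        for end in range(start, len(protein)):
            mass += RESIDUE_MASS[protein[end]]
            if mass > target_mass + tol:
                break  # extending the peptide can only make it heavier
            if abs(mass - target_mass) <= tol:
                hits.append(protein[start:end + 1])
    return hits


if __name__ == "__main__":
    toy_sequence = "AFLLYQEPVLGPVRGPFPIIV"  # illustrative string only
    print(unconstrained_candidates(toy_sequence, 1157.0))
```

In this toy sequence the only tryptic cleavage site follows the single arginine, so a fully tryptic search would consider AFLLYQEPVLGPVR and GPFPIIV but never YQEPVLGPVR. The unconstrained search examines on the order of n(n+1)/2 substrings for a sequence of length n, which is the price paid in search time.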

Figure 4 Mascot search results page. The top ranked peptide, ADSVGKLLTVR, was from the wrong organism and had a poor score, indicative of an incorrect identification

Figure 6 shows the overlay of calculated versus actual fragment ions for the peptide YQEPVLGPVR. This is a common view when validating search results because the matched spectrum should account for most of the peaks in the MS/MS spectrum (see Article 4, Interpreting tandem mass spectra of peptides, Volume 5). However, this validation process can be difficult and extremely subjective, depending on the quality of the input spectra and the experience of the user. One strategy to facilitate the validation process is to use background information on the input sample as a validation tool rather than a search constraint.

Figure 5 Mascot search results page. After modifying the search parameters to allow for peptides resulting from unexpected cleavage, the top ranked peptide, YQEPVLGPVR, had a significant score and was identified from the correct organism

Figure 6 A typical spectral display view shows the extent of the match between input spectrum and identified peptide. Note how all of the major peaks are successfully accounted for
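One way to make the "most of the peaks are accounted for" check less subjective is to compute the fraction of the observed ion intensity that lies within the fragment tolerance of a calculated fragment mass. The sketch below is a hypothetical helper, not a Mascot feature; it takes the observed peaks and a list of calculated fragment m/z values as given.

```python
def explained_intensity(peaks, theoretical_mz, tol=0.8):
    """Fraction of total observed intensity matched by theoretical fragments.

    peaks: list of (m/z, intensity) pairs from the input spectrum.
    theoretical_mz: calculated fragment m/z values for the candidate peptide.
    """
    total = sum(intensity for _, intensity in peaks)
    if total == 0:
        return 0.0
    matched = sum(
        intensity
        for mz, intensity in peaks
        if any(abs(mz - t) <= tol for t in theoretical_mz)
    )
    return matched / total


if __name__ == "__main__":
    # Invented peak values, for illustration only.
    observed = [(175.1, 100.0), (300.0, 50.0), (500.0, 50.0)]
    calculated = [175.119, 499.5]
    print(explained_intensity(observed, calculated))
```

A value near 1 means the candidate peptide explains nearly all of the signal; a low value suggests that major peaks remain unassigned and the match deserves closer inspection.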

As an example of using background information for validation, the spectrum here was searched against a much larger database composed of all mammalian sequences rather than a (smaller) bovine sequence database, so the identification of a bovine protein supports the validity of the result. Additionally, searching without an enzyme constraint can help the validation process because any peptide string from the sequence database can be analyzed (and not just the tryptic peptides expected from the protein digestion process). For an enzyme-unconstrained search, high-scoring results that show the expected tryptic or partially tryptic termini (as in this case) add confirmatory evidence.
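Checking the termini of a hit from an unconstrained search is easy to automate. The sketch below classifies a peptide by how many of its ends are consistent with trypsin cleavage (after K or R, not before P) or with a protein terminus. The function name and both example sequences are hypothetical.

```python
def classify_termini(protein, peptide):
    """Return 'fully tryptic', 'semi-tryptic', or 'nontryptic' for the first
    occurrence of peptide within protein."""
    start = protein.find(peptide)
    if start < 0:
        raise ValueError("peptide not found in protein")
    end = start + len(peptide)  # exclusive end index

    # N-terminus is acceptable at the protein start or after a cleavage site.
    n_ok = start == 0 or (protein[start - 1] in "KR" and protein[start] != "P")
    # C-terminus is acceptable at the protein end or at a cleavage site.
    c_ok = end == len(protein) or (protein[end - 1] in "KR" and protein[end] != "P")

    labels = {2: "fully tryptic", 1: "semi-tryptic", 0: "nontryptic"}
    return labels[int(n_ok) + int(c_ok)]


if __name__ == "__main__":
    # Illustrative sequences only, not database entries.
    print(classify_termini("AFLLYQEPVLGPVRGPFPIIV", "YQEPVLGPVR"))
    print(classify_termini("MKADSVGKLLTVRGGK", "ADSVGKLLTVR"))
```

A high-scoring hit from an unconstrained search that turns out to be fully or partially tryptic is more credible than one with no tryptic terminus at all.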

5. Conclusions

Researchers new to the field who are interested in exploring proteomics should not be daunted by the complexity of tandem mass spectrometry. MS/MS database searching is an extremely powerful tool for the analysis of both simple and complex protein samples; it couples the explosion of available genome sequence information with improvements in mass spectrometry instrumentation to give rapid peptide and protein identification. With the efficiency of modern instrumentation and software, it is extremely easy to acquire and search a large number of MS/MS spectra, which shifts the burden of analysis to the validation process. As shown, the steps in performing an MS/MS database search are simple and straightforward, and search software has become more user-friendly for the neophyte. However, to achieve optimal results, newcomers to these techniques must educate themselves about the choices they make both in running a search (i.e., how search parameters affect the search results) and in validating the search results.
