Applications of Peptide Retention Time in Proteomic Data Analysis




© Springer Science+Business Media Dordrecht 2015
Youhe Gao (ed.), Urine Proteomics in Kidney Disease Biomarker Discovery, Advances in Experimental Medicine and Biology 845, 10.1007/978-94-017-9523-4_7


7. Applications of Peptide Retention Time in Proteomic Data Analysis



Chen Shao 


(1)
National Key Laboratory of Medical Molecular Biology, Department of Pathophysiology, Institute of Basic Medical Sciences, Chinese Academy of Medical Sciences, 5 Dong Dan San Tiao, Beijing, China

 



 




Abstract

In proteomic studies, liquid chromatography is commonly used to separate peptide mixtures prior to mass spectrometry (MS) detection. Peptide retention time is a dimension of information independent of that provided by MS, and it has been shown to aid proteomic data analysis in many respects. Several popular software packages now offer options to use this information in MS data acquisition and analysis. This chapter briefly reviews current methodologies for retention time prediction and its applications in proteomic analysis.


Keywords
Retention time, Peptide identification, Quality control



7.1 Retention Time Prediction


A peptide’s retention time (RT) is defined as the time elapsed from the injection of a sample into the chromatography system to the detection of the peptide’s peak maximum. It depends on the chemical structure of the peptide and on its interaction with the environment (mobile and stationary phases, temperature, pH, etc.). Therefore, peptide RTs under a particular liquid chromatography (LC) condition can be predicted from structure-related properties of peptides, such as amino acid composition, sequence, hydrophobicity, and other physicochemical properties [1].

The task of RT prediction is to calculate a retention scale for each peptide under the given LC condition, e.g., a hydrophobicity scale in reverse-phase LC. A simple approach is to measure or estimate retention coefficients for individual amino acids and then predict the retention scale of a peptide as the sum of the retention coefficients of its constituent amino acids. The amino acid retention coefficients can be derived either from a set of synthetic peptides with residues substituted by each of the twenty amino acids [9] or from linear regression models fitted to peptides with various amino acid compositions [2, 21, 22, 31].
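The additive model described above can be sketched in a few lines of Python. The coefficient values below are invented purely for illustration and are not taken from any published retention scale; the function names and the linear mapping from retention scale to RT (`slope`, `intercept`) are likewise assumptions that would have to be calibrated for a given LC setup.

```python
# Hypothetical retention coefficients (arbitrary units), for illustration only.
# They are NOT taken from any published scale.
RETENTION_COEFFS = {
    "W": 11.0, "F": 10.5, "L": 9.0, "I": 8.0, "M": 5.5,
    "V": 5.0, "Y": 4.0, "C": 2.0, "P": 2.0, "A": 1.0,
    "E": 0.5, "T": 0.5, "D": 0.0, "Q": 0.0, "S": 0.0,
    "G": 0.0, "R": -1.0, "N": -1.0, "H": -1.5, "K": -2.0,
}


def retention_scale(peptide, coeffs=RETENTION_COEFFS):
    """Sum the retention coefficients of the constituent amino acids."""
    return sum(coeffs[aa] for aa in peptide)


def predict_rt(peptide, slope, intercept, coeffs=RETENTION_COEFFS):
    """Map the retention scale to an RT with an assumed linear calibration."""
    return slope * retention_scale(peptide, coeffs) + intercept
```

A model of this kind ignores residue order, so two peptides with the same composition receive the same prediction.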

In recent years, prediction models have been refined by employing peptide sequence information and more sophisticated computational algorithms, as well as larger datasets that help prevent overfitting during training [16, 27]. N-terminal residues were found to influence peptide retention behavior through the ion-pairing retention mechanism [19]. Taking this effect into account, Krokhin et al. developed a widely used prediction model, the sequence-specific retention calculator (SSRCalc) [16]. This model added a series of sequence-related correction factors to the previous model, which predicts peptide retention scales by summing individual amino acid retention coefficients [9]. Besides the three N-terminal residues, these correction factors included C-terminal residues, the nearest-neighbor effect of charged side chains (Lys, Arg, and His), peptide length, isoelectric point, hydrophobicity, propensity to form helical structures, etc. Another comprehensive model was built by Petritis et al. [27] using an artificial neural network. Similar to SSRCalc, their model used peptide properties such as length, sequence, nearest-neighbor amino acids, hydrophobicity, and hydrophobic moment, as well as predicted secondary structures, as the input nodes of the neural network. Other prediction models were developed along similar lines, but with different choices of peptide properties and statistical models [15, 23, 29].

The refined models improved the prediction accuracy (R²) from approximately 0.91–0.92 to 0.96–0.98 [17]. However, these conclusions were based on datasets of limited size and were reported by the authors of the models themselves. A blind comparison of the most recent versions of the prediction models would greatly help in selecting a suitable model for practical use. Moreover, because models based on sequence information and sophisticated computational algorithms often require considerable computational time and large training datasets, simpler linear models that provide lower but still sufficient prediction accuracy may be preferred in some cases, such as on-the-fly RT prediction and calibration [10].


7.2 Application of RT Information in Proteomic Analysis



7.2.1 Peptide Identification Based on LC-MS Data


The accurate mass and time tag (AMT tag) approach is a well-known strategy for identifying peptide sequences from LC-MS data, and it was first developed to characterize the Deinococcus radiodurans proteome [34, 38]. Given that many possible peptide species are unlikely to be detected in a particular biological system, this strategy assumes that the peptides detectable in that system can be distinguished by a two-dimensional vector of mass and RT [44]. The strategy consists of two main steps. First, an AMT database for a particular organism or type of biological sample is constructed from high-confidence peptide identifications obtained in previous replicate LC-MS/MS analyses. Second, peptides are identified in LC-MS experiments by matching measured mass and normalized elution time (NET) features to the existing database.
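The second step, matching a measured feature against the AMT database, might look like the following minimal sketch. The `AmtEntry` type, the function name, and the default tolerances are illustrative assumptions and do not describe any particular AMT software.

```python
from typing import List, NamedTuple


class AmtEntry(NamedTuple):
    sequence: str
    mass: float  # monoisotopic mass in Da
    net: float   # normalized elution time


def match_feature(mass: float, net: float, database: List[AmtEntry],
                  ppm_tol: float = 5.0, net_tol: float = 0.02) -> List[AmtEntry]:
    """Return all database entries within both the mass and NET tolerances.

    The tolerance defaults are hypothetical; suitable values depend on the
    instrument and on the complexity of the proteome.
    """
    hits = []
    for entry in database:
        ppm_error = abs(mass - entry.mass) / entry.mass * 1e6
        if ppm_error <= ppm_tol and abs(net - entry.net) <= net_tol:
            hits.append(entry)
    return hits
```

A feature that returns exactly one hit can be assigned that peptide sequence, whereas zero or multiple hits leave the feature unidentified or ambiguous.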

Similar methods also identify peptides based on accurate measurements of mass and RT [11, 24, 41]. These methods do not need a reference database to be constructed prior to peptide identification. Instead, features are matched by measured mass and RT between different LC-MS/MS runs, so that peptide identifications from MS/MS spectra can be transferred from one run to the others. In a study of the urinary proteome [25], using the “match between runs” option implemented in MaxQuant [3], the authors increased the number of protein identifications from an average of 462 to 633 in a single run.

By saving the effort of MS/MS analysis, AMT tag and similar methods can improve the efficiency and coverage of proteomic analysis. Their success depends on the complexity of the biological system as well as on the resolution of both the MS instrument and the LC system. The false discovery rate (FDR) or confidence of peptide identification can be estimated by decoy database searching (shifting the masses of all peptides in the AMT database by a certain value) [28] or by statistical models [20, 37, 43]. A computational simulation study showed that for organisms with relatively small proteomes, such as Deinococcus radiodurans, modest mass and RT accuracies were sufficient for confident peptide identification by the AMT tag strategy. For more complex proteomes, such as the human proteome, stricter criteria should be used: the majority of proteins could be uniquely identified within tolerances of 1 ppm for mass and 0.01 for NET [26].
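The mass-shifted decoy idea can be illustrated with a short sketch. The shift value, the tolerances, and the simple ratio used as the FDR estimate are assumptions chosen for illustration; the cited work may define these differently.

```python
def count_matched_features(features, database, ppm_tol, net_tol):
    """Count features matching at least one (mass, NET) database entry."""
    matched = 0
    for mass, net in features:
        if any(abs(mass - db_mass) / db_mass * 1e6 <= ppm_tol
               and abs(net - db_net) <= net_tol
               for db_mass, db_net in database):
            matched += 1
    return matched


def estimate_fdr(features, database, mass_shift=11.0, ppm_tol=5.0, net_tol=0.02):
    """Estimate FDR as decoy matches divided by target matches.

    The decoy database is the target database with every mass shifted by
    `mass_shift` Da (a hypothetical value), keeping the NET unchanged.
    """
    decoy = [(mass + mass_shift, net) for mass, net in database]
    target_hits = count_matched_features(features, database, ppm_tol, net_tol)
    decoy_hits = count_matched_features(features, decoy, ppm_tol, net_tol)
    return decoy_hits / target_hits if target_hits else 0.0
```

Tightening `ppm_tol` and `net_tol` reduces the number of decoy matches, which is one way to see why more complex proteomes call for stricter tolerances.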


7.2.2 Peptide Identification from MS/MS Spectra


RT information has been used to improve peptide identification from MS/MS spectra in several ways. One strategy is to incorporate RT information into a discriminant function along with other peptide-spectrum matching parameters, such as SEQUEST scores [39]. This discriminant function was trained on data from a known protein mixture. When applied to human plasma proteome analysis, it achieved a 16 % increase in positive peptide identifications.

Predicted RT information can also serve as a validation parameter for peptide identifications generated by database searching programs. Kawakami et al. [12] validated peptide identifications by the correlation between measured and predicted RTs. Peptide identifications within a certain tolerance of this correlation were accepted as high-confidence identifications. Several studies reported that the number of true positive peptides increased significantly when an RT filter was combined with a lower database search score threshold [15, 29, 33].
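One simple way to realize such an RT filter is sketched below: a line is fitted between predicted and measured RTs of high-confidence identifications, and a candidate identification is accepted only if its measured RT falls within a tolerance of the fitted value. The function names, the use of ordinary least squares, and the tolerance are assumptions for illustration, not the procedure of any cited study.

```python
def fit_line(predicted, measured):
    """Ordinary least-squares fit: measured ≈ slope * predicted + intercept."""
    n = len(predicted)
    mean_x = sum(predicted) / n
    mean_y = sum(measured) / n
    sxx = sum((x - mean_x) ** 2 for x in predicted)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(predicted, measured))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def passes_rt_filter(predicted_rt, measured_rt, slope, intercept, tolerance):
    """Accept an identification whose measured RT is close to the fitted line."""
    expected = slope * predicted_rt + intercept
    return abs(measured_rt - expected) <= tolerance
```

In use, `fit_line` would be called once per run on the high-confidence subset, and `passes_rt_filter` applied to the lower-scoring identifications that the relaxed score threshold lets through.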

As an alternative to predicted RT information, Sun et al. built an empirical RT database from high-confidence peptide identifications in repeated LC-MS/MS runs of a urine sample [40]. This database was used to validate MS/MS identifications for new urine samples. The limitation of the empirical database method is that it can only be applied to peptides previously detected in a particular proteome, whereas every peptide sequence can have a predicted RT value. Nevertheless, the method remains valuable because it avoids the problem of incorrect RT prediction, which is inevitable given the complex nature of peptide retention behavior.


7.2.3 Post-translational Modification Identification


A post-translational modification (PTM) on a peptide alters not only its molecular mass but also its physicochemical properties (e.g., hydrophobicity), resulting in RT shifts. The RT difference between the modified and unmodified peptide (ΔRT) provides a new dimension of information in addition to the mass shift (ΔM) in PTM identification.

Previous studies reported many instances in which peptides with different modification types or different modification sites elute at different RTs [4, 13, 32, 42]. Zybailov et al. [45] described the ΔRT distributions of dozens of modification forms detected in a plant proteome. They found that, for the majority of modifications, the direction of the RT shift correlated well with the hydrophobicity shift of the modified peptide. Combining ΔRT and ΔM constraints can efficiently reduce the FDR in PTM identification [32], especially in studies on low-resolution mass spectrometers. For example, deamidation of a peptide results in a mass shift of only 0.984 Da, so the deamidated peptide cannot be accurately distinguished from its unmodified form by a low-resolution LCQ mass analyzer. A study [4] based on synthetic peptide pairs observed that deamidated peptides elute about 3 min later than the corresponding unmodified forms in RPLC. Deamidation detection accuracy was improved from 42 to over 93 % by filtering the original SEQUEST identifications with both ΔRT and ΔM constraints.
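A combined ΔM and ΔRT constraint for deamidation could be expressed as in the following sketch. The expected ΔM of 0.984 Da and the roughly 3 min ΔRT come from the text above (the latter is specific to the LC conditions of that study); the tolerances and the function name are hypothetical.

```python
DEAMIDATION_DELTA_MASS = 0.984  # Da, mass shift caused by deamidation
EXPECTED_DELTA_RT = 3.0         # min, as observed in the cited RPLC study


def is_plausible_deamidation(unmod_mass, unmod_rt, mod_mass, mod_rt,
                             mass_tol=0.02, rt_tol=2.0):
    """Check a candidate pair against both the ΔM and the ΔRT constraint.

    `mass_tol` (Da) and `rt_tol` (min) are illustrative values only.
    """
    delta_mass = mod_mass - unmod_mass
    delta_rt = mod_rt - unmod_rt
    mass_ok = abs(delta_mass - DEAMIDATION_DELTA_MASS) <= mass_tol
    rt_ok = abs(delta_rt - EXPECTED_DELTA_RT) <= rt_tol
    return mass_ok and rt_ok
```

On a low-resolution instrument `mass_tol` has to be wide, so the ΔRT check carries most of the discriminating power.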

ΔRT information has also been used to improve algorithms for fast searching of unrestricted modifications. The Delta Accurate Mass and Time (DeltAMT) algorithm [7] calculates a two-dimensional delta vector (ΔM, ΔRT) for each pair of spectra obtained in an LC-MS/MS run. The whole set of spectrum pairs comprises two classes: pairs from the modified and unmodified forms of the same peptide, and pairs from two unrelated peptides. Accordingly, there are two classes of delta vectors, modification-induced and random. Bivariate Gaussian mixture models are employed to discriminate modification-induced distributions from random ones. Putative modifications can then be identified and reported with their (ΔM, ΔRT) information together with the putative modified and unmodified spectrum pairs. Because this algorithm does not use any fragment ion information from MS/MS spectra, it can find high-confidence modifications very quickly. However, it is limited to highly abundant modifications, since the vector distributions of low-abundance modifications are usually not distinguishable from random ones.
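As a rough illustration of the first step only, the delta vectors for all spectrum pairs can be computed as follows. This sketch assumes each spectrum is summarized by its precursor mass and RT, and orders every pair as heavier minus lighter; it does not implement the Gaussian mixture modeling and is not the published DeltAMT code.

```python
from itertools import combinations


def delta_vectors(spectra):
    """Compute (ΔM, ΔRT) for every pair of spectra.

    `spectra` is a list of (precursor_mass, rt) tuples. Pairs are ordered
    so that ΔM = heavier - lighter, which is an assumption of this sketch.
    """
    ordered = sorted(spectra)  # sort by mass, then RT
    return [(m2 - m1, t2 - t1)
            for (m1, t1), (m2, t2) in combinations(ordered, 2)]
```

The number of pairs grows quadratically with the number of spectra, so a practical implementation would likely restrict ΔM to a bounded range before any modeling.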


7.2.4 Time-scheduled Targeted Proteomic Analysis


Multiple reaction monitoring (MRM) is the method of choice in targeted proteomics. It is a highly sensitive method for accurate quantitation of low-abundance proteins in complex protein mixtures. The method needs a sufficient dwell time for each transition to maintain sensitivity and a reasonable cycle time to ensure accurate quantitation. Thus, only a limited number of transitions can be measured in each cycle, which limits throughput [30]. Time-scheduled transition acquisition (tMRM) offers a solution that can markedly increase the throughput of a traditional MRM experiment without compromising its performance. In this method, the whole gradient time is split into small time windows, and transitions are monitored only in selected windows centered on the expected RTs of the peptides. Thus, with the same dwell time and the same number of transitions monitored in each duty cycle, tMRM is able to measure many times more transitions over the whole gradient [36].
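The throughput gain from scheduling can be made concrete with a small sketch that finds the largest number of transition windows open at any one time, which determines the worst-case cycle time. The function names and the assumption that the cycle time is simply the number of concurrent transitions multiplied by the dwell time (ignoring inter-scan overhead) are simplifications for illustration.

```python
def max_concurrent_transitions(expected_rts, window_min):
    """Largest number of RT windows that are open at the same time.

    Each transition is monitored in a window of width `window_min`
    centered on its expected RT (both in minutes).
    """
    half = window_min / 2.0
    events = []
    for rt in expected_rts:
        events.append((rt - half, 1))   # window opens
        events.append((rt + half, -1))  # window closes
    # Tuples sort by time first; at equal times closings (-1) come
    # before openings (+1), so touching windows do not count as overlapping.
    events.sort()
    open_now = peak = 0
    for _, delta in events:
        open_now += delta
        peak = max(peak, open_now)
    return peak


def cycle_time_ms(n_transitions, dwell_ms):
    """Simplified cycle time: transitions per cycle times dwell time."""
    return n_transitions * dwell_ms
```

For four transitions expected at 10, 11, 12, and 30 min with 4 min windows, at most three windows overlap, so the scheduled worst-case cycle is shorter than monitoring all four transitions throughout the run.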

A key to the success of tMRM is defining a proper RT window that captures the entire peptide elution profile from baseline to baseline. This depends on accurate prediction of peptide RTs for each injection. Despite strict control of the LC system, RT shifts between injections are inevitable, especially when experiments last for days to weeks in order to analyze large numbers of samples. To accommodate these shifts, predefined RT windows need to be regularly corrected or repredicted, which reduces the efficiency and robustness of the tMRM experiment. To address this, on-the-fly RT calibration methods have been developed and integrated into instrument operating software [8, 14].

These methods use a set of well-characterized landmark peptides to calibrate the RTs of targeted peptides. Landmark peptides can be either spiked-in synthetic peptides [6, 8] or endogenous peptides that are distributed across a broad range of the gradient. At any time point, the RT windows of subsequent targeted peptides are adjusted using a local linear regression model generated from the last two eluted landmark peptides. The RT windows of peptides that elute between the first and second landmark peptides can simply be adjusted by the RT shift of the first landmark peptide, which corrects for differences in dead volume. Broad RT windows are set for all landmark peptides, as well as for peptides that elute before the third landmark peptide, to ensure that they can be captured with little or no calibration.
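The calibration rule described above can be sketched as follows. The function name and data layout are assumptions, and the line through the last two landmarks stands in for the "local linear regression"; actual instrument software may differ in its details.

```python
def calibrate_rt(target_ref_rt, eluted_landmarks):
    """Adjust a target peptide's reference RT using eluted landmark peptides.

    `eluted_landmarks` is a list of (reference_rt, observed_rt) pairs for
    the landmarks detected so far in the current run, in elution order.
    """
    if not eluted_landmarks:
        # No landmark seen yet: keep the reference RT (broad window needed).
        return target_ref_rt
    if len(eluted_landmarks) == 1:
        # Only the first landmark: apply its RT shift as a constant offset.
        ref, obs = eluted_landmarks[0]
        return target_ref_rt + (obs - ref)
    # Two or more landmarks: line through the last two eluted landmarks.
    (ref1, obs1), (ref2, obs2) = eluted_landmarks[-2:]
    slope = (obs2 - obs1) / (ref2 - ref1)
    return obs2 + slope * (target_ref_rt - ref2)
```

The scheduled window for a target would then be centered on the returned value rather than on the reference RT.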
