This repository contains Rmarkdown modules dedicated to statistical analysis of quantitative proteomic data.
They are divided into two projects:
- Instrumental quality control
- Quantitative data analysis
The first project contains modules for monitoring the stability of LC-MS instruments across a series of acquisitions by analyzing the signal generated for a reference protein (e.g. CytoC).
The second project is dedicated to quantitative proteomic data analysis, from QC to statistical differential analysis.
All of these modules generate HTML reports. Examples are provided in _Example/Reports/_.
All modules can be launched from the command line using the scripts stored in the "Launchers" folder. Working examples of command lines can be found in the "run.bat" file of this folder (Windows only). For example, to run a QC on the quantification file stored at Example/Datasets/QC-DA/Parsed_proteins_set.txt, launch the following instruction in a terminal:
Rscript --vanilla run_proteomics_stats.R QC "../../Example/Datasets/QC-DA/Parsed_proteins_set.txt" "../../Example/Reports/QC_example.html" TRUE FALSE
The modules of the instrumental QC project take as input the "cytoc_xics" file generated by Proline's extraction module (format compliant with MSstatsQC):
AcquiredTime | Precursor | MinStartTime | MaxEndTime | Annotations | Best.RT | MaxFWHM | TotalArea | moz.assymetry | mz |
---|---|---|---|---|---|---|---|---|---|
14/05/2019 06:43 | EDLIAYLK | 27.155064 | 28.234272 | RAW = QEMYC190513_16; mz = 482.7719957778697; delta PPM = 2 | 27.326147 | 0.009494722 | 5.92E+07 | 0.9891489 | 482.77 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
You can see examples in Example/Datasets/CytoC.
The instrumental QC project includes two modules (located at Modules/Instrumental QC):
- Descriptive_CytoC.rmd: this module generates a report showing how the metrics (RT, intensity, PPM) evolve across acquisition times (see Descriptive_report__cytoc_ref.html).
- Comparative_CytoC.rmd: this module generates a report evaluating the validity of the metrics in a test series against the metrics observed in a reference series (see Comparative_report__cytoc_ref.html).
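As a rough illustration of the comparative idea, the sketch below derives control limits from a reference series and flags the acquisitions of a test series that fall outside them. The mean ± 3 SD rule, the helper name `flag_out_of_control` and the toy values are assumptions made for this example; the Comparative_CytoC module may apply a different rule.

```r
# Hypothetical helper: flag test values outside mean +/- k * sd of the reference.
flag_out_of_control <- function(reference, test, k = 3) {
  center <- mean(reference, na.rm = TRUE)
  spread <- sd(reference, na.rm = TRUE)
  lower <- center - k * spread
  upper <- center + k * spread
  data.frame(
    value = test,
    lower = lower,
    upper = upper,
    out_of_control = test < lower | test > upper
  )
}

# Toy retention times (minutes) for one precursor; not real data.
reference_rt <- c(27.31, 27.33, 27.30, 27.35, 27.32, 27.34)
test_rt <- c(27.33, 27.90, 27.31)
result <- flag_out_of_control(reference_rt, test_rt)
print(result)
```

The same helper could be applied to any metric column of the "cytoc_xics" table (for example Best.RT or TotalArea), one precursor at a time.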
The QC and DA modules work with a standardized input format that can be obtained from any quantification file using the parsers stored in the "Parsers" folder. The ROC module takes as input a folder containing tables generated by the DA module.
The QC module generates a report providing a global visualization of data quality and reproducibility (see Example/Reports/QC_proteins_set.html). The user can set the following parameters on the command line:
Parameter | Function | Values |
---|---|---|
Normalization | Should intensities be normalized? | T / F |
keep_empty_rows | Should empty rows be kept in the dataset? | T / F |
format_svg | Choose SVG or PNG format for figure images | SVG / PNG |
coloring_by_group | Should figures be colored by group/condition? (WARNING: this option requires an experimental design) | T / F |
The DA module processes the data (filtering, normalization, missing value imputation) and performs a statistical differential analysis to detect variants between biological conditions.
It takes as input the standardized quantification file, together with:
- The experimental design (see Example/Datasets/QC-DA/ExpDesign.txt)
- A set of parameters stored in a TSV file (see Example/Datasets/QC-DA/Parameters.txt). All parameters and their possible values are described here:
Parameter | Function | Values |
---|---|---|
Normalisation | Should intensities be normalized? | T / F |
Proteins.for.normalization | List of IDs of the proteins the data should be normalized to | Example: 1556122;1552942;1551169;1550971;1556987 |
Filter.threshold.ms | Minimum number of files in which the protein was identified by MS/MS for it to be kept in the analysis | ℕ |
Filter.threshold.obs | Minimum percentage of observed intensities, in at least one condition, for the protein to be kept in the analysis | [0, 100] |
Imputation.MNAR.model | Which model should be used to impute MNAR (missing not at random) values? | percentile / gaussian |
Imputation.MNAR.percentile | Which percentile should be used for the "percentile" imputation model? | [0, 1] |
Imputation.MCAR.model | Which model should be used to impute MCAR (missing completely at random) values? | none / MNAR / knn |
Imputation.MCAR.threshold.obs | Minimum number of observed intensities in a condition to classify the missing values as MCAR | [0, number of replicates] |
Imputation.MCAR.threshold.MSMS | Minimum number of files identified by MS/MS in a condition to classify the missing values as MCAR | [0, number of replicates] |
Imputation.knn.min.occurrences | Minimum number of observed intensities in a condition for the protein to be used as a k-nearest neighbour in the KNN imputation model | [0, number of replicates] |
Test.type | Which statistical test should be used? | t.test / limma / wilcoxon |
Test.log | Should intensities be log10-transformed before running the statistical test? | T / F |
Test.alternative | Should the statistical test be one-sided or two-sided? | two.sided / less / greater |
Test.paired | Should the statistical test be paired? | T / F |
Test.var.equal | If t.test is selected, specify whether variances are assumed equal (Student's t-test) or not (Welch's t-test) | T / F |
Test.adjust.FDR | FDR level used for multiple testing correction | [0, 1] |
Test.adjust.procedure | Multiple testing correction procedure | none / BH / ABH |
Fold.Change | Ratio to compute between conditions | fc / zscore |
Volcano.threshold.pvalue | Significance threshold for the p-value | [0, 1] |
Volcano.threshold.fc | Significance threshold for the ratio | ℝ |
Volcano.coloring.style | Choose coloring based on significance or intensity (heatmap or monochrome) | signifigance / intensity / monochrome |
Volcano.manual.accession | List of protein accession codes for manual labeling | Example: Q9JHU4;A2ASS6;Q8VDD5;Q9QXS1 |
Volcano.no.labeling | Choose to have no labels on the volcano plot | T / F |
Comparisons | List of pairs of conditions to compare | Example: 50fmol/25fmol;50fmol/10fmol |
Figure.format | Choose SVG or PNG format for figure images | SVG / PNG |
No.imputed.values | Choose to have no imputation of missing values | T / F |
Color.by.group | Choose between random coloring and coloring by condition | T / F |
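
To make the test-related parameters more concrete, here is a minimal sketch of how Test.alternative, Test.paired, Test.var.equal and Test.adjust.procedure could map onto base R's `t.test()` and `p.adjust()`. This mapping is an assumption made for illustration, not a description of the module's internals; the toy matrices and the `params` list are hypothetical. The "ABH" procedure has no equivalent in `p.adjust()` and is not covered here.

```r
# Toy log-intensities for two conditions (rows = proteins, 3 replicates each).
set.seed(1)
cond_a <- matrix(rnorm(30, mean = 20), nrow = 10)
cond_b <- matrix(rnorm(30, mean = 20), nrow = 10)
cond_b[1:2, ] <- cond_b[1:2, ] + 5  # two simulated variants

# Hypothetical parameter values, named after the rows of the table above.
params <- list(
  Test.alternative = "two.sided",
  Test.paired = FALSE,
  Test.var.equal = TRUE,          # TRUE: Student's t-test, FALSE: Welch's t-test
  Test.adjust.procedure = "BH"    # "none" and "BH" are valid p.adjust() methods
)

# One t-test per protein.
p_values <- vapply(seq_len(nrow(cond_a)), function(i) {
  t.test(cond_a[i, ], cond_b[i, ],
         alternative = params$Test.alternative,
         paired = params$Test.paired,
         var.equal = params$Test.var.equal)$p.value
}, numeric(1))

# Multiple testing correction across proteins.
adjusted <- p.adjust(p_values, method = params$Test.adjust.procedure)
print(data.frame(p_value = p_values, adjusted = adjusted))
```
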
The module outputs a TSV table summarizing the input data, the intensities after each processing step and the final results of the statistical analysis. A report containing QC figures for each processing step and figures showing the results of the differential analysis is also provided (see _Example/Reports/DA_proteins__set.html_).
Three missing value imputation models are available:
- Percentile-based
- Gaussian-based
- KNN-based
The percentile-based model replaces missing values with a chosen percentile of the distribution of intensities in each sample.
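
The percentile-based idea can be sketched in a few lines of R. The helper name `impute_percentile` and the toy matrix are hypothetical, and the module's own implementation may differ in its details (for example in the quantile definition used).

```r
# Hypothetical helper: replace the NAs of each sample (column) with the chosen
# percentile of that sample's observed intensities.
impute_percentile <- function(intensities, percentile) {
  apply(intensities, 2, function(sample_values) {
    fill <- quantile(sample_values, probs = percentile,
                     na.rm = TRUE, names = FALSE)
    sample_values[is.na(sample_values)] <- fill
    sample_values
  })
}

# Toy data: rows = proteins, columns = samples.
toy <- cbind(s1 = c(10, 12, NA, 15), s2 = c(NA, 11, 13, 14))
imputed <- impute_percentile(toy, percentile = 0.05)
print(imputed)
```

Here `percentile` plays the role of Imputation.MNAR.percentile, expressed as a value between 0 and 1.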
The Gaussian-based model replaces missing values with values drawn from a Gaussian simulation. The model works as follows:
Let n be the number of missing values to impute. Let p1 and p2 be the 1st and 2nd percentiles of the distribution of observed intensities across all samples (A).
First, a Gaussian distribution with median p1, a standard deviation equal to that of the observed intensities lying between p1 and p2, and size 2n is simulated (B). After removal of negative values, this simulation is divided into 21 classes (C). The number of missing values drawn at random from each class is then proportional to the number of simulated values contained in that class (C).
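
The steps above can be sketched as follows. Several points are assumptions made for this example because the description leaves them open: the 21 classes are taken to be equal-width bins over the simulated range, proportionality is obtained by weighted random sampling of classes, and each imputed value is drawn uniformly inside its class. The helper name `impute_gaussian` is hypothetical.

```r
# Hypothetical helper sketching the Gaussian-based imputation.
impute_gaussian <- function(intensities, n_classes = 21) {
  missing <- is.na(intensities)
  n <- sum(missing)
  if (n == 0) return(intensities)
  observed <- intensities[!missing]

  # p1 and p2: 1st and 2nd percentiles of all observed intensities.
  p <- quantile(observed, probs = c(0.01, 0.02), names = FALSE)
  low_tail <- observed[observed >= p[1] & observed <= p[2]]
  stopifnot(length(low_tail) >= 2)  # sd() needs at least two values

  # Simulate 2n values centred on p1, then drop negative values.
  simulated <- rnorm(2 * n, mean = p[1], sd = sd(low_tail))
  simulated <- simulated[simulated >= 0]
  stopifnot(length(simulated) >= 2)

  # Assumption: classes are equal-width bins over the simulated range.
  lo <- min(simulated)
  width <- (max(simulated) - lo) / n_classes
  stopifnot(width > 0)
  class_id <- pmin(floor((simulated - lo) / width) + 1, n_classes)
  counts <- tabulate(class_id, nbins = n_classes)

  # Pick a class for each missing value with probability proportional to the
  # class count, then draw uniformly inside that class (assumption).
  picked <- sample.int(n_classes, n, replace = TRUE, prob = counts)
  draws <- runif(n, min = lo + (picked - 1) * width, max = lo + picked * width)

  intensities[missing] <- draws
  intensities
}

# Toy data: 500 proteins x 4 samples with 50 missing values; not real data.
set.seed(42)
toy <- matrix(rlnorm(2000, meanlog = 10, sdlog = 1), ncol = 4)
toy[sample(length(toy), 50)] <- NA
imputed <- impute_gaussian(toy)
summary(imputed[is.na(toy)])
```
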
The KNN-based model replaces missing values in such a way that the resulting standard deviation of the imputed protein is very close to that of its nearest neighbours. The model works as follows:
Let nb be the number of proteins in the dataset. Let X be the vector of intensities of a protein P in a condition C, and μ its median. To impute the missing values of X, the first step is to find the k (=√nb) proteins whose median in condition C is closest to μ. The median standard deviation of these k proteins, denoted σexp, is the standard deviation expected after imputation of the missing values of X.
To achieve this, the missing values of X are first replaced by μ (A). If the resulting standard deviation, denoted σres, is greater than σexp, then a standard deviation closer to σexp can never be obtained, and the missing values are replaced by μ in the data table. Conversely, if σres < σexp, the replacement value is progressively increased by d=Q3-μ until two replacement values are found whose corresponding σres bracket σexp (B). The missing values are then replaced in the dataset by values drawn at random from a Gaussian distribution centered on the median of the two replacement values, with standard deviation 0.05 (C).
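
A minimal sketch of this procedure for a single condition is given below. It rests on assumptions that the description does not settle: neighbours are searched among the other proteins (excluding P itself), their medians and standard deviations are computed once on the original data, Q3 is taken as the third quartile of the observed values of X, and k is rounded to the nearest integer. The Imputation.knn.min.occurrences filter is not applied. The helper name `impute_knn_like` is hypothetical.

```r
# Hypothetical helper: rows = proteins, columns = replicates of one condition.
impute_knn_like <- function(condition, max_steps = 1000, jitter_sd = 0.05) {
  medians <- apply(condition, 1, median, na.rm = TRUE)
  sds <- apply(condition, 1, sd, na.rm = TRUE)
  k <- max(1, round(sqrt(nrow(condition))))

  for (i in which(rowSums(is.na(condition)) > 0)) {
    x <- condition[i, ]
    missing <- is.na(x)
    mu <- medians[i]
    if (is.na(mu)) next  # no observed value: nothing to anchor on

    # k proteins whose median is closest to mu; sigma_exp is their median sd.
    others <- setdiff(seq_len(nrow(condition)), i)
    closest <- others[order(abs(medians[others] - mu))]
    neighbours <- closest[seq_len(min(k, length(closest)))]
    sigma_exp <- median(sds[neighbours], na.rm = TRUE)

    # Standard deviation of X once missing values are set to `value`.
    sd_with <- function(value) {
      y <- x
      y[missing] <- value
      sd(y)
    }

    step <- quantile(x, 0.75, na.rm = TRUE, names = FALSE) - mu  # d = Q3 - mu
    if (!is.na(sigma_exp) && step > 0 && sd_with(mu) < sigma_exp) {
      # Increase the replacement value by d until sigma_exp is bracketed.
      lower <- mu
      upper <- mu + step
      steps <- 1
      while (sd_with(upper) < sigma_exp && steps < max_steps) {
        lower <- upper
        upper <- upper + step
        steps <- steps + 1
      }
      x[missing] <- rnorm(sum(missing), mean = (lower + upper) / 2, sd = jitter_sd)
    } else {
      x[missing] <- mu  # sigma_res already >= sigma_exp: keep the median
    }
    condition[i, ] <- x
  }
  condition
}

# Toy data: 100 proteins x 4 replicates of log-intensities; not real data.
set.seed(7)
toy <- matrix(rnorm(400, mean = 20, sd = 0.5), ncol = 4)
toy[1, 4] <- NA
toy[2, 3:4] <- NA
imputed <- impute_knn_like(toy)
print(round(imputed[1:2, ], 2))
```

The `max_steps` guard is not part of the description; it only prevents an endless loop in this sketch.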
The ROC module compares the performance of DA workflows run with different parameters or models. It generates ROC curves (see Example/Reports/ROC_report.html) from a set of parameters, a list of expected variants (see Example/Datasets/ROC), and the results table from the DA module.
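
For readers unfamiliar with ROC curves, the sketch below shows how such a curve can be computed from the p-values of a DA run and a list of expected variants. The helper names `roc_points` and `auc_trapezoid` and the toy values are hypothetical, ties between p-values are not handled specially, and the module itself may rank proteins on a different score.

```r
# Hypothetical helper: one ROC point per protein, ranked by increasing p-value.
roc_points <- function(p_values, is_variant) {
  ord <- order(p_values)
  hits <- is_variant[ord]
  data.frame(
    threshold = p_values[ord],
    fpr = cumsum(!hits) / sum(!hits),  # false positive rate
    tpr = cumsum(hits) / sum(hits)     # true positive rate
  )
}

# Area under the curve by the trapezoidal rule, starting from the origin.
auc_trapezoid <- function(roc) {
  fpr <- c(0, roc$fpr)
  tpr <- c(0, roc$tpr)
  sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)
}

# Toy example: six proteins, three of which are expected variants.
p_values <- c(0.001, 0.20, 0.03, 0.60, 0.004, 0.75)
is_variant <- c(TRUE, FALSE, TRUE, FALSE, TRUE, FALSE)
roc <- roc_points(p_values, is_variant)
print(roc)
print(auc_trapezoid(roc))
```

Computing one such curve per DA results table and overlaying them gives a direct comparison between workflows.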