Rmarkdown modules dedicated to statistical analysis of quantitative proteomics data

david-bouyssie/Quantitative-proteomics

Quantitative-proteomics

This repository contains Rmarkdown modules dedicated to the statistical analysis of quantitative proteomics data.
They are divided into two projects:

  • Instrumental quality control
  • Quantitative data analysis

The first project contains modules that monitor the stability of LC-MS instruments across a series of acquisitions by analyzing the signal generated for a reference protein (e.g. CytoC).
The second project is dedicated to quantitative proteomics data analysis, from QC to statistical differential analysis.
All of these modules generate HTML reports. Examples are provided in _Example/Reports/_.

1. Launch modules with command lines

All modules can be launched from the command line using the scripts stored in the "Launchers" folder. Working examples of command lines are provided in the "run.bat" file of this folder (Windows only). For example, to run a QC on the quantification file stored at Example/Datasets/QC-DA/Parsed_proteins_set.txt, execute the following command in a terminal:

Rscript --vanilla run_proteomics_stats.R QC "../../Example/Datasets/QC-DA/Parsed_proteins_set.txt" "../../Example/Reports/QC_example.html" TRUE FALSE
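
When several reports have to be produced, the same command line can be assembled programmatically. The snippet below is a minimal sketch in Python (the modules themselves are written in R); `build_command` is a hypothetical helper, and the trailing arguments are passed through unchanged because their meaning depends on the module being launched.

```python
import subprocess


def build_command(module, input_path, output_path, *extra_args,
                  launcher="run_proteomics_stats.R"):
    """Assemble the Rscript argument list for one module run."""
    # extra_args are forwarded as-is; this sketch does not interpret them.
    return ["Rscript", "--vanilla", launcher, module,
            input_path, output_path, *extra_args]


cmd = build_command(
    "QC",
    "../../Example/Datasets/QC-DA/Parsed_proteins_set.txt",
    "../../Example/Reports/QC_example.html",
    "TRUE", "FALSE",
)
print(" ".join(cmd))

# To actually run it (requires R on the PATH and a working directory
# from which the relative paths above resolve):
# subprocess.run(cmd, check=True)
```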

2. Instrumental QC project

2.1. Input format

The modules from the instrumental QC project take as input the "cytoc_xics" file generated by Proline's extraction module (format compliant with MSstatsQC):

| AcquiredTime | Precursor | MinStartTime | MaxEndTime | Annotations | Best.RT | MaxFWHM | TotalArea | moz.assymetry | mz |
|---|---|---|---|---|---|---|---|---|---|
| 14/05/2019 06:43 | EDLIAYLK | 27.155064 | 28.234272 | RAW = QEMYC190513_16; mz = 482.7719957778697; delta PPM = 2 | 27.326147 | 0.009494722 | 5.92E+07 | 0.9891489 | 482.77 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |

You can see examples in Example/Datasets/CytoC.
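
The `Annotations` column packs several `key = value` pairs into one field. As an illustration only, here is a minimal Python sketch of how such a row could be unpacked; `parse_annotations` is a hypothetical helper, and the day-first date format is inferred from the example row above.

```python
from datetime import datetime


def parse_annotations(field):
    """Split 'key = value; key = value' into a dict of stripped strings."""
    result = {}
    for part in field.split(";"):
        if "=" not in part:
            continue  # ignore empty or malformed fragments
        key, value = part.split("=", 1)
        result[key.strip()] = value.strip()
    return result


annotations = parse_annotations(
    "RAW = QEMYC190513_16; mz = 482.7719957778697; delta PPM = 2"
)
acquired = datetime.strptime("14/05/2019 06:43", "%d/%m/%Y %H:%M")

print(annotations["RAW"])               # QEMYC190513_16
print(float(annotations["delta PPM"]))  # 2.0
print(acquired.isoformat())             # 2019-05-14T06:43:00
```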

2.2. Modules description

The instrumental QC project includes 2 modules (located at Modules/Instrumental QC):

  • Descriptive_CytoC.rmd : this module generates a report showing how the metrics (RT, intensity, PPM) evolve across acquisition times (see Descriptive_report__cytoc_ref.html).
  • Comparative_CytoC.rmd : this module generates a report that evaluates the validity of the metrics in a test series against the metrics observed in a reference series (see Comparative_report__cytoc_ref.html).

3. Quantitative data analysis project

3.1. Input format

The QC and DA modules work with a standardized input format that can be obtained from any quantification file using the parsers stored in the "Parsers" folder. The ROC module takes as input a folder containing tables generated by the DA module.

3.2. Modules description

3.2.1. QC module

The QC module generates a report providing a global visualization of data quality and reproducibility (see Example/Reports/QC_proteins_set.html). The user can set the following parameters in the command line:

| Parameter | Function | Values |
|---|---|---|
| Normalization | Should intensities be normalized? | T / F |
| keep_empty_rows | Should empty rows be kept in the dataset? | T / F |
| format_svg | Choose SVG or PNG format for figure images | SVG / PNG |
| coloring_by_group | Should figures be colored by group/condition? (WARNING: an experimental design is required to use this option) | T / F |

3.2.2. DA module

The DA module processes the data (filtering, normalization, missing-value imputation) and performs a statistical differential analysis to detect variants between biological conditions.

In addition to the standardized quantification file, it takes as inputs:

  • The experimental design (see Example/Datasets/QC-DA/ExpDesign.txt)
  • A set of parameters stored in a TSV file (see Example/Datasets/QC-DA/Parameters.txt)

All parameters and their possible values are described below:

| Parameter | Function | Values |
|---|---|---|
| Normalisation | Should intensities be normalized? | T / F |
| Proteins.for.normalization | List of IDs of the proteins the data should be normalized to | Example: 1556122;1552942;1551169;1550971;1556987 |
| Filter.threshold.ms | Minimum number of files in which the protein is identified by MS/MS for it to be kept in the analysis | |
| Filter.threshold.obs | Minimum percentage of observed intensities, in at least one condition, for the protein to be kept in the analysis | {0,100} |
| Imputation.MNAR.model | Which model should be used to impute MNAR values? | percentile / gaussian |
| Imputation.MNAR.percentile | Which percentile should be used for the "percentile" imputation model? | {0,1} |
| Imputation.MCAR.model | Which model should be used to impute MCAR values? | none / MNAR / knn |
| Imputation.MCAR.threshold.obs | Minimum number of observed intensities in a condition to classify the missing values as MCAR | {0, number of replicates} |
| Imputation.MCAR.threshold.MSMS | Minimum number of files identified by MS/MS in a condition to classify the missing values as MCAR | {0, number of replicates} |
| Imputation.knn.min.occurrences | Minimum number of observed intensities in a condition for the protein to be used as a k-nearest neighbour in the KNN imputation model | {0, number of replicates} |
| Test.type | Which statistical test should be used? | t.test / limma / wilcoxon |
| Test.log | Should intensities be log10-transformed before the statistical test is run? | T / F |
| Test.alternative | Should the statistical test be one-sided or two-sided? | two.sided / less / greater |
| Test.paired | Should the statistical test be paired? | T / F |
| Test.var.equal | If t.test is selected, specify whether variances are equal (Student's t-test is used) or not (Welch's t-test is used) | T / F |
| Test.adjust.FDR | FDR used for multiple-testing correction | {0,1} |
| Test.adjust.procedure | Multiple-testing correction procedure | none / BH / ABH |
| Fold.Change | Ratio to compute between conditions | fc / zscore |
| Volcano.threshold.pvalue | Significance threshold for the p-value | {0,1} |
| Volcano.threshold.fc | Significance threshold for the ratio | |
| Volcano.coloring.style | Color the volcano plot by significance, by intensity (heatmap) or in monochrome | signifigance / intensity / monochrome |
| Volcano.manual.accession | List of protein accession codes for manual labeling | Example: Q9JHU4;A2ASS6;Q8VDD5;Q9QXS1 |
| Volcano.no.labeling | Draw the volcano plot without labels | T / F |
| Comparisons | List of condition pairs to compare | Example: 50fmol/25fmol;50fmol/10fmol |
| Figure.format | Format of the figure images | SVG / PNG |
| No.imputed.values | Disable the imputation of missing values | T / F |
| Color.by.group | Color by condition (T) or use random coloring (F) | T / F |
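
To make the `Filter.threshold.obs` rule concrete, the following Python sketch keeps a protein when at least one condition reaches the requested percentage of observed intensities. It is an illustration written from the parameter description only, not the module's code; the function name and the use of `None` for a missing intensity are assumptions.

```python
def passes_observation_filter(intensities, design, threshold_pct):
    """True if at least one condition has >= threshold_pct % observed values.

    intensities: dict sample name -> intensity, or None when missing
    design:      dict sample name -> condition name
    """
    by_condition = {}
    for sample, condition in design.items():
        by_condition.setdefault(condition, []).append(intensities.get(sample))
    for values in by_condition.values():
        observed = sum(v is not None for v in values)
        if 100.0 * observed / len(values) >= threshold_pct:
            return True
    return False


design = {"A1": "50fmol", "A2": "50fmol", "A3": "50fmol",
          "B1": "10fmol", "B2": "10fmol", "B3": "10fmol"}
protein = {"A1": 5.2e6, "A2": None, "A3": 4.9e6,
           "B1": None, "B2": None, "B3": None}

print(passes_observation_filter(protein, design, 60))  # True  (2/3 observed in 50fmol)
print(passes_observation_filter(protein, design, 70))  # False
```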

The module outputs a TSV table summarizing the input data, the intensities after each processing step and the final results of the statistical analysis. A report containing QC figures for each processing step and figures showing the results of the differential analysis is also provided (see _Example/Reports/DA_proteins__set.html_).

3.2.2.1. Imputation model description

Three missing-value imputation models are available:

  • Percentile-based
  • Gaussian-based
  • KNN-based

The percentile-based model replaces missing values with a chosen percentile of the intensity distribution of each sample.
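
As a minimal Python sketch of the percentile-based idea, assuming missing values are encoded as `None` and the percentile is given as a fraction between 0 and 1 (as for `Imputation.MNAR.percentile`); the linear-interpolation quantile used here is a choice made for the example and may differ from the one used by the module:

```python
def percentile(values, q):
    """Linear-interpolation quantile of `values` for a fraction q in [0, 1]."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    low = int(pos)
    high = min(low + 1, len(ordered) - 1)
    return ordered[low] + (pos - low) * (ordered[high] - ordered[low])


def impute_percentile(sample, q):
    """Replace every missing value of one sample by its q-th percentile."""
    observed = [v for v in sample if v is not None]
    fill = percentile(observed, q)
    return [fill if v is None else v for v in sample]


sample = [10.0, None, 30.0, 20.0, None, 40.0]
print(impute_percentile(sample, 0.5))  # [10.0, 25.0, 30.0, 20.0, 25.0, 40.0]
```
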
The Gaussian-based model replaces missing values with values drawn from a simulated Gaussian distribution. The model works as follows:

Let n be the number of missing values to impute. Let p1 and p2 be the 1st and 2nd percentiles of the distribution of the intensities observed across all samples (A). First, a Gaussian distribution of size 2n is simulated, with median p1 and a standard deviation equal to that of the observed intensities lying between p1 and p2 (B). After negative values are removed, the simulation is divided into 21 classes (C). The number of missing values drawn at random from each class is then proportional to the number of simulated values contained in that class (C). The letters in parentheses refer to the panels of the accompanying figure.
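
The steps above can be sketched in Python as follows. This is an interpretation written from the description only, not the module's implementation: equal-width classes, drawing replacement values from the simulated values of each class, and the largest-remainder rounding of the per-class counts are all assumptions, as are the function names.

```python
import random
import statistics


def percentile(values, q):
    """Linear-interpolation quantile of `values` for a fraction q in [0, 1]."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    low = int(pos)
    high = min(low + 1, len(ordered) - 1)
    return ordered[low] + (pos - low) * (ordered[high] - ordered[low])


def gaussian_imputation_values(observed, n, n_classes=21, rng=None):
    """Draw n replacement values from a Gaussian simulated around p1."""
    rng = rng or random.Random()
    p1, p2 = percentile(observed, 0.01), percentile(observed, 0.02)
    tail = [v for v in observed if p1 <= v <= p2]
    if len(tail) < 2:
        raise ValueError("not enough intensities between p1 and p2")
    sd = statistics.stdev(tail)

    # Simulate 2n values centred on p1, then drop the negative ones.
    simulated = [v for v in (rng.gauss(p1, sd) for _ in range(2 * n)) if v >= 0]
    if not simulated:
        raise ValueError("all simulated values were negative")

    # Split the simulation into equal-width classes (assumption).
    lo, hi = min(simulated), max(simulated)
    width = (hi - lo) / n_classes or 1.0
    classes = [[] for _ in range(n_classes)]
    for v in simulated:
        classes[min(int((v - lo) / width), n_classes - 1)].append(v)

    # Per-class draw counts proportional to class sizes, summing to n.
    quotas = [n * len(c) / len(simulated) for c in classes]
    counts = [int(q) for q in quotas]
    non_empty = [i for i in range(n_classes) if classes[i]]
    non_empty.sort(key=lambda i: quotas[i] - counts[i], reverse=True)
    for i in non_empty[: n - sum(counts)]:
        counts[i] += 1

    drawn = [rng.choice(c) for c, k in zip(classes, counts) for _ in range(k)]
    rng.shuffle(drawn)
    return drawn


rng = random.Random(1)
observed = [rng.uniform(1e5, 1e7) for _ in range(1000)]  # made-up intensities
values = gaussian_imputation_values(observed, 50, rng=rng)
print(len(values))  # 50
```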


The KNN-based model replaces missing values in such a way that the resulting standard deviation of the imputed protein is very close to that of its nearest neighbours. The model works as follows:

Let nb be the number of proteins in the dataset. Let X be the vector of intensities of a protein P in a condition C, and μ its median. To impute the missing values of X, the first step is to find the k (= √nb) proteins whose median in condition C is closest to μ. The median standard deviation of these k proteins, denoted σexp, is the standard deviation expected after imputation of the missing values of X. To reach it, the missing values of X are first replaced by μ (A). If the resulting standard deviation, denoted σres, is greater than σexp, no replacement value can bring the standard deviation closer to σexp, and the missing values are replaced by μ in the data table. If instead σres < σexp, the replacement value is progressively increased in steps of d = Q3 - μ until two replacement values are found whose corresponding σres bracket σexp (B). The missing values are then replaced in the dataset by values drawn at random from a Gaussian distribution centered on the median of these two replacement values, with a standard deviation of 0.05 (C). The letters in parentheses refer to the panels of the accompanying figure.
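
The procedure can be sketched in Python as below. This is a reading of the description, not the module's code. The assumptions are: `others` holds the observed intensities of the other proteins in the same condition, neighbours need at least two observed values so that a standard deviation exists, Q3 is the third quartile of the observed values of X, and a guard returns μ when the step d is not positive. The example uses log10-scale values because the fixed standard deviation of 0.05 only makes sense on a compressed scale; that is also an assumption.

```python
import math
import random
import statistics


def sd_after_fill(observed, n_missing, fill):
    """Standard deviation of X once its missing values are set to `fill`."""
    return statistics.stdev(observed + [fill] * n_missing)


def knn_impute(observed, n_missing, others, rng=None, max_steps=10_000):
    """Return n_missing replacement values for a protein in one condition."""
    rng = rng or random.Random()
    mu = statistics.median(observed)

    # k = sqrt(nb) proteins whose median is closest to mu
    k = max(1, round(math.sqrt(len(others))))
    usable = [p for p in others if len(p) >= 2]
    nearest = sorted(usable, key=lambda p: abs(statistics.median(p) - mu))[:k]
    sigma_exp = statistics.median([statistics.stdev(p) for p in nearest])

    # (A) start by filling with mu; stop here if the spread is already too large
    if sd_after_fill(observed, n_missing, mu) >= sigma_exp:
        return [mu] * n_missing

    # (B) increase the fill value by d = Q3 - mu until sigma_exp is bracketed
    q3 = statistics.quantiles(observed, n=4, method="inclusive")[2]
    step = q3 - mu
    if step <= 0:
        return [mu] * n_missing  # guard added for this sketch
    lower = mu
    for _ in range(max_steps):
        upper = lower + step
        if sd_after_fill(observed, n_missing, upper) >= sigma_exp:
            break
        lower = upper
    else:
        raise RuntimeError("sigma_exp was not bracketed")

    # (C) draw around the midpoint of the two bracketing values
    centre = (lower + upper) / 2
    return [rng.gauss(centre, 0.05) for _ in range(n_missing)]


others = [[5.9, 6.3, 6.5], [6.0, 6.4, 6.2], [7.0, 7.1, 7.2], [6.1, 6.0, 6.6]]
imputed = knn_impute([6.0, 6.1, 6.2], 1, others, rng=random.Random(0))
print(imputed)
```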


3.2.3. ROC module

This module compares the performance of DA workflows that use different parameters or models. It generates ROC curves (see Example/Reports/ROC_report.html) from a set of parameters, a list of expected variants (see Example/Datasets/ROC), and the results tables from the DA module.
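
For readers unfamiliar with ROC curves, the Python sketch below shows how the points of such a curve can be derived from a table of p-values and a list of expected variants. Ranking proteins by p-value, ignoring ties, and the protein IDs used are assumptions made for the example; the module's own scoring may differ.

```python
def roc_points(pvalues, expected):
    """Return (false positive rate, true positive rate) points.

    pvalues:  dict protein ID -> p-value from a differential analysis
    expected: set of protein IDs known to be true variants
    """
    ranked = sorted(pvalues, key=pvalues.get)  # most significant first
    n_pos = sum(p in expected for p in ranked)
    n_neg = len(ranked) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need both expected variants and non-variants")
    tp = fp = 0
    points = [(0.0, 0.0)]
    for protein in ranked:
        if protein in expected:
            tp += 1
        else:
            fp += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


pvalues = {"P1": 0.001, "P2": 0.01, "P3": 0.03, "P4": 0.2, "P5": 0.6}
points = roc_points(pvalues, expected={"P1", "P2", "P4"})
print(points[-1])             # (1.0, 1.0)
print(round(auc(points), 3))  # 0.833
```

Computing the same curve for each DA workflow on a dataset with known variants gives a direct way to compare parameter sets: the workflow whose curve lies closer to the top-left corner separates variants from non-variants better.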
