Quantitative-proteomics

This repository contains R Markdown modules for the statistical analysis of quantitative proteomics data.
They are divided into two projects:

Instrumental quality control
Quantitative data analysis

The first project contains modules that monitor the stability of LC-MS instruments over a series of acquisitions by analyzing the signal generated for a reference protein (e.g. CytoC).
The second project is dedicated to the analysis of quantitative proteomics data, from QC to statistical differential analysis.
All of these modules generate HTML reports. Examples are provided in _Example/Reports/_.

1. Launch modules with command lines

All modules can be launched from the command line using the scripts stored in the "Launchers" folder. Working examples of command lines are provided in the "run.bat" file of this folder (Windows only). For example, to run a QC on the quantification file stored at Example/Datasets/QC-DA/Parsed_proteins_set.txt, launch the following command in a terminal:

Rscript --vanilla run_proteomics_stats.R QC "../../Example/Datasets/QC-DA/Parsed_proteins_set.txt" "../../Example/Reports/QC_example.html" TRUE FALSE

2. Instrumental QC project

2.1. Input format

The modules of the instrumental QC project take as input the "cytoc_xics" file generated by Proline's extraction module (the format is compliant with MSstatsQC):

AcquiredTime	Precursor	MinStartTime	MaxEndTime	Annotations	Best.RT	MaxFWHM	TotalArea	moz.assymetry	mz
14/05/2019 06:43	EDLIAYLK	27.155064	28.234272	RAW = QEMYC190513_16; mz = 482.7719957778697; delta PPM = 2	27.326147	0.009494722	5.92E+07	0.9891489	482.77
...	...	...	...	...	...	...	...	...	...

You can see examples in Example/Datasets/CytoC.
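As an illustration of the layout above, here is a minimal Python sketch that reads such a file. It is not part of the repository: the function names are hypothetical, the date format is assumed to be day/month/year as in the sample row, and the Annotations column is assumed to always follow the `key = value; key = value` pattern shown above.

```python
import csv
import io
from datetime import datetime

# Header and data row copied from the sample above (tab-separated).
SAMPLE = (
    "AcquiredTime\tPrecursor\tMinStartTime\tMaxEndTime\tAnnotations\t"
    "Best.RT\tMaxFWHM\tTotalArea\tmoz.assymetry\tmz\n"
    "14/05/2019 06:43\tEDLIAYLK\t27.155064\t28.234272\t"
    "RAW = QEMYC190513_16; mz = 482.7719957778697; delta PPM = 2\t"
    "27.326147\t0.009494722\t5.92E+07\t0.9891489\t482.77\n"
)


def parse_annotations(text):
    """Split 'key = value; key = value' into a dict of strings."""
    pairs = (item.split("=", 1) for item in text.split(";") if "=" in item)
    return {key.strip(): value.strip() for key, value in pairs}


def read_cytoc_xics(handle):
    """Yield one dict per row, with a few columns converted to typed values."""
    for row in csv.DictReader(handle, delimiter="\t"):
        yield {
            # Assumption: dates are written day/month/year hour:minute.
            "acquired": datetime.strptime(row["AcquiredTime"], "%d/%m/%Y %H:%M"),
            "precursor": row["Precursor"],
            "best_rt": float(row["Best.RT"]),
            "total_area": float(row["TotalArea"]),
            "annotations": parse_annotations(row["Annotations"]),
        }


rows = list(read_cytoc_xics(io.StringIO(SAMPLE)))
print(rows[0]["acquired"], rows[0]["annotations"]["delta PPM"])
```

To read a real file, pass an open file handle instead of the `io.StringIO` wrapper.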

2.2. Modules description

The instrumental QC project includes 2 modules (located at Modules/Instrumental QC):

Descriptive_CytoC.rmd : this module generates a report showing how the metrics (RT, intensity, PPM) evolve across acquisition times (see Descriptive_report__cytoc_ref.html).
Comparative_CytoC.rmd : this module generates a report that evaluates the validity of the metrics in a test series against the metrics observed in a reference series (see Comparative_report__cytoc_ref.html).
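The criterion used by the comparative module is not described here. One common way to compare a test series with a reference series is a control-limit check, sketched below in Python. The "mean ± 3 standard deviations" rule, the function names and the retention-time values are all assumptions made for illustration, not the module's documented behaviour.

```python
from statistics import mean, stdev


def control_limits(reference, n_sd=3.0):
    """Lower and upper limits derived from the reference series."""
    center = mean(reference)
    spread = stdev(reference)
    return center - n_sd * spread, center + n_sd * spread


def flag_out_of_range(reference, test, n_sd=3.0):
    """Return True for every test value falling outside the reference limits."""
    low, high = control_limits(reference, n_sd)
    return [not (low <= value <= high) for value in test]


# Hypothetical retention times (minutes) for one precursor.
reference_rt = [27.31, 27.33, 27.30, 27.34, 27.32]
test_rt = [27.32, 27.90, 27.29]

flags = flag_out_of_range(reference_rt, test_rt)
print(flags)
```

The same check can be applied to any other metric in the input file, such as TotalArea.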

3. Quantitative data analysis project

3.1. Input format

The QC and DA modules work with a standardized input format that can be obtained from any quantification file using the parsers stored in the "Parsers" folder. The ROC module takes as input a folder containing the tables generated by the DA module.

3.2. Modules description

3.2.1. QC module

The QC module generates a report providing a global view of data quality and reproducibility (see Example/Reports/QC_proteins_set.html). The user can set the following parameters on the command line:

Parameter	Function	Values
Normalization	Should intensities be normalized?	T / F
keep_empty_rows	Should empty rows be kept in the dataset?	T / F
format_svg	Choose SVG or PNG format for figure images	SVG / PNG
coloring_by_group	Should figures be colored by group/condition? (WARNING: this option requires an experimental design)	T / F

3.2.2. DA module

The DA module processes the data (filtering, normalization, missing value imputation) and performs a statistical differential analysis to detect variants between biological conditions.

In addition to the standardized quantification file, it takes as input:

The experimental design (see Example/Datasets/QC-DA/ExpDesign.txt)
A set of parameters stored in a TSV file (see Example/Datasets/QC-DA/Parameters.txt)

All parameters and their possible values are described here:

Parameter	Function	Values
Normalisation	Should intensities be normalized?	T / F
Proteins.for.normalization	List of protein IDs to which the data should be normalized	Example: 1556122;1552942;1551169;1550971;1556987
Filter.threshold.ms	Minimum number of files in which the protein is identified by MS/MS for it to be kept in the analysis	ℕ
Filter.threshold.obs	Minimum percentage of observed intensities, in at least one condition, for the protein to be kept in the analysis	[0, 100]
Imputation.MNAR.model	Which model should be used to impute MNAR (missing not at random) values?	percentile / gaussian
Imputation.MNAR.percentile	Which percentile should be used for the "percentile" imputation model?	[0, 1]
Imputation.MCAR.model	Which model should be used to impute MCAR (missing completely at random) values?	none / MNAR / knn
Imputation.MCAR.threshold.obs	Minimum number of observed intensities in a condition to classify the missing values as MCAR	{0, ..., number of replicates}
Imputation.MCAR.threshold.MSMS	Minimum number of files identified by MS/MS in a condition to classify the missing values as MCAR	{0, ..., number of replicates}
Imputation.knn.min.occurrences	Minimum number of observed intensities in a condition to use the protein as a k-nearest neighbour in the KNN imputation model	{0, ..., number of replicates}
Test.type	Which statistical test should be used?	t.test / limma / wilcoxon
Test.log	Should intensities be log10-transformed before running the statistical test?	T / F
Test.alternative	Choose whether the statistical test should be one-sided or two-sided	two.sided / less / greater
Test.paired	Choose whether the statistical test should be paired	T / F
Test.var.equal	If t.test is selected, specify whether variances are equal (Student's t-test) or not (Welch's t-test)	T / F
Test.adjust.FDR	Choose the FDR for multiple test correction	[0, 1]
Test.adjust.procedure	Choose the multiple test correction procedure	none / BH / ABH
Fold.Change	Choose the ratio that should be computed between conditions	fc / zscore
Volcano.threshold.pvalue	Significance threshold for the p-value	[0, 1]
Volcano.threshold.fc	Significance threshold for the ratio	ℝ
Volcano.coloring.style	Choose coloring based on significance or intensity (heatmap or monochrome)	signifigance / intensity / monochrome
Volcano.manual.accession	List of protein accession codes for manual labeling	Example: Q9JHU4;A2ASS6;Q8VDD5;Q9QXS1
Volcano.no.labeling	Choose to have no labels on the volcano plot	T / F
Comparisons	List of pairs of conditions to compare	Example: 50fmol/25fmol;50fmol/10fmol
Figure.format	Choose whether figure images are in SVG or PNG format	SVG / PNG
No.imputed.values	Choose to have no imputation of missing values	T / F
Color.by.group	Choose between random coloring and coloring by condition	T / F
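To make the Filter.threshold.obs rule concrete, here is a minimal Python sketch of a filter of that kind. It is an illustration under assumptions, not the module's code: the function names are hypothetical, missing intensities are represented by `None`, and the comparison is assumed to be "greater than or equal to" the threshold.

```python
def observed_percentage(values):
    """Percentage of non-missing intensities (None marks a missing value)."""
    observed = sum(value is not None for value in values)
    return 100.0 * observed / len(values)


def keep_protein(intensities_by_condition, threshold_obs):
    """Keep the protein if at least one condition reaches the threshold."""
    return any(
        observed_percentage(values) >= threshold_obs
        for values in intensities_by_condition.values()
    )


# Hypothetical protein quantified in two conditions, three replicates each.
protein = {
    "50fmol": [21.3, None, 20.9],  # 2 of 3 observed, about 66.7 %
    "25fmol": [None, None, 19.8],  # 1 of 3 observed, about 33.3 %
}

kept_at_60 = keep_protein(protein, 60)
kept_at_70 = keep_protein(protein, 70)
print(kept_at_60, kept_at_70)
```

With a threshold of 60 the protein is kept because the first condition reaches it; with a threshold of 70 no condition does, so the protein is dropped.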

The module outputs a TSV table summarizing the input data, the intensities after each processing step and the final results of the statistical analysis. It also provides a report containing QC figures for each processing step and figures showing the results of the differential analysis (see _Example/Reports/DA_proteins__set.html_).

3.2.2.1. Imputation model description

Three missing value imputation models are available:

Percentile-based
Gaussian-based
KNN-based

The percentile-based model replaces missing values with a chosen percentile of the intensity distribution of each sample.
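A minimal Python sketch of this idea follows. It is not the module's code: the helper names are hypothetical, missing values are represented by `None`, and the percentile is computed by linear interpolation between order statistics, which is one convention among several.

```python
def percentile(sorted_values, fraction):
    """Percentile by linear interpolation; fraction is in [0, 1]."""
    if not sorted_values:
        raise ValueError("no observed value to compute a percentile from")
    position = fraction * (len(sorted_values) - 1)
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    weight = position - lower
    return sorted_values[lower] * (1 - weight) + sorted_values[upper] * weight


def impute_percentile(sample, fraction):
    """Replace the missing values of one sample by the chosen percentile."""
    observed = sorted(value for value in sample if value is not None)
    fill = percentile(observed, fraction)
    return [fill if value is None else value for value in sample]


# Hypothetical intensities of one sample, with two missing values.
sample = [10.0, None, 14.0, 12.0, None, 18.0]
imputed = impute_percentile(sample, 0.25)
print(imputed)
```

Here `fraction` plays the role of Imputation.MNAR.percentile, and the function is applied to each sample separately.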
The Gaussian-based model replaces missing values with values drawn from a Gaussian simulation. Its precise workings are described below:

Let n be the number of missing values to impute. Let p1 and p2 be the 1st and 2nd percentiles of the distribution of the intensities observed across all samples (A). First, a Gaussian distribution with median p1, a standard deviation equal to that of the observed intensities lying between p1 and p2, and size 2n is simulated (B). After removal of the negative values, this simulation is divided into 21 classes (C). The number of missing values drawn at random from each class is then proportional to the number of simulated values contained in that class (C).
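The steps above can be sketched in Python as follows. This is an illustration under assumptions, not the module's implementation: the function names are hypothetical, the 21 classes are assumed to be equal-width bins, the per-class counts are rounded with a largest-remainder rule, and imputed values are assumed to be drawn (with replacement) from the simulated values of each class.

```python
import random
from statistics import stdev

N_BINS = 21  # number of classes given in the description


def percentile(sorted_values, fraction):
    """Percentile by linear interpolation; fraction is in [0, 1]."""
    position = fraction * (len(sorted_values) - 1)
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    weight = position - lower
    return sorted_values[lower] * (1 - weight) + sorted_values[upper] * weight


def impute_gaussian(observed, n_missing, rng):
    """Return n_missing values drawn from the binned Gaussian simulation."""
    ordered = sorted(observed)
    p1 = percentile(ordered, 0.01)
    p2 = percentile(ordered, 0.02)
    tail = [value for value in ordered if p1 <= value <= p2]
    if len(tail) < 2:
        raise ValueError("need at least two intensities between p1 and p2")
    sigma = stdev(tail)

    # (B) Simulate 2n values centred on p1, then drop the negative ones.
    simulated = [rng.gauss(p1, sigma) for _ in range(2 * n_missing)]
    simulated = [value for value in simulated if value >= 0]
    if not simulated:
        raise ValueError("all simulated values were negative")

    # (C) Split the simulation into equal-width bins (assumption).
    low, high = min(simulated), max(simulated)
    width = (high - low) / N_BINS or 1.0
    bins = [[] for _ in range(N_BINS)]
    for value in simulated:
        index = min(int((value - low) / width), N_BINS - 1)
        bins[index].append(value)

    # Number of draws per bin, proportional to the bin size.
    quotas = [n_missing * len(content) / len(simulated) for content in bins]
    counts = [int(quota) for quota in quotas]
    remaining = n_missing - sum(counts)
    by_remainder = sorted(
        range(N_BINS), key=lambda i: quotas[i] - counts[i], reverse=True
    )
    for i in by_remainder[:remaining]:
        counts[i] += 1

    imputed = []
    for content, count in zip(bins, counts):
        if count:
            imputed.extend(rng.choices(content, k=count))
    return imputed


rng = random.Random(42)
# Hypothetical observed intensities pooled over all samples.
observed = [rng.uniform(15.0, 30.0) for _ in range(1000)]
imputed = impute_gaussian(observed, n_missing=50, rng=rng)
print(len(imputed), min(imputed), max(imputed))
```

Passing a seeded `random.Random` instance keeps the simulation reproducible between runs.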

The KNN-based model replaces missing values in such a way that the resulting standard deviation of the imputed protein is very close to that of its nearest neighbours. Its precise workings are described below:

Let nb be the number of proteins in the dataset. Let X be the vector of intensities of a protein P in a condition C, and μ its median. To impute the missing values of X, the first step is to find the k (= √nb) proteins whose median in condition C is closest to μ. The median standard deviation of these k proteins, denoted σexp, is the standard deviation expected after imputation of the missing values of X. To reach it, the missing values of X are first replaced by μ (A). If the resulting standard deviation, denoted σres, is greater than σexp, a standard deviation closer to σexp can never be obtained, and the missing values are replaced by μ in the data table. Conversely, if σres < σexp, the replacement value is progressively increased by d = Q3 − μ until two replacement values are found whose corresponding σres bracket σexp (B). The missing values are then replaced in the dataset by values drawn at random from a Gaussian distribution centered on the median of these two replacement values, with a standard deviation of 0.05 (C).
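The procedure above can be sketched in Python as follows. This is a sketch under assumptions, not the module's implementation: the function names are hypothetical, Q3 is assumed to be the third quartile of the observed values of X, k is rounded to the nearest integer, the neighbour vectors are assumed to be fully observed, and a maximum number of steps is added as a safeguard.

```python
import math
import random
from statistics import median, stdev


def third_quartile(values):
    """Third quartile by linear interpolation (assumed definition of Q3)."""
    ordered = sorted(values)
    position = 0.75 * (len(ordered) - 1)
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    weight = position - lower
    return ordered[lower] * (1 - weight) + ordered[upper] * weight


def impute_knn(x, neighbour_vectors, rng, noise_sd=0.05, max_steps=1000):
    """Impute the missing values (None) of x, one protein in one condition.

    neighbour_vectors holds the intensities of the other proteins in the
    same condition.
    """
    observed = [value for value in x if value is not None]
    n_missing = len(x) - len(observed)
    if n_missing == 0:
        return list(x)
    if not observed:
        raise ValueError("at least one observed intensity is required")

    mu = median(observed)
    nb = len(neighbour_vectors) + 1  # proteins in the dataset, P included
    k = max(1, round(math.sqrt(nb)))
    nearest = sorted(neighbour_vectors, key=lambda v: abs(median(v) - mu))[:k]
    sigma_exp = median(stdev(vector) for vector in nearest)

    def sigma_res(fill):
        return stdev(observed + [fill] * n_missing)

    # (A) Start by filling with mu; keep it if the spread is already too large.
    step = third_quartile(observed) - mu
    if sigma_res(mu) >= sigma_exp or step <= 0:
        return [mu if value is None else value for value in x]

    # (B) Raise the fill value by d = Q3 - mu until sigma_exp is bracketed.
    previous, current = mu, mu + step
    for _ in range(max_steps):
        if sigma_res(current) >= sigma_exp:
            break
        previous, current = current, current + step

    # (C) Draw around the midpoint of the two bracketing values.
    center = (previous + current) / 2
    return [rng.gauss(center, noise_sd) if value is None else value for value in x]


# Hypothetical data: protein P with one missing value, four other proteins.
x = [20.0, 20.2, None, 20.1]
neighbours = [
    [20.0, 20.5, 19.6, 20.3],
    [20.2, 19.7, 20.6, 20.1],
    [25.0, 25.4, 24.7, 25.1],
    [15.0, 15.2, 14.9, 15.1],
]
imputed = impute_knn(x, neighbours, random.Random(0))
print(imputed)
```

In this example k = 2, so the first two neighbour vectors set the expected standard deviation, and the imputed value lands well above μ because the observed values alone are much less dispersed than their neighbours.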

3.2.3. ROC module

This module compares the performance of DA workflows run with different parameters or models. It generates ROC curves (see Example/Reports/ROC_report.html) from a set of parameters, a list of expected variants (see Example/Datasets/ROC), and the results table from the DA module.
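For readers unfamiliar with ROC curves, the Python sketch below shows how such a curve can be built from a results table and a list of expected variants. It is an illustration, not the module's code: the function names and protein identifiers are hypothetical, and ranking proteins by p-value is an assumption about which score is used.

```python
def roc_points(p_values, expected_variants):
    """Return (false positive rate, true positive rate) points.

    p_values maps each protein to its p-value; expected_variants is the set
    of proteins that are truly differential.
    """
    positives = sum(1 for protein in p_values if protein in expected_variants)
    negatives = len(p_values) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need both expected variants and non-variants")

    ranked = sorted(p_values.items(), key=lambda item: item[1])
    points = [(0.0, 0.0)]
    true_pos = false_pos = 0
    for index, (protein, p_value) in enumerate(ranked):
        if protein in expected_variants:
            true_pos += 1
        else:
            false_pos += 1
        # Proteins sharing a p-value belong to the same threshold.
        is_last = index == len(ranked) - 1
        if is_last or ranked[index + 1][1] != p_value:
            points.append((false_pos / negatives, true_pos / positives))
    return points


def area_under_curve(points):
    """Trapezoidal area under the ROC curve."""
    return sum(
        (x2 - x1) * (y1 + y2) / 2
        for (x1, y1), (x2, y2) in zip(points, points[1:])
    )


# Hypothetical DA results and expected variants.
p_values = {"P1": 0.001, "P2": 0.02, "P3": 0.03, "P4": 0.2, "P5": 0.6}
expected = {"P1", "P3"}

points = roc_points(p_values, expected)
auc = area_under_curve(points)
print(points, auc)
```

Computing one curve per DA workflow on the same dataset and comparing the areas gives a simple way to rank workflows.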
