scripts

This directory contains various scripts used by the pipeline. However, you can use most of these scripts on their own, too. Some may even be helpful in day-to-day use.

All python scripts accept the --help argument. For bash, R, and awk scripts, run head <script> to read the usage notes at the top of the file.

A python script that uses files from the prepare and classify pipelines to create a VCF with the final, predicted variants. This script also has a special internal mode, which can be used for recalibrating the QUAL scores output in the VCF.

A bash script for identifying sites at which the variant callers in our ensemble output conflicting alleles.

A bash script for extracting columns from TSVs via grep. Every argument besides the first is passed directly to grep.

A fast awk script for classifying each site in a VCF as DEL, INS, SNP, etc. It accepts a two column table (REF and ALT) from the VCF.
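To illustrate the kind of classification described above, here is a minimal Python sketch. It is not the awk script's actual logic: the length-based rules, the `MNP` and `.` labels, the function name `classify_site`, and the restriction to a single ALT allele per row are all assumptions made for illustration.

```python
def classify_site(ref: str, alt: str) -> str:
    """Label one REF/ALT pair by comparing allele lengths (assumed rules)."""
    if alt in (".", ""):
        return "."    # assumed placeholder: no alternate allele at this site
    if len(ref) == 1 and len(alt) == 1:
        return "SNP"  # single base substituted for another
    if len(ref) > len(alt):
        return "DEL"  # ALT is shorter than REF
    if len(ref) < len(alt):
        return "INS"  # ALT is longer than REF
    return "MNP"      # assumed label: equal lengths greater than one


# A two column table (REF and ALT), one site per row
table = "A\tG\nAT\tA\nA\tATT\n"
for row in table.splitlines():
    ref, alt = row.split("\t")
    print(ref, alt, classify_site(ref, alt), sep="\t")
```

Multi-allelic sites (comma-separated ALT values) would need to be split before calling the function.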

A bash script for converting all REF/ALT columns in a TSV to binary positive/negative labels using classify.awk.

A bash script for replacing NA values in a large TSV.
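The same idea can be sketched in Python for readers who want to see what such a replacement involves. This is a hypothetical stand-in, not the bash script itself: the function name `replace_na`, the `"."` replacement value, and the exact-match rule for `NA` are assumptions.

```python
def replace_na(lines, replacement=".", na_values=("NA",)):
    """Yield each TSV line with fields equal to an NA value replaced.

    Lines are processed one at a time, so memory use stays flat even
    for a large file.
    """
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        yield "\t".join(replacement if f in na_values else f for f in fields)


# Only whole fields are replaced, so "NAN" is left alone
for out in replace_na(["a\tNA\tb\n", "NA\tNAN\t1\n"]):
    print(out)
```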

A bash script for filtering rows from a large TSV by specific columns.

A python script for creating plots of the importance of each variable (i.e., feature) output by each variant caller.

A python script for calculating evaluation metrics on a two column TSV of binary labels: truth and predictions.
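As a rough illustration of this input format, the sketch below computes a few common metrics from a two column TSV of binary labels. The choice of precision, recall, and F1, and the name `binary_metrics`, are assumptions; the metrics that metrics.py actually reports may differ.

```python
import csv
import io


def binary_metrics(rows):
    """Compute precision, recall, and F1 from (truth, prediction) rows."""
    tp = fp = fn = tn = 0
    for truth, pred in rows:
        truth, pred = int(truth), int(pred)
        if truth and pred:
            tp += 1
        elif pred:
            fp += 1
        elif truth:
            fn += 1
        else:
            tn += 1
    # Guard against empty denominators when a class is absent
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}


# Columns: truth, prediction
example = "1\t1\n1\t0\n0\t1\n0\t0\n1\t1\n"
rows = csv.reader(io.StringIO(example), delimiter="\t")
result = binary_metrics(rows)
print(result)
```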

A python script for summarizing multiple metrics files output by metrics.py in a nicely formatted table.

A fast awk script for ensuring that unusual numerical values in a large TSV can be read by R.

A python script for creating precision-recall plots. It takes as input the output of metrics.py and/or statistics.py.

An R script for predicting variants using a trained classifier. It takes as input a model generated by train_RF.R.

A python script for creating ROC plots. It takes as input the output of statistics.py.

A python script for generating points to use in a precision-recall or ROC curve. It takes as input a two column TSV: true labels and prediction p-values.
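One way to generate such points is to sweep a threshold over the sorted prediction values, as in the sketch below. It is a simplified illustration rather than the logic of statistics.py: the function name `pr_points` is hypothetical, and the code assumes that a higher value means a more confident positive call. If lower values indicate more confidence, drop `reverse=True`.

```python
def pr_points(labels, scores):
    """Return (recall, precision, threshold) triples, one per distinct score."""
    total_pos = sum(labels)
    # Rank sites from most to least confident (assumed: higher is better)
    ranked = sorted(zip(scores, labels), reverse=True)
    points = []
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last site sharing this score
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            precision = tp / (tp + fp)
            recall = tp / total_pos if total_pos else 0.0
            points.append((recall, precision, score))
    return points


labels = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.1]
pts = pr_points(labels, scores)
for recall, precision, threshold in pts:
    print(recall, precision, threshold, sep="\t")
```

ROC points follow from the same sweep by recording the false positive rate (false positives divided by total negatives) against recall at each threshold.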

An R script for creating a trained classifier. We recommend using the Snakefile-classify pipeline to run this script.

An R script for visualizing the results of hyperparameter tuning from the train_RF.R script.