read_mapping_summary

This repository holds the tools needed for optimizing sequencing efforts using the Illumina NovaSeq 6000.

Section A. - When Mapping to a Portion of the Genome

Section B. - When Mapping to the Whole Genome

This repository will help you answer the following question: How much sequencing needs to be done to reach your desired depth of coverage?

Specifically, we are solving for the '% Duplicates' value needed for the Illumina Sequencing Coverage Calculator. To navigate to the window below, we clicked DNA Applications (blue button) > Whole-Genome Sequencing (in drop down menu).

Note: Although Illumina labels the unmapped reads as 'Duplicates,' this percentage represents all discarded or otherwise unmapped reads.

A. When Mapping to a Portion of the Genome

Solving for x%

We mapped reads to a portion (x%) of the genome. To calculate x%, we need to sum the number of nucleotides in the top 100 contigs (nt₁₀₀) that the reads were mapped to and divide that by the total number of nucleotides in the complete reference genome (nt_tot).

nt₁₀₀: count the number of nt in the "top 100" fasta file

# log into wahab.hpc.odu.edu
cd /home/cbird/roy/rroberts_thesis/summary_data_ssl/dDocentHPC_data/Aur/
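# drop lines containing "NODE" (the contig headers), then count characters
# note: wc -m also counts the newline at the end of each sequence line
grep -v "NODE" reference.denovoSSL.Aur-C-all-R1R2ORPH-contam-noisolate.fasta | wc -m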

nt₁₀₀ = 31156387
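
As a cross-check on the count above, here is a minimal sketch that counts only nucleotide characters. It assumes header lines start with ">" (the command above filters on "NODE" instead) and strips newlines before counting. The function name `count_nt` and the demo FASTA are hypothetical and are not part of this repository.

```shell
# Hypothetical helper: count nucleotides in a FASTA file, skipping
# header lines (those starting with ">") and newline characters.
count_nt() {
  grep -v '^>' "$1" | tr -d '\n\r' | wc -m | tr -d ' '
}

# Tiny made-up FASTA, for illustration only.
demo_fasta=$(mktemp)
printf '>NODE_1\nACGT\nAC\n>NODE_2\nGGG\n' > "$demo_fasta"
count_nt "$demo_fasta"   # prints 9 (4 + 2 + 3 bases)
rm -f "$demo_fasta"
```

Because `wc -m` on unstripped input counts one newline per sequence line, a count taken without the `tr` step will be slightly larger than the number of bases.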

nt_tot: get from the QUAST output for the library used to construct the reference genome

# the denovo_genome_assembly repo has the answer; use the QUAST output
git clone git@github.com:philippinespire/denovo_genome_assembly.git
cd denovo_genome_assembly/compare_assemblers

Open wrangle_data.R in RStudio.
Run the script.
Open the tibble named tbl_assembly.
For Aur, filter the Species column by Aur.
Sort by n50, high to low.
The top genome should be Aur_all_spades_contam_R1R2ORPH_21-99_noisolate.
Obtain the values from total_length and estimated_reference_length.

nt_tot = 439162585 (total_length) or 445000000 (estimated_reference_length)

We then divide nt₁₀₀ by nt_tot to find x%, the percentage of the genome that our reads are mapping to:

x% = (31156387 / 439162585) * 100 = 7.09%
x% = (31156387 / 445000000) * 100 = 7.00%

This means that our reads are mapping to approximately 7% of the reference genome for the Aur species.
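
The arithmetic above can be reproduced on the command line. This is a minimal sketch using awk for the floating-point division. The function name `pct_genome` and the variable names are hypothetical; the numbers are the ones reported in this section.

```shell
# Values copied from this section.
nt_100=31156387
nt_tot_a=439162585   # first nt_tot value reported above
nt_tot_b=445000000   # second nt_tot value reported above

# Hypothetical helper: percent of the reference covered by the contigs
# that reads were mapped to, printed with two decimal places.
pct_genome() {
  awk -v part="$1" -v total="$2" 'BEGIN { printf "%.2f\n", part / total * 100 }'
}

pct_genome "$nt_100" "$nt_tot_a"   # prints 7.09
pct_genome "$nt_100" "$nt_tot_b"   # prints 7.00
```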

Solving for n_x%nraw

Given that we are mapping to x% of the reference genome, we assume that the same x% of the total number of raw reads from a given library will map to the reference. To solve for the expected number of mapped reads, n_x%nraw, we multiply x% by the total number of raw reads, nr_tot.

nr_tot: find the total number of raw reads (add script here)
n_x%nraw: multiply x% by nr_tot

This means that xxx reads are expected to map to the reference for Aur.
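
The multiplication can be sketched as follows. The raw read count used here is a made-up placeholder, not a value from this project, and the function name `expected_mapped` is hypothetical. Substitute your own nr_tot once it is known.

```shell
# Hypothetical helper: expected number of mapped reads, given x% (as a
# percentage) and the total number of raw reads, rounded to a whole read.
expected_mapped() {
  awk -v x_pct="$1" -v nr_tot="$2" 'BEGIN { printf "%.0f\n", x_pct / 100 * nr_tot }'
}

x_pct=7.00        # x% from the previous step
nr_tot=1000000    # PLACEHOLDER raw read count, for illustration only
expected_mapped "$x_pct" "$nr_tot"   # prints 70000
```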

Solving for q

Now that we know the expected number of mapped reads, n_x%nraw, we can divide the actual number of mapped reads, n_x%nmapped, by n_x%nraw to obtain the proportion mapped, p.

To solve for the proportion not mapped, q, subtract p from 1 and multiply by 100.

q = (1 - p) * 100

This percent value can then be plugged into the Illumina Sequencing Coverage Calculator.
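
As a worked illustration of p and q, the sketch below uses made-up read counts that are not results from this project. The function name `pct_unmapped` is hypothetical.

```shell
# Hypothetical helper: q = (1 - p) * 100, where
# p = actual mapped reads / expected mapped reads.
pct_unmapped() {
  awk -v mapped="$1" -v expected="$2" 'BEGIN { printf "%.2f\n", (1 - mapped / expected) * 100 }'
}

n_mapped=52500     # PLACEHOLDER actual number of mapped reads
n_expected=70000   # PLACEHOLDER expected number of mapped reads
pct_unmapped "$n_mapped" "$n_expected"   # p = 0.75, so this prints 25.00
```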

B. When Mapping to the Whole Genome

Given that we are mapping to 100% of the genome, we simply divide the number of mapped reads, nr_mapped, by the total number of raw reads, nr_tot, to obtain the proportion mapped, p. As in Section A, q = (1 - p) * 100.
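
When several libraries are involved, the same division can be applied row by row. The sketch below assumes a tab-separated input with the columns library, nr_mapped, and nr_tot. That layout is an assumption for illustration and may not match the files in this repository. It also assumes the unmapped percentage is derived as in Section A, q = (1 - p) * 100. The function name `pct_unmapped_table` and the counts are hypothetical.

```shell
# Hypothetical helper: read tab-separated rows of
#   library <TAB> nr_mapped <TAB> nr_tot
# and print library <TAB> q, where q = (1 - nr_mapped / nr_tot) * 100.
pct_unmapped_table() {
  awk -F '\t' '{ printf "%s\t%.2f\n", $1, (1 - $2 / $3) * 100 }'
}

# Made-up counts, for illustration only.
printf 'libA\t800\t1000\nlibB\t450\t600\n' | pct_unmapped_table
# prints:
# libA	20.00
# libB	25.00
```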

Data Used

nr_mapped: Find total number of mapped reads with mappedReadStats.sbatch

# login to wahab.hpc.odu.edu
cd /home/e1garcia/shotgun_PIRE/pire_lcwgs_data_processing/salarias_fasciatus
sbatch ../../rroberts_thesis/scripts/bam_processing/mappedReadStats.sbatch fltrBAM/ BAM_metrics Sfa .bam

nr_tot: Find total number of raw reads using FastQC and MultiQC on raw fq.gz files

These data were generated in step 1 of the pire_fq_gz_processing repo instructions using the Multi_FASTQC.sh script. They should already exist in your species directory.

Data Wrangled

I used the script sequencing_calculations.R to read in and join the FastQC and mapped-read data in RStudio.
Functions to read in the data were sourced from read_multiqc.R.
The functions used were read_multiqc_fastqc() and read_bam_reads().
