Steps and scripts to analyse RNA-seq data for allele-specific expression (ASE) and differential expression (DE) analysis, with emphasis on the reduction of reference bias
- prepare raw reads
- prepare genomes
- align reads
- process bam files
- call SNPs
- filter variants
- annotate SNPs
- ASE and DE analysis
- most computations were performed on UBELIX, the HPC cluster at the University of Bern
- some less intense computations were performed locally on macOS
-
raw RNA-seq data are on NCBI SRA PRJNA533335
-
these are 72 RNA-seq accessions:
DataPrep/SRA_metadata.csv
, with sample names coded as follows:- species names: Ax, Petunia axillaris; Ex, P. exserta; Pa, P. parodii; AxEx/AxPa/ExPa, F1 hybrids
- developmental stages: SS, small stage; MS1, medium 1 stage; MS2, medium 2 stage; LS, long stage
- 3-6 biological replicates per condition
-
101bp single-end Illumina sequencing with stranded data
-
trimming using trimmomatic version 0.36 and quality checks using fastqc version 0.11.7:
DataPrep/qualityControl.sh
→ biased sequence content for the first couple of bases, which is supposed not to be a problem
→ high level of sequence duplication, which is expected for RNA-seq data -
only sequences from P. axillaris, P. exserta, and their F1 were used for further analysis:
files_AxEx.txt
- working with the Petunia axillaris and the P. exserta genome in parallel
- P. axillaris v3.0.4 assembly (FASTA) from CoGe (not public) and annotations (GFF) from UBELIX
$HOME/../shared/304/Maker/Peax304_final_edited.gff
(not public) - P. exserta v3.0.3 assembly (FASTA) from CoGe (not public) and v3.0.4 annotations (GFF) from UBELIX
$HOME/../shared/304/Maker/Peex304_final_edited.gff
(not public)
NOTE: as P. exserta v3.0.3 and v3.0.4 FASTA are identical except scaffold names, these were simply renamed in the GFF - convert GFF to GTF with gffread version 0.11.4
gffread file.gff -T -E -o file.gtf
- convert GFF to BED with ea-utils (May 1, 2014)
gtf2bed file.gtf >file.bed
- create dictionary and index FASTA with
picard-tools CreateSequenceDictionary file.fa
andsamtools faidx file.fa
-
identify positions that differ between P. axillaris and P. exserta, and use them during alignment to reduce reference mapping bias
-
required to align reads with STAR in WASP mode, and to generate genome with IUPAC ambiguity codes for GATK SplitNCigarReads
-
as no genomic DNA reads for the used accessions were available, Illumina reads from the assemblies of the P. exserta and the P. axillaris genomes were used to identify SNPs
-
P. exserta DNA reads were mapped to the P. axillaris genome, and vice versa, to produce bam files (done previously)
-
pre-process bam files with mapped DNA reads using picard-tools version 2.18.11 (i.e., SortSam, AddOrReplaceReadGroups, MarkDuplicates, and ReorderSam):
GenomePrep/genomeSNPs_preprocessing.sh
-
to reduce runtime for SNP calling, the P. axillaris genome was split into eight intervals, but this was not done for the P. exserta genome (see below for how the genomes were split into many more intervals for SNP calling with RNA-seq reads)
-
call SNPs for DNA reads:
GenomePrep/genomeSNPs_call.sh
(picard-tools CreateSequenceDictionary and samtools faidx to make genome .dict and .fai files; samtools index to prepare bam files; GATK HaplotypeCaller in single sample mode to call SNPs) -
filter SNPs with GATK:
GenomePrep/genomeSNPs_filter.sh
(first round of SelectVariants and VariantFiltration for quality control, second round to find SNPs that are ALT/ALT, i.e., P. exserta is homozygous for different allele)
→ generates Exreads_on_Pax304_snps_filtered.tsv (quality control) and Exreads_on_Pax304_snps_filtered_2_sel.vcf (ALT/ALT positions) -
check filters with GATK VariantsToTable (included in
GenomePrep/genomeSNPs_filter.sh
) and plot results with Rst<-read.table("Exreads_on_Pax304_snps_filtered.tsv",header=TRUE) sts<-st$DP[st$DP<400] pdf("Pex_on_Pax_DP.pdf") plot(density(sts,na.rm=TRUE),xlim=c(0,400), xlab="Read depth",main="P. exserta reads on P. axillaris genome") dev.off() pdf("Pex_on_Pax.pdf") for(i in 8:13){ plot(density(st[st[,i]>=quantile(st[,i],probs=0.05,na.rm=TRUE) & st[,i]<=quantile(st[,i],probs=0.95,na.rm=TRUE),i],na.rm=TRUE), xlim=c(quantile(st[,i],probs=c(0.05,0.95),na.rm=TRUE)), xlab=colnames(st)[i],main="P. ex on P. ax") } dev.off()
- re-code ALT/ALT homozygous positions to REF/ALT heterozygous positions and index with bcftools:
GenomePrep/processVcfFiles_index.sh
(callsGenomePrep/manipulate_vcf.pl
: NOTE: this script is not overly sophisticated and assumes certain input format) - re-code these REF/ALT positions with IUPAC ambiguity codes with GATK FastaAlternateReferenceMaker:
GenomePrep/manipulateReference.sh
(additionally callsGenomePrep/manipulate_fasta.pl
and includes picard-tools CreateSequenceDictionary and samtools faidx to make new genome .dict and .fai files)
→ generates Petunia_axillaris_304_PexIUPAC.fa
-
split genome into intervals to make SNP calling manageable in terms of memory and run time
-
make (initially 200) genome intervals by splitting chromosomes/scaffolds by Ns:
GenomePrep/splitIntervals.sh
(picard-tools ScatterIntervalsByNs and GATK SplitIntervals) -
count number of intervals in each list
IDIR=$HOME/Style/Pax_genome/intervals/intervalLists/ for INTVL in {0..199}; do IFILE=$(ls ${IDIR}/*scattered.interval_list | sed -n "$(( ${INTVL} + 1 ))p") cnt=`grep -v "^@" ${IFILE} | grep -c "^Peax"` echo "${INTVL} ${cnt}" >>intvlcnts.txt done
→ a large number of intervals is contained in lists from 0172 to 0199 (because of a bug?)
-
make second round of interval splitting for lists that contain a large number of intervals:
GenomePrep/splitIntervals_2nd.sh
-
count number of intervals in new lists
## run in $HOME/Style/Pax_genome/intervals/intervalLists/split_2nd/ grep -c "^Peax" ./* >intvlcnt.txt cut -d ':' -f2 intvlcnt.txt | sort | uniq
→ again, a large number of intervals is contained in list 0582
-
make third round of interval splitting for interval 0582:
GenomePrep/splitIntervals_3rd.sh
-
count number of new intervals
## run in $HOME/Style/Pax_genome/intervals/intervalLists/split_3rd/ grep -c "^Peax" ./* >intvlcnt.txt cut -d ':' -f2 intvlcnt.txt | sort | uniq
→ done; created a total of 588 intervals, which are used below to call SNPs in RNA-seq data
-
using OMA StandAlone version 2.3.1 to identify orthologs between P. axillaris and P. exserta
-
prepare OMA data base:
orthologs/oma_prep.sh
(callsorthologs/fastaAdjust.pl
,orthologs/fastaFilter.pl
, andorthologs/fastaInfo.pl
; the perl scriptsorthologs/fastaAdjust.pl
andorthologs/fastaFilter.pl
are based on code from the orthoMCL orthomclAdjustFasta tool from the OrthoMCL software (Li et al. 2003)) -
run OMA in tree steps (copy and paste contents of scripts):
-
step 1:
orthologs/oma_ube_0.sh
-
step 2:
orthologs/oma_ube_1st.sh
-
step 3:
orthologs/oma_ube_2nd.sh
→ generates Map-SeqNum-ID.txt and OrthologousMatrix.txt
-
using STAR version 2.6.0c
-
working with the P. axillaris and the P. exserta genome in parallel
-
STAR in WASP mode requires vcf file with variant positions (e.g., Exreads_on_Pax304_snps_filtered_2_sel.vcf, generated above)
-
step 1: generate genome indexes:
Alignment/prepareGenome.sh
NOTE: there should be nothing else in the genome directory as this step may fail with an unspecific message otherwise -
step 2: mapping in two-pass mode:
Alignment/mapReads_firstPass.sh
andAlignment/mapReads_secondPass.sh
NOTE: a potentially better way to deal with a large number of reads not aligned because they are "too short" is to modify the penalty for indels during mapping instead of setting --outFilterScoreMinOverLread 0.3 and --outFilterMatchNminOverLread 0.3 (needs to be tested) -
check mapping performance
## get basic mapping statistics for each sample dir=`sed -n '1p' $HOME/Style/step02_AxEx/files_AxEx.txt` echo "sample" >mapping_stats.txt grep "%" ${dir}/Log.final.out | cut -f1 | sed 's/ |$//g' | sed -E 's/^\s+//g' >>mapping_stats.txt while read dir; do echo "$dir" >tmp grep "%" ${dir}/Log.final.out | cut -f2 | sed 's/%$//g' >>tmp paste mapping_stats.txt <(cat tmp) >tmp2 mv -f tmp2 mapping_stats.txt rm -f tmp done <$HOME/Style/step02_AxEx/files_AxEx.txt
## make plots with R st<-read.table("mapping_stats.txt",header=TRUE,sep='\t',as.is=TRUE) pax<-which(grepl("Ax",colnames(st)) & !grepl("Ex",colnames(st))) pex<-which(grepl("Ex",colnames(st)) & !grepl("Ax",colnames(st))) f1<-which(grepl("AxEx",colnames(st))) grp<-character(ncol(st)-1) grp[pax-1]<-"1" grp[pex-1]<-"3" grp[f1-1]<-"2" grp<-as.factor(grp) pdf("mapping_stats.pdf",pointsize=16) par(mfrow=c(2,2),mar=c(3,4,1,1)) for(i in c(1,2,5,8)){ boxplot(as.numeric(st[i,-1]) ~ grp, ylab=st[i,1], names=c("Pax","F1","Pex"), xlab="") } dev.off() pdf("mapping_stats_indel.pdf",pointsize=14,height=3.5) par(mfrow=c(1,2),mar=c(3,4,1,1)) for(i in c(3,4)){ boxplot(as.numeric(st[i,-1]) ~ grp, ylab=st[i,1], names=c("Pax","F1","Pex"), xlab="") } dev.off() pdf("mapping_stats_other.pdf",pointsize=16) par(mfrow=c(2,2),mar=c(3,4,1,1)) for(i in c(6,7,9,10)){ boxplot(as.numeric(st[i,-1]) ~ grp, ylab=st[i,1], names=c("Pax","F1","Pex"), xlab="") } dev.off()
-
quality control of aligned reads with MultiQC version 1.8:
Alignment/mapReads_stats_multiQC.sh
-
index original STAR output bam files with samtools index:
SNPs/processBamFiles_index.sh
-
mark duplicates (and index) with picard-tools MarkDuplicates:
SNPs/processBamFiles.sh
-
split reads that contain Ns in their cigar string with GATK SplitNCigarReads:
SNPs/splitCigar_md.sh
, using manipulated reference genome with IUPAC ambiguity codes (e.g., Petunia_axillaris_304_PexIUPAC.fa, generated above)
NOTE: setting maximum number of mismatches allowed in the overhang to 2 (STAR --outFilterMismatchNoverReadLmax 0.04 used above means in a 50:50 overhang could be on average 2 mismatches: --max-mismatches-in-overhang 2)→ this also adjusts mapping quality from 255 to 60 and only keeps HI, nM, AS flags (index of multimapping read, number of mismatches, alignment score)
- step 1:
SNPs/SNPs_call_md.sh
, removing read duplicates (picard-tools AddOrReplaceReadGroups and GATK HaplotypeCaller) - step 2:
SNPs/SNPs_callCohortDB_md_array.sh
(GATK GenomicsDBImport and GATK GenotypeGVCFs)
NOTE: requires a huge amount of memory; splitting the genome into a few hundred intervals makes it manageable - check whether all runs finished by grepping log files for "OutOfMemoryError" and "IllegalStateException"
- step 3:
SNPs/SNPs_callCohortDB_md_gather.sh
(picard-tools MergeVcfs) - after finishing, remove all the interval_* directories, as they contain thousands of files
-
filter variants using hard filters with GATK:
SNPs/SNPs_filter_md.sh
(SelectVariants, VariantFiltration, and VariantsToTable)
→ generates cohort_md_snps_filtered.tsv and cohort_md_snps_filtered_sel.vcf -
make plot with R
st<-read.table("cohort_md_snps_filtered.tsv",header=TRUE) sts<-st$DP[st$DP<400] ##plot(density(sts,na.rm=TRUE),xlim=c(0,400),xlab="Read depth",main="RNA on P. ax") pdf("RNA_on_Pax.pdf") for(i in 8:13){ plot(density(st[st[,i]>=quantile(st[,i],probs=0.05,na.rm=TRUE) & st[,i]<=quantile(st[,i],probs=0.95,na.rm=TRUE),i],na.rm=TRUE), xlim=c(quantile(st[,i],probs=c(0.05,0.95),na.rm=TRUE)),xlab=colnames(st)[i], main="RNA on P. ax") } dev.off()
-
using ANNOVAR (2019Oct24), following steps as described in the user guide
## use gtfToGenePred tool to convert GTF file to GenePred file $HOME/bin/gtfToGenePred -genePredExt \ $HOME/Style/Pax_genome/Peax304_final_edited.gtf \ Peax304_refGene.txt ## generate a transcript FASTA file with script provided by ANNOVAR perl $HOME/bin/annovar/retrieve_seq_from_fasta.pl --format refGene --seqfile \ $HOME/Style/Pax_genome/Petunia_axillaris_304.fa \ Peax304_refGene.txt --out Peax304_refGeneMrna.fa mkdir Peax304db mv Peax304_* Peax304db/ ## modify file with variants grep "^Peax" $HOME/Style/step04_AxEx/Pax_genome/SNPs/cohort_md_snps_filtered_sel.vcf >tmp paste <(cut -f1,2 tmp) <(cut -f2 tmp) <(cut -f4,5 tmp) > snps_filtered_sel.txt rm -f tmp ## issue gene-based annotation perl $HOME/bin/annovar/annotate_variation.pl --outfile snps_filtered_sel_anno \ --buildver Peax304 \ snps_filtered_sel.txt Peax304db/
→ generates snps_filtered_sel_anno.variant_function
- calculate read counts per REF and ALT allele for each sample with GATK ASEReadCounter, not removing read duplicates:
ASE/countReadsASE.sh
- filter SNPs and make infiles for GeneiASE from info for each triplet of samples per stage (two parental species and F1):
ASE/ASE_analysis_infiles.R
→ generates input.tab, bias.txt, and SNPs.txt - run GeneiASE version 1.0.1 separately for each F1 hybrid:
ASE/runGeneiASE.sh
→ generates geneiase.out - run MBASED version 1.20.0 separately for each F1 hybrid:
ASE/runMBASED.sh
(callsASE/runMBASED.R
)
→ generates mbased.out - check where reads that map to a P. axillaris gene map in the P. exserta genome, separately for each F1 hybrid:
ASE/readCorresp.sh
(callsASE/readCorresp.pl
) andASE/readCorresp_stats.sh
(callsASE/readCorresp_stats.R
)
→ generates mapReport.txt - merge bam files for all samples using picard-tools MergeSamFiles:
DE/mapReads_merge.sh
- run featureCounts from the Subread package version 1.6.4 to count reads per gene, not removing read duplicates:
DE/countReads.sh
→ generates featureCounts_results.txt - quality control of counted reads with MultiQC version 1.8:
DE/countReads_stats_multiQC.sh
-
analyse results per stage with DESeq2 version 1.24.0 (using apeglm shrinkage), NOISeq version 2.28.0, and custom functions:
ASE/ASE_DE_analysis.R
→ generates <stage>_results.csv -
combine results from parallel runs that either used the P. axillaris or the P. exserta genome:
ASE/ASE_DE_analysis_bidir.R
-
additionally, perform DE analysis per stage for all genes with DESeq2 (using apeglm shrinkage):
DE/DE_analysis_quick.R
→ generates results_<stage>.csv -
additionally, compute correlation between gene expression and phenotypic traits across stages for each gene:
DE/correlation_analysis.R