This repository collects conceptual notes, coding practice, and assignment/competition solutions based on materials from a variety of computational biology/bioinformatics courses, workshops, technical manuals, academic articles, and other sources.
Last updated: 7 Feb 2025
- Bulk RNA-seq data analysis
- Single cell RNA-seq data analysis
- ATAC-seq data analysis
- Multi-omics idea
- COVID-19 RNA-seq data resources
- Run FastQC or fastp to evaluate sequence quality and content
- [Recommend] Use splice-aware genome aligner STAR to align the reads
- Other splice-aware alignment tool options: Olego, HISAT2, MapSplice, ABMapper, Passion, BLAT, RUM ...
- Other alignment tools that disregard isoforms: BWA, Bowtie2 ...
- Use Rsubread to align the reads
- Why align? To pinpoint the specific location on the human genome from which our reads originated
- Use Qualimap to perform quality assurance on the aligned reads
- Use MultiQC to harmonize all QC and alignment metadata from FastQC, STAR, Qualimap, and other tools
- Use GenomicAlignments for aligned reads to obtain the gene-level or exon-level quantification
- Use featureCounts for aligned reads to count the fragments
- [Recommend] Use Salmon for unaligned reads to obtain the transcript-level quantification
- Why skip alignment? To speed up read quantification
- Next step: Use tximport to aggregate transcript-level quantification to the gene level
- Perform differential gene expression analysis
- Perform principal component analysis, heatmap, and clustering
- Perform gene set enrichment analysis
- Achieve cell-type resolution in bulk RNA-Seq through deconvolution techniques (Under Active Construction)
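
The transcript-to-gene aggregation step in the quantification stage above can be illustrated with a minimal Python sketch. It only sums estimated counts per gene; tximport itself does more (for example, it can also carry transcript-length information forward), so this is an illustration of the idea rather than a reimplementation. The function name `aggregate_to_gene`, the transcript/gene identifiers, and all count values are hypothetical.

```python
from collections import defaultdict


def aggregate_to_gene(transcript_counts, tx2gene):
    """Sum transcript-level estimated counts into gene-level totals.

    transcript_counts: dict mapping transcript ID -> estimated count
    tx2gene: dict mapping transcript ID -> gene ID
    """
    gene_counts = defaultdict(float)
    for tx_id, count in transcript_counts.items():
        gene_id = tx2gene.get(tx_id)
        if gene_id is None:
            # Transcripts missing from the mapping are skipped in this sketch.
            continue
        gene_counts[gene_id] += count
    return dict(gene_counts)


# Hypothetical toy values, not real quantification output.
transcript_counts = {"tx1": 10.5, "tx2": 4.5, "tx3": 7.0, "tx4": 2.0}
tx2gene = {"tx1": "geneA", "tx2": "geneA", "tx3": "geneB"}

gene_counts = aggregate_to_gene(transcript_counts, tx2gene)
print(gene_counts)
```

Here `tx4` has no entry in the mapping and is dropped; in a real analysis, unmapped transcripts usually deserve a warning rather than silent removal.
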
- A mini scRNA-seq pipeline | Why apply pipelines?
- If given raw `bcl` files, we convert them to fastq files
- As inputs are `fastq` files, we can ...
- Run FastQC to evaluate sequence quality and content
- Use Trim Galore to trim reads if we spot unexpected low-quality base calls/adaptor contamination
- Re-run FastQC to re-evaluate sequence quality and content
- If single-cell RNA-seq data is generated from the plate-based protocol, we can ...
- Use STAR to perform alignment and featureCounts to generate the count matrix
- Else if single-cell RNA-seq data is generated from the droplet-based protocol, we can ...
- Use the kb-python package to perform pseudoalignment and generate the count matrix
- Use Cell Ranger pipelines to perform sequence alignment and generate the count matrix
- After having the `feature-barcode matrices` at hand, we can ...
- Use Scanpy workflow to perform quality assurance, cell clustering, marker gene detection for cell identities, and trajectory inference
- Use Seurat workflow to perform quality assurance, cell clustering, and marker gene detection for cell identities (case 1 | case 2)
- If we observe factor-specific clustering and want cells of the same cell type to cluster together across one or more confounding factors, we can use canonical correlation analysis or Harmony (suitable for complicated confounding effects) to integrate cells
- We can leverage SingleR or ScType to partially or fully automate cell-type identification
- Other options for automating cell-type identification by mapping to references and then transferring labels: scArches, Symphony
- Use Bioconductor packages to perform single cell RNA-Seq data analysis
- Generate pseudobulk, which aggregates the gene expression levels specific to each cell type within an individual
- Perform pseudobulk-based differential gene expression analysis in edgeR or DESeq2
- Use bulk RNAseq-based pathway analysis tools (e.g., clusterProfiler, GSEA, GSVA) or single cell RNAseq-based Pagoda2 to evaluate if a predefined set of genes shows statistically significant and consistent variations between biological conditions
- Use scGen to model perturbation responses
- Frontier single-cell RNA-seq data analyses
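
The pseudobulk step above can be sketched in plain Python: raw counts from all cells that share an individual and a cell type are summed into one profile, which can then be passed to edgeR or DESeq2. The function name `pseudobulk`, the donor labels, and the count values are hypothetical, and a real workflow would operate on a sparse count matrix rather than dictionaries.

```python
from collections import defaultdict


def pseudobulk(cells):
    """Sum raw counts over cells sharing (individual, cell type).

    cells: iterable of (individual, cell_type, {gene: raw_count}) tuples
    returns: dict mapping (individual, cell_type) -> {gene: summed_count}
    """
    totals = defaultdict(lambda: defaultdict(int))
    for individual, cell_type, counts in cells:
        for gene, n in counts.items():
            totals[(individual, cell_type)][gene] += n
    return {key: dict(genes) for key, genes in totals.items()}


# Hypothetical toy cells; the counts are made up for illustration.
cells = [
    ("donor1", "T cell", {"CD3E": 5, "MS4A1": 0}),
    ("donor1", "T cell", {"CD3E": 3, "MS4A1": 1}),
    ("donor1", "B cell", {"CD3E": 0, "MS4A1": 7}),
    ("donor2", "T cell", {"CD3E": 4, "MS4A1": 0}),
]

profiles = pseudobulk(cells)
for key, genes in sorted(profiles.items()):
    print(key, genes)
```

Summing raw counts (rather than averaging normalized values) keeps the profiles compatible with count-based models, which is why the sketch uses integers throughout.
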
- Practical guide
- Run ENCODE ATAC-seq pipeline to perform alignment, quality assurance, peak calling, and signal track generation
- If we're interested in inspecting every step in each analytical phase, or even leveraging advanced/unique features of other tools that the current pipeline ignores, ...
- For alignment and post-alignment phases, we can ...
- Use Rsubread or Rbowtie2 to align the fastq files relative to hg19/hg38/hs1
- Use GenomicAlignments and GenomicRanges to perform post-alignment processing including reading properly paired reads, estimating MapQ scores/insert sizes, reconstructing the full-length fragment, and others
- Use ATACseqQC to perform comprehensive ATAC-seq quality assurance
- For TSS analysis phase, we can ...
- Use soGGi to assess the transcriptional start site signal in the nucleosome-free open region
- For peak calling phase, we can ...
- Use MACS2 and ChIPQC to call peaks in the nucleosome-free open region, and perform quality assurance
- Or use Genrich to call peaks in the nucleosome-free open region
- Or use MACS3/MACSr (R wrapper of MACS3) to call peaks in the nucleosome-free open region
- Use ChIPseeker to annotate peak regions with genomic features
- For functional analysis phase, we can ...
- Use rGREAT to functionally interpret the peak regions based on the GO database
- Use GenomicRanges and GenomicAlignments to select and count non-redundant peaks
- Use DESeq2/DESeq2-based DiffBind and ChIPseeker to analyze differences in peaks with gene annotations across conditions
- Use clusterProfiler to perform enrichment analysis of differential peak regions
- However, functional insights gained by peak annotations can hardly illustrate what key regulators shape the transcription mechanism.
- To further infer transcription factors acting in peak regions, we can ...
- Use MotifDb/JASPAR2022 and seqLogo/[Recommend] ggseqlogo to search and visualize motifs
- Use motifmatchr (R wrapper of MOODS) to match peaks against motifs, i.e., the DNA sequences preferred by transcription factors
- Use chromVAR to analyze differences in motifs across conditions
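
To make the idea of matching peaks against motifs concrete, here is a minimal Python sketch of position weight matrix (PWM) scanning. It is not how motifmatchr or MOODS is implemented: it scores the forward strand only, assumes a uniform background, expects uppercase A/C/G/T input, and applies no p-value threshold. The function names, the 4-bp motif counts, and the peak sequence are all hypothetical.

```python
import math


def pfm_to_pwm(pfm, pseudocount=0.5, background=0.25):
    """Convert per-position base counts into log2-odds weights."""
    pwm = []
    for column in pfm:
        total = sum(column.values()) + 4 * pseudocount
        pwm.append({
            base: math.log2(
                ((column.get(base, 0) + pseudocount) / total) / background
            )
            for base in "ACGT"
        })
    return pwm


def scan(sequence, pwm):
    """Score every forward-strand window; return (start, score) pairs."""
    width = len(pwm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(pwm[i][base] for i, base in enumerate(window))
        hits.append((start, score))
    return hits


# Hypothetical count matrix for a 4-bp motif whose consensus is TATA.
pfm = [
    {"A": 0, "C": 0, "G": 0, "T": 10},
    {"A": 10, "C": 0, "G": 0, "T": 0},
    {"A": 0, "C": 0, "G": 0, "T": 10},
    {"A": 10, "C": 0, "G": 0, "T": 0},
]
pwm = pfm_to_pwm(pfm)

# Hypothetical peak sequence containing one exact match.
peak_sequence = "GGCTATAGGC"
hits = scan(peak_sequence, pwm)
best_start, best_score = max(hits, key=lambda hit: hit[1])
print(best_start, round(best_score, 2))
```

A production scan would also score the reverse complement and convert scores to p-values against a background model before calling a match.
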
- For alignment and post-alignment phases, we can ...
- Transfer the cell type labels from single-cell RNA-seq data to separately collected single-cell ATAC-seq data
- A quick start from loading an online spectrum, performing peak quality control, annotating peaks, to visualizing the annotated peaks