Quantitative Biomedicine ✨

This repository houses conceptual viewpoints, coding practice, and assignment/competition solutions drawn from a variety of computational biology and bioinformatics courses, workshops, technical manuals, academic articles, and other sources.

Last updated: 7 Feb 2025

Features

Bulk RNA-seq data analysis
Single cell RNA-seq data analysis
ATAC-seq data analysis
Multi-omics idea
COVID-19 RNA-seq data resources

Technical procedures

Analyze bulk RNA-seq data

Run FastQC or fastp to evaluate sequence quality and content
[Recommended] Use the splice-aware genome aligner STAR to align the reads
- Other splice-aware alignment tool options: Olego, HISAT2, MapSplice, ABMapper, Passion, BLAT, RUM ...
- Other alignment tools that disregard isoforms: BWA, Bowtie2 ...
Use Rsubread to align the reads
- Why align? To pinpoint the specific location on the human genome from which our reads originated
Use Qualimap to perform quality assurance on the aligned reads
Use MultiQC to harmonize all QC and alignment metadata from FastQC, STAR, Qualimap, and other tools
Use GenomicAlignments for aligned reads to obtain the gene-level or exon-level quantification
Use featureCounts for aligned reads to count the fragments
[Recommended] Use Salmon on unaligned reads to obtain the transcript-level quantification
- Why skip alignment? To speed up read quantification
- Next step: Use tximport to aggregate transcript-level quantification to the gene level
Perform differential gene expression analysis
Perform principal component analysis, heatmap, and clustering
Perform gene set enrichment analysis
Achieve cell-type resolution in bulk RNA-Seq through deconvolution techniques (Under Active Construction)
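
The transcript-to-gene aggregation step listed above (Salmon followed by tximport) can be illustrated with a minimal sketch. This is not the tximport implementation: it only sums estimated counts per gene, whereas tximport also returns abundances and average transcript lengths. The transcript IDs, gene IDs, counts, and the function name `summarize_to_gene` are hypothetical and exist only for illustration.

```python
from collections import defaultdict


def summarize_to_gene(transcript_counts, tx2gene):
    """Sum transcript-level estimated counts into gene-level totals."""
    gene_counts = defaultdict(float)
    for transcript_id, count in transcript_counts.items():
        gene_id = tx2gene.get(transcript_id)
        if gene_id is None:
            continue  # transcripts without a gene mapping are skipped
        gene_counts[gene_id] += count
    return dict(gene_counts)


# Hypothetical transcript-level estimates, as a quantifier might report them
transcript_counts = {"tx1": 10.5, "tx2": 4.5, "tx3": 7.0, "tx4": 2.0}
# Hypothetical transcript-to-gene map; "tx4" is deliberately unmapped
tx2gene = {"tx1": "geneA", "tx2": "geneA", "tx3": "geneB"}

gene_counts = summarize_to_gene(transcript_counts, tx2gene)
print(gene_counts)
```

Skipping unmapped transcripts is a design choice made here for brevity; in practice it is worth reporting how many transcripts were dropped so that annotation mismatches do not go unnoticed.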

Analyze single cell RNA-seq data

A mini scRNA-seq pipeline | Why apply pipelines?
If given raw bcl files, we convert them to fastq files
As inputs are fastq files, we can ...
- Run FastQC to evaluate sequence quality and content
- Use Trim Galore to trim reads if we spot unexpected low-quality base calls/adaptor contamination
- Re-run FastQC to re-evaluate sequence quality and content
- If single-cell RNA-seq data is generated from the plate-based protocol, we can ...
  - Use STAR to perform alignment and featureCounts to generate the count matrix
- Else if single-cell RNA-seq data is generated from the droplet-based protocol, we can ...
  - Use kb-python package to perform pseudo sequence alignment and generate the count matrix
  - Use Cell Ranger pipelines to perform sequence alignment and generate the count matrix
After having the feature-barcode matrices at hand, we can ...
- Use Scanpy workflow to perform quality assurance, cell clustering, marker gene detection for cell identities, and trajectory inference
- Use Seurat workflow to perform quality assurance, cell clustering, and marker gene detection for cell identities (case 1 | case 2)
  - If we observe factor-specific clustering and want cells of the same cell type to cluster together across one or more confounding factors, we can use canonical correlation analysis or Harmony (suitable for complicated confounding effects) to integrate cells
  - We can leverage SingleR or ScType to partially or fully automate cell-type identification
    - Other options for automating cell-type identification by mapping to references and then transferring labels: scArches, Symphony
- Use Bioconductor packages to perform single cell RNA-Seq data analysis
- Generate pseudobulk, which aggregates the gene expression levels specific to each cell type within an individual
- Perform pseudobulk-based differential gene expression analysis in edgeR or DESeq2
- Use bulk RNAseq-based pathway analysis tools (e.g., clusterProfiler, GSEA, GSVA) or single cell RNAseq-based Pagoda2 to evaluate if a predefined set of genes shows statistically significant and consistent variations between biological conditions
- Use scGen to model perturbation responses
Frontier single-cell RNA-seq data analyses
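
The pseudobulk step listed above, which aggregates expression per cell type within an individual, can be sketched as follows. This is a minimal illustration that assumes raw integer counts are summed per (individual, cell type) group. The cell IDs, donor labels, counts, and the function name `make_pseudobulk` are hypothetical, and the code does not reproduce any particular package's implementation.

```python
from collections import defaultdict


def make_pseudobulk(cell_counts, cell_meta):
    """Sum raw counts over all cells sharing the same (individual, cell type)."""
    pseudobulk = defaultdict(lambda: defaultdict(int))
    for cell_id, gene_counts in cell_counts.items():
        group = cell_meta[cell_id]  # (individual, cell_type)
        for gene, count in gene_counts.items():
            pseudobulk[group][gene] += count
    return {group: dict(genes) for group, genes in pseudobulk.items()}


# Hypothetical per-cell raw counts for two genes
cell_counts = {
    "c1": {"CD3E": 3, "MS4A1": 0},
    "c2": {"CD3E": 5, "MS4A1": 1},
    "c3": {"CD3E": 0, "MS4A1": 4},
    "c4": {"CD3E": 2, "MS4A1": 0},
}
# Hypothetical metadata: which individual and cell type each cell belongs to
cell_meta = {
    "c1": ("donor1", "T cell"),
    "c2": ("donor1", "T cell"),
    "c3": ("donor1", "B cell"),
    "c4": ("donor2", "T cell"),
}

pseudobulk = make_pseudobulk(cell_counts, cell_meta)
for group, genes in sorted(pseudobulk.items()):
    print(group, genes)
```

Each resulting group then behaves like one bulk sample, which is why count-based tools such as edgeR or DESeq2 can be applied to the aggregated matrix.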

Analyze ATAC-seq data

Practical guide
Run ENCODE ATAC-seq pipeline to perform alignment, quality assurance, peak calling, and signal track generation
If we're interested in inspecting every step in each analytical phase, or even leveraging advanced/unique features of other tools that the current pipeline ignores, ...
- For alignment and post-alignment phases, we can ...
  - Use Rsubread or Rbowtie2 to align the fastq files relative to hg19/hg38/hs1
  - Use GenomicAlignments and GenomicRanges to perform post-alignment processing including reading properly paired reads, estimating MapQ scores/insert sizes, reconstructing the full-length fragment, and others
  - Use ATACseqQC to perform comprehensive ATAC-seq quality assurance
- For TSS analysis phase, we can ...
  - Use soGGi to assess the transcriptional start site signal in the nucleosome-free open region
- For peak calling phase, we can ...
  - Use MACS2 and ChIPQC to call peaks in the nucleosome-free open region, and perform quality assurance
  - Or use Genrich to call peaks in the nucleosome-free open region
  - Or use MACS3/MACSr (R wrapper of MACS3) to call peaks in the nucleosome-free open region
  - Use ChIPseeker to annotate peak regions with genomic features
- For functional analysis phase, we can ...
  - Use rGREAT to functionally interpret the peak regions based on the GO database
  - Use GenomicRanges and GenomicAlignments to select and count non-redundant peaks
  - Use DESeq2/DESeq2-based DiffBind and ChIPseeker to analyze differences in peaks with gene annotations across conditions
  - Use clusterProfiler to perform enrichment analysis of differential peak regions
  - However, functional insights gained by peak annotations can hardly illustrate what key regulators shape the transcription mechanism.
- To further infer transcription factors acting in peak regions, we can ...
  - Use MotifDb/JASPAR2022 and seqLogo/ggseqlogo [recommended] to search and visualize motifs
  - Use motifmatchr (R wrapper of MOODS) to map peaks to motifs (the DNA sequences preferred by transcription factors)
  - Use chromVAR to analyze differences in motifs across conditions
Transfer the cell type labels from single-cell RNA-seq data to separately collected single-cell ATAC-seq data
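
The "select and count non-redundant peaks" step in the functional analysis phase above can be sketched in plain Python. This is only a conceptual illustration, loosely analogous to merging ranges and then counting overlaps. It assumes half-open (chrom, start, end) coordinates and merges peaks that overlap or touch. All peak and fragment coordinates, and the function names `merge_peaks` and `count_fragments`, are hypothetical.

```python
def merge_peaks(peaks):
    """Merge overlapping or touching (chrom, start, end) half-open intervals."""
    merged = []
    for chrom, start, end in sorted(peaks):
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            last = merged[-1]
            merged[-1] = (chrom, last[1], max(last[2], end))
        else:
            merged.append((chrom, start, end))
    return merged


def count_fragments(peaks, fragments):
    """Count fragments that overlap each peak by at least one base."""
    counts = {}
    for chrom, start, end in peaks:
        counts[(chrom, start, end)] = sum(
            1
            for f_chrom, f_start, f_end in fragments
            if f_chrom == chrom and f_start < end and f_end > start
        )
    return counts


# Hypothetical peak calls from two samples
sample_a = [("chr1", 100, 200), ("chr1", 500, 650)]
sample_b = [("chr1", 150, 260), ("chr2", 100, 180)]
consensus = merge_peaks(sample_a + sample_b)

# Hypothetical fragments from one sample
fragments = [("chr1", 120, 180), ("chr1", 255, 300),
             ("chr1", 700, 760), ("chr2", 90, 101)]
peak_counts = count_fragments(consensus, fragments)
print(consensus)
print(peak_counts)
```

The nested loop is quadratic and meant only to show the logic; real peak sets call for an interval-tree or sorted-sweep approach, which is what dedicated genomic range libraries provide.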

Analyze proteomics data

A quick start from loading an online spectrum, performing peak quality control, annotating peaks, to visualizing the annotated peaks
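
The peak quality control and annotation steps of the quick start above can be illustrated with a library-free sketch. It does not use any proteomics package's API. It assumes a spectrum is a list of (m/z, intensity) pairs, that quality control means restricting the m/z range and dropping peaks below a fraction of the base peak, and that annotation means matching peaks to theoretical fragment m/z values within an absolute tolerance. The peak list, fragment table, thresholds, and function names are hypothetical.

```python
def filter_peaks(peaks, min_mz, max_mz, min_rel_intensity):
    """Keep peaks within [min_mz, max_mz] whose intensity is at least
    min_rel_intensity times the most intense remaining peak."""
    in_range = [(mz, inten) for mz, inten in peaks if min_mz <= mz <= max_mz]
    if not in_range:
        return []
    base = max(inten for _, inten in in_range)
    return [(mz, inten) for mz, inten in in_range
            if inten >= min_rel_intensity * base]


def annotate_peaks(peaks, fragments, tol_da):
    """Label each peak with the first theoretical fragment within tol_da
    (in Da), or None when no fragment matches."""
    annotated = []
    for mz, inten in peaks:
        label = None
        for name, frag_mz in fragments.items():
            if abs(mz - frag_mz) <= tol_da:
                label = name
                break
        annotated.append((mz, inten, label))
    return annotated


# Hypothetical spectrum: (m/z, intensity) pairs
peaks = [(90.0, 50.0), (175.1, 800.0), (300.2, 10.0),
         (432.3, 1000.0), (1600.0, 400.0)]
# Hypothetical theoretical fragment m/z values
fragments = {"y1": 175.12, "b4": 432.25}

clean = filter_peaks(peaks, min_mz=100.0, max_mz=1400.0, min_rel_intensity=0.05)
annotated = annotate_peaks(clean, fragments, tol_da=0.1)
print(annotated)
```

The annotated triples could then be passed to any plotting routine, for example by drawing labelled peaks in a different color from unlabelled ones.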

