Demultiplex both forward and reverse strands with rear end only barcodes #1249

Open
eden528 opened this issue Feb 11, 2025 · 1 comment
Labels
barcode Issues related to barcoding

Comments

eden528 commented Feb 11, 2025

Hello,
I am working on demultiplexing using dorado v0.9.1 with custom barcodes. I have rear only barcodes.

Here is my DNA read structure:

Forward strand:
5' --- adapter --- read --- polyA --- mask1_front --- barcode --- mask1_rear --- 3'

Reverse strand:
5' --- adapter --- rc_mask1_rear --- rc_barcode ---  rc_mask1_front --- polyT --- rc_read --- 3'
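To make the two layouts concrete, here is a minimal Python sketch that builds both strands from the same pieces. The adapter and read sequences are hypothetical placeholders (not real ONT sequences); only the masks and the AT_BC01 barcode come from this issue. It shows that the reverse strand is the adapter followed by the reverse complement of everything after the adapter in the forward strand, so the barcode construct moves from the 3' end to the 5' end.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

# Values taken from this issue
MASK1_FRONT = "AAAAAAACCG"
MASK1_REAR = "CTCTGCGTTG"
BARCODE = "GTGTTACCGTGGGAATGAATCCTT"  # AT_BC01

# Hypothetical placeholders, for illustration only
ADAPTER = "GGTTGTTTCT"
READ = "ACGTTGCAAGGCTTAGC"
POLY_A = "A" * 12

# Forward strand: barcode construct sits at the 3' end
payload = READ + POLY_A + MASK1_FRONT + BARCODE + MASK1_REAR
forward = ADAPTER + payload

# Reverse strand: same adapter, then the reverse complement of the payload,
# so the construct now sits right after the adapter at the 5' end
reverse = ADAPTER + reverse_complement(payload)

print(forward)
print(reverse)
```

Printing both strings should show `AAAAAAACCG...CTCTGCGTTG` at the end of the forward strand and `CAACGCAGAG...CGGTTTTTTT` right after the adapter in the reverse strand.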

Scenario 1:
Based on my understanding of the documentation, I used the sequence directly preceding my barcode in the forward strand as mask1_front and the sequence directly following it as mask1_rear. I did not set mask2*. In my barcode fasta file, I included my four barcodes only as they appear in the forward strand, not the reverse strand. With this configuration only ~50% of my reads are demuxed, and the classified reads are exclusively those that match the forward strand structure.

Relevant part of arrangement.toml file:

mask1_front = "AAAAAAACCG"
mask1_rear = "CTCTGCGTTG"

barcode1_pattern = "AT_BC%02i"
first_index = 1
last_index = 4
rear_only_barcodes = true

barcodes.fasta file:

>AT_BC01
GTGTTACCGTGGGAATGAATCCTT
>AT_BC02
TTCAGGGAACAAACCAAGTTACGT
>AT_BC03
AACTAGGCACAGCGAGTCTTGGTT
>AT_BC04
AAGCGTTGAAACCTTTGTCCTCTC

Scenario 2:
If I instead set mask1_front to the reverse complement of the sequence following the barcode in the forward strand (i.e. the RC of mask1_rear from scenario 1), set mask1_rear to the reverse complement of the sequence preceding the barcode in the forward strand (i.e. the RC of mask1_front from scenario 1), and include the four barcodes together with their reverse complements in the fasta file, I still get only ~50% demuxed reads. This time only reverse strand reads are classified, and only the reverse-complement barcodes (AT_BC02, AT_BC04, AT_BC06, AT_BC08) are identified.

Relevant part of arrangement.toml file:

mask1_front = "CAACGCAGAG"
mask1_rear = "CGGTTTTTTT"

# Barcode sequences
barcode1_pattern = "AT_BC%02i"
first_index = 1
last_index = 8
rear_only_barcodes = true

barcodes.fasta file:

>AT_BC01
GTGTTACCGTGGGAATGAATCCTT
>AT_BC02
AAGGATTCATTCCCACGGTAACAC
>AT_BC03
TTCAGGGAACAAACCAAGTTACGT
>AT_BC04
ACGTAACTTGGTTTGTTCCCTGAA
>AT_BC05
AACTAGGCACAGCGAGTCTTGGTT
>AT_BC06
AACCAAGACTCGCTGTGCCTAGTT
>AT_BC07
AAGCGTTGAAACCTTTGTCCTCTC
>AT_BC08
GAGAGGACAAAGGTTTCAACGCTT
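For reference, the scenario 2 masks and the eight-entry fasta can be derived mechanically from the scenario 1 values. The sketch below is one way to do that in Python; the helper names (`reverse_complement`, `build_fasta`) are my own and not part of dorado.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

# Scenario 1 values from this issue
fwd_mask_front = "AAAAAAACCG"
fwd_mask_rear = "CTCTGCGTTG"
forward_barcodes = [
    "GTGTTACCGTGGGAATGAATCCTT",
    "TTCAGGGAACAAACCAAGTTACGT",
    "AACTAGGCACAGCGAGTCTTGGTT",
    "AAGCGTTGAAACCTTTGTCCTCTC",
]

# Scenario 2 masks: swap front/rear and reverse-complement each
rc_mask_front = reverse_complement(fwd_mask_rear)   # CAACGCAGAG
rc_mask_rear = reverse_complement(fwd_mask_front)   # CGGTTTTTTT

def build_fasta(barcodes, pattern="AT_BC{:02d}"):
    """Interleave each barcode with its reverse complement (odd = forward, even = RC)."""
    records = []
    index = 1
    for seq in barcodes:
        for entry in (seq, reverse_complement(seq)):
            records.append(f">{pattern.format(index)}\n{entry}")
            index += 1
    return "\n".join(records) + "\n"

print(f'mask1_front = "{rc_mask_front}"')
print(f'mask1_rear = "{rc_mask_rear}"')
print(build_fasta(forward_barcodes), end="")
```

With this numbering, odd indices are the forward barcodes and even indices are their reverse complements, matching the fasta above.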

This suggests that dorado is not considering both DNA strands, and does not automatically look for the reverse complement of custom barcodes or search for the reversed flanking sequences that would be expected in the complementary strand.
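A rough way to check the strand split independently of dorado is to look for the full construct near each end of a read. The sketch below uses exact string matching, which is a simplification (real nanopore reads contain errors, so an alignment-based search would be needed in practice). The function name `classify_read` and the 100-base window are assumptions of mine.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

MASK1_FRONT = "AAAAAAACCG"
MASK1_REAR = "CTCTGCGTTG"
BARCODES = {
    "AT_BC01": "GTGTTACCGTGGGAATGAATCCTT",
    "AT_BC02": "TTCAGGGAACAAACCAAGTTACGT",
    "AT_BC03": "AACTAGGCACAGCGAGTCTTGGTT",
    "AT_BC04": "AAGCGTTGAAACCTTTGTCCTCTC",
}

def classify_read(read: str, window: int = 100):
    """Return (barcode_name, strand) using exact matches, or None if nothing matches.

    'forward' means the construct was found as-is near the 3' end;
    'reverse' means its reverse complement was found near the 5' end.
    """
    head, tail = read[:window], read[-window:]
    for name, barcode in BARCODES.items():
        construct = MASK1_FRONT + barcode + MASK1_REAR
        if construct in tail:
            return name, "forward"
        if reverse_complement(construct) in head:
            return name, "reverse"
    return None

# Synthetic example read (hypothetical insert sequence)
example = "ACGTACGTAC" * 30 + "A" * 12 + MASK1_FRONT + BARCODES["AT_BC01"] + MASK1_REAR
print(classify_read(example))
print(classify_read(reverse_complement(example)))
```

Tallying the strand labels over a sample of reads should show whether the unclassified half in scenarios 1 and 2 really corresponds to the opposite strand.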

Scenario 3:
If I use the forward sequences from scenario 1 as mask1_front and mask1_rear and the reverse complement sequences from scenario 2 as mask2_front and mask2_rear, I have to add barcode2_pattern to the toml file, and I believe this configuration means dorado will expect double-ended barcodes. This produces a bam file with no reads.

Relevant part of arrangement.toml file:

mask1_front = "AAAAAAACCG"
mask1_rear = "CTCTGCGTTG"
mask2_front = "CAACGCAGAG"
mask2_rear = "CGGTTTTTTT"

# Barcode sequences
barcode1_pattern = "AT_BC%02i"
barcode2_pattern = "AT_BC%02i"
first_index = 1
last_index = 8
rear_only_barcodes = true

barcodes.fasta file:

>AT_BC01
GTGTTACCGTGGGAATGAATCCTT
>AT_BC02
AAGGATTCATTCCCACGGTAACAC
>AT_BC03
TTCAGGGAACAAACCAAGTTACGT
>AT_BC04
ACGTAACTTGGTTTGTTCCCTGAA
>AT_BC05
AACTAGGCACAGCGAGTCTTGGTT
>AT_BC06
AACCAAGACTCGCTGTGCCTAGTT
>AT_BC07
AAGCGTTGAAACCTTTGTCCTCTC
>AT_BC08
GAGAGGACAAAGGTTTCAACGCTT

Can you please help me understand how to properly design the arrangement.toml and barcodes.fasta files to demultiplex both forward and reverse strands with rear-end-only barcodes?

malton-ont (Collaborator) commented Feb 12, 2025

Hi @eden528,

This isn't really the intended use-case for rear_only_barcodes. This flag means we expect the barcode to only be at the 3' end (on either strand).

I think you can still make this work if you swap and RC the masks and barcodes in your scenario 3 and remove the rear_only_barcodes line. In other words, treat the 3' end like it is mask2 and the 5' end like it is mask1. Dorado will look for barcodes at either end as long as you do not use the --barcode-both-ends flag.

malton-ont added the barcode label Feb 12, 2025