0%

Preface

When I trained a new model using bonito, remora or the others, the validation loss and accuracy are great indicators indeed. However, I am wandering what the single base and polyT(A) accuracy are like.

From Gimpel, A.L., Stark, W.J., Heckel, R. et al., it is known that errors are always biased in the entire DNA storage process. In addition, I am curious about the extent of this bias.

Read more »

I use dorado basecaller first.

1
2
dorado basecaller hac barcode04_0.pod5 --device cuda:0  --emit-moves --emit-sam > barcode04_0_calls.bam
dorado basecaller hac barcode06_0.pod5 --device cuda:0 --emit-moves --emit-sam > barcode06_0_calls.bam

And do remora dataset prepare

1
remora dataset prepare /data/nas-shared/zhaoxy/sequencing/293T/regrouped_pod5/barcode04/barcode04_0.pod5 /data/nas-shared/zhaoxy/sequencing/293T/regrouped_pod5/barcode04/barcode04_0_calls.bam --output-path can_chunks --refine-kmer-level-table /data/biolab-nvme-pool1/zhaoxy/kmer_models/dna_r10.4.1_e8.2_400bps/9mer_levels_v1.txt --refine-rough-rescale  --motif CG 0  --mod-base-control

And error occurs:

1
[13:53:48.823] Extracted 0 chunks from 960,684 reads.
Read more »

First get all reads id from fastq files

Here we use barcode 04, 05 and 06 as examples.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
fastq_dirs = {
"barcode04": r'./sequencing/293T/barcode04',
"barcode05": r'./sequencing/293T/barcode05',
"barcode06": r'./sequencing/293T/barcode06'
}

pod5_id_data = { "barcode04": set(), "barcode05": set(), "barcode06": set() }

def find_fastq_files(folder_path):
folder = Path(folder_path)
return [str(file) for file in folder.rglob('*.fastq')]

for barcode, fastq_dir in fastq_dirs.items():
read_ids = set()
for fastq_file in tqdm(find_fastq_files(fastq_dir), desc=f"Processing {barcode}"):
try:
for record in SeqIO.parse(fastq_file, 'fastq'):
read_ids.add(record.id)
except Exception as e:
print(f"Error reading {fastq_file}: {e}")

pod5_id_data[barcode] = read_ids
print(f"Total read ids in {fastq_dir}: {len(read_ids)}")

Then read POD5 files and store them by barcodes

Read more »