
A debugging record of network training using REMORA

First, I run the dorado basecaller.

dorado basecaller hac barcode04_0.pod5 --device cuda:0  --emit-moves --emit-sam > barcode04_0_calls.bam
dorado basecaller hac barcode06_0.pod5 --device cuda:0 --emit-moves --emit-sam > barcode06_0_calls.bam

Then I run remora dataset prepare.

remora dataset prepare /data/nas-shared/zhaoxy/sequencing/293T/regrouped_pod5/barcode04/barcode04_0.pod5 /data/nas-shared/zhaoxy/sequencing/293T/regrouped_pod5/barcode04/barcode04_0_calls.bam --output-path can_chunks --refine-kmer-level-table /data/biolab-nvme-pool1/zhaoxy/kmer_models/dna_r10.4.1_e8.2_400bps/9mer_levels_v1.txt --refine-rough-rescale  --motif CG 0  --mod-base-control

An error occurs:

[13:53:48.823] Extracted 0 chunks from 960,684 reads.

I suspect this is caused by --emit-sam, so I re-basecall with hac and drop that flag.

dorado basecaller hac barcode04_0.pod5 --device cuda:0  --emit-moves > barcode04_0_calls.bam
dorado basecaller hac barcode06_0.pod5 --device cuda:0 --emit-moves > barcode06_0_calls.bam

The same error still occurs when remora dataset prepare is run again.
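
To see what a BAM record actually carries, the fields and tags of a single read can be listed (a quick diagnostic sketch; this is plain samtools, nothing Remora-specific):

samtools view barcode04_0_calls.bam | head -n 1 | tr '\t' '\n'

For an unaligned dorado run, the record shows * in the reference-name column and carries no MD tag.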

After asking ChatGPT, I learn that dorado's basecalling output alone does not contain the reference sequence, so mapping with minimap2 is required. I align with minimap2 against the reference GCF_000001405.26_GRCh38_genomic.fna.

Run as follows

minimap2 -ax map-ont /data/biolab-nvme-pool1/zhaoxy/HG38/GCF_000001405.26_GRCh38_genomic.fna /data/nas-shared/zhaoxy/sequencing/293T/regrouped_pod5/barcode06/barcode06_0_calls.bam

There is no output. I suspect minimap2 cannot read BAM input, so I convert the reads to FASTQ first.

samtools fastq /data/nas-shared/zhaoxy/sequencing/293T/regrouped_pod5/barcode06/barcode06_0_calls.bam > barcode06_0_reads.fastq

Then minimap2 runs normally.

minimap2 -ax map-ont /data/biolab-nvme-pool1/zhaoxy/HG38/GCF_000001405.26_GRCh38_genomic.fna \
barcode06_0_reads.fastq | samtools sort -o barcode06_0_calls.sorted.bam
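
One caveat with this route (an aside; I did not do this here): plain samtools fastq drops the auxiliary tags dorado wrote, such as the mv move table that Remora uses to map signal to bases. Those tags can presumably be carried through the FASTQ comments and restored by minimap2; a sketch assuming the documented samtools -T and minimap2 -y options (the tag list and the *_tagged filenames are placeholders):

# keep dorado's per-read tags in the FASTQ comment, then copy them back into the alignments
samtools fastq -T mv,ts barcode06_0_calls.bam > barcode06_0_reads_tagged.fastq
minimap2 -y -ax map-ont /data/biolab-nvme-pool1/zhaoxy/HG38/GCF_000001405.26_GRCh38_genomic.fna \
barcode06_0_reads_tagged.fastq | samtools sort -o barcode06_0_calls.tagged.sorted.bam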

Check the mapping quality.

$ samtools flagstat barcode06_0_calls.sorted.bam
2499617 + 0 in total (QC-passed reads + QC-failed reads)
2124344 + 0 primary
246248 + 0 secondary
129025 + 0 supplementary
0 + 0 duplicates
0 + 0 primary duplicates
1197500 + 0 mapped (47.91% : N/A)
822227 + 0 primary mapped (38.70% : N/A)
0 + 0 paired in sequencing
0 + 0 read1
0 + 0 read2
0 + 0 properly paired (N/A : N/A)
0 + 0 with itself and mate mapped
0 + 0 singletons (N/A : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
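
Only about 39% of the primary reads map, which is quite low. To count just the primary mapped records directly (a sketch; the 0x904 mask combines the unmapped, secondary, and supplementary flag bits):

samtools view -c -F 0x904 barcode06_0_calls.sorted.bam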

Running remora dataset prepare still gives an error like the one below (found it here; I lost my own output, LOL). Querying ChatGPT shows that the BAM file has no MD tag.

Indexing BAM by parent read id: 3383 Reads [00:00, 145643.63 Reads/s]
[11:52:01.676] Extracting read IDs from POD5
[11:52:01.696] Found 3,383 valid BAM records. Found signal in POD5 for 100.00% of BAM records.
[11:52:01.697] Making reference-anchored training data
[11:52:01.697] Opening dataset for output
[11:52:02.158] Processing reads
Extracting chunks: 100%|█████████████| 3383/3383 [00:00<00:00, 64772.53 Reads/s]
[11:52:19.245] Unsuccessful read/chunk reasons:
3,383 : No reference sequence (missing MD tag)
[11:52:19.680] Extracted 0 chunks from 3,383 reads.
[11:52:19.680] Label distribution:
[11:52:19.680] Shuffling dataset
[11:52:19.681] Done
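
In hindsight, the MD tag could also have been added at this stage instead of re-basecalling; a sketch of two common options, assuming the standard samtools calmd and minimap2 --MD behavior (I did not take this route, and the output filenames are placeholders):

# add MD tags to the existing alignments using the reference
samtools calmd -b barcode06_0_calls.sorted.bam /data/biolab-nvme-pool1/zhaoxy/HG38/GCF_000001405.26_GRCh38_genomic.fna > barcode06_0_calls.md.bam
# or ask minimap2 to emit MD tags during alignment
minimap2 --MD -ax map-ont /data/biolab-nvme-pool1/zhaoxy/HG38/GCF_000001405.26_GRCh38_genomic.fna barcode06_0_reads.fastq | samtools sort -o barcode06_0_calls.md.sorted.bam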

Instead, I decide to have dorado basecaller generate the MD tag directly by aligning against a reference.

dorado basecaller fast barcode06_0.pod5 --emit-moves --reference /data/biolab-nvme-pool1/zhaoxy/polyT_proj/bonito/HG38_ref.mmi > barcode06_0_calls.bam

Here the .mmi index file was generated with

minimap2 -d reference.mmi reference.fasta

Re-running remora dataset prepare now gives the results below.

Indexing BAM by parent read id: 988718 Reads [00:08, 111136.05 Reads/s]
[21:11:53.894] Extracting read IDs from POD5
[21:11:54.867] Found 960,686 valid BAM records. Found signal in POD5 for 100.00% of BAM records.
[21:11:54.923] Making reference-anchored training data
[21:11:54.924] Opening dataset for output
[21:11:55.806] Processing reads
Extracting chunks: 100%|██████████| 959963/959963 [02:38<00:00, 6048.10 Reads/s]
[21:14:35.217] Unsuccessful read/chunk reasons:
900,695 : Unmapped read
59 : Sequence too long
[21:14:36.123] Extracted 375,538 chunks from 960,686 reads.
[21:14:36.125] Label distribution: control:375,538
[21:14:36.125] Shuffling dataset
[21:14:44.266] Done
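
The label distribution confirms this is the control dataset, can_chunks. The modified dataset mod_chunks referenced in the config below is prepared with the same command, swapping --mod-base-control for --mod-base; a sketch assuming 5mC calls at the same CG motif (the input paths are placeholders for the modified sample, and the --mod-base arguments follow the Remora examples):

# mod_sample.* are placeholder paths for the modified sample's pod5 and aligned BAM
remora dataset prepare \
mod_sample.pod5 \
mod_sample_calls.bam \
--output-path mod_chunks \
--refine-kmer-level-table /data/biolab-nvme-pool1/zhaoxy/kmer_models/dna_r10.4.1_e8.2_400bps/9mer_levels_v1.txt \
--refine-rough-rescale \
--motif CG 0 \
--mod-base m 5mC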

Make config.

remora \
dataset make_config \
train_dataset.jsn \
can_chunks \
mod_chunks \
--dataset-weights 1 1 \
--log-filename train_dataset.log

Train.

remora \
model train \
train_dataset.jsn \
--model remora/models/ConvLSTM_w_ref.py \
--device 0 \
--chunk-context 50 50 \
--output-path train_results

It works anyway, though somewhat by accident. 👇

[22:24:42.546] Start training
DEBUG [22:24:42.587:Process-1:MainThread:data_chunks.py:2596] Dataset worker 0 using super batch offsets 235206, 3642521
DEBUG [22:24:42.601:Process-2:MainThread:data_chunks.py:2596] Dataset worker 1 using super batch offsets 132909, 2131482
DEBUG [22:26:24.227:MainProcess:MainThread:train_model.py:546] Saving best model after 1 epochs with val_acc 0.667724609375
DEBUG [22:28:05.155:MainProcess:MainThread:train_model.py:546] Saving best model after 2 epochs with val_acc 0.6732177734375
DEBUG [22:31:29.841:MainProcess:MainThread:train_model.py:546] Saving best model after 4 epochs with val_acc 0.6737060546875
DEBUG [22:33:11.024:MainProcess:MainThread:train_model.py:546] Saving best model after 5 epochs with val_acc 0.677978515625
DEBUG [22:36:40.797:MainProcess:MainThread:train_model.py:546] Saving best model after 7 epochs with val_acc 0.67822265625
DEBUG [22:40:18.411:MainProcess:MainThread:train_model.py:546] Saving best model after 9 epochs with val_acc 0.6785888671875
DEBUG [22:42:06.939:MainProcess:MainThread:train_model.py:546] Saving best model after 10 epochs with val_acc 0.678955078125
DEBUG [22:47:37.572:MainProcess:MainThread:train_model.py:546] Saving best model after 13 epochs with val_acc 0.6790771484375
DEBUG [23:01:59.566:MainProcess:MainThread:train_model.py:546] Saving best model after 20 epochs with val_acc 0.6798095703125
DEBUG [23:04:08.326:MainProcess:MainThread:train_model.py:546] Saving best model after 21 epochs with val_acc 0.6807861328125
DEBUG [23:21:54.837:MainProcess:MainThread:train_model.py:546] Saving best model after 29 epochs with val_acc 0.68115234375
[23:45:27.898] No validation accuracy improvement after 10 epochs. Training stopped early.
[23:45:27.898] Saving final model checkpoint
[23:45:28.154] Done
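
Validation accuracy plateaus around 0.68 before early stopping kicks in. If the model were worth keeping, the best checkpoint in train_results could presumably be exported for use with dorado via remora model export (a sketch; the checkpoint filename is an assumption, so check what train_results actually contains):

# model_best.pt is assumed; substitute the actual checkpoint saved in train_results
remora model export train_results/model_best.pt train_results/exported_model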