Biological Plausibility Assessment of Viral Sequences Generated by a Genomic Language Model
Riccardo Bellazzi;Enea Parimbelli;Luigi Portinale
2026-01-01
Abstract
Generative genomic language models (gLMs) can produce nucleotide sequences that appear statistically plausible while violating essential biological constraints. We present VirGen2, a compact model specialized via continual pretraining on a pan-viral corpus. Generations were evaluated across sampling temperatures and prompt lengths against natural sequences, the base pretrained model, and 3-mer and 6-mer Markov baselines. Validation combined generation diagnostics, BLASTn similarity against NCBI Viral RefSeq, viral signature detection with geNomad, open reading frame assessment, mutation profiling, and structural evaluation with AlphaFold2. VirGen2 generated substantially more biologically coherent sequences than the baselines, particularly at low temperature. Furthermore, a k-mer containment audit demonstrated internalization of viral syntax rather than memorization of training data. In a focused candidate case study, several outputs showed viral-like signatures, intact coding regions, and predicted protein structures consistent with known viral proteins. These findings suggest that domain-adapted compact gLMs can learn biologically meaningful viral constraints, while highlighting open limitations in controllability, novelty assessment, and experimental validation.
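As an illustration of what a k-mer containment audit measures, the following is a minimal sketch and not the authors' implementation. It computes the fraction of distinct k-mers in a generated sequence that also occur in a reference set of training sequences. The function names, the handling of ambiguous bases, and the value of k are assumptions made for this example; the abstract does not state the protocol used in the paper.

```python
from typing import Iterable, Set

VALID_BASES = set("ACGT")


def kmers(seq: str, k: int) -> Set[str]:
    """Return the distinct k-mers of seq, skipping windows with non-ACGT symbols."""
    seq = seq.upper()
    return {
        seq[i:i + k]
        for i in range(len(seq) - k + 1)
        if set(seq[i:i + k]) <= VALID_BASES
    }


def build_index(training_seqs: Iterable[str], k: int) -> Set[str]:
    """Collect every distinct k-mer seen in the (hypothetical) training sequences."""
    index: Set[str] = set()
    for s in training_seqs:
        index |= kmers(s, k)
    return index


def containment(generated: str, index: Set[str], k: int) -> float:
    """Fraction of the generated sequence's distinct k-mers found in the index.

    Returns 0.0 when the sequence is too short to contain any valid k-mer.
    """
    gen = kmers(generated, k)
    if not gen:
        return 0.0
    return len(gen & index) / len(gen)


if __name__ == "__main__":
    k = 4  # toy value for illustration; real audits typically use much longer k-mers
    index = build_index(["ACGTACGTAC"], k)
    print(containment("ACGTACGT", index, k))  # every k-mer is in the index
    print(containment("ACGTTTTT", index, k))  # only one of four k-mers is shared
```

Under this reading, containment close to 1 at long k would point toward copying from the training data, while low containment alongside good biological plausibility scores would be more consistent with learned sequence regularities. How the paper thresholds or aggregates this statistic is not specified in the abstract.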


