Biological Plausibility Assessment of Viral Sequences Generated by a Genomic Language Model

Riccardo Bellazzi;Enea Parimbelli;Luigi Portinale
2026-01-01

Abstract

Generative genomic language models (gLMs) can produce nucleotide sequences that appear statistically plausible while violating essential biological constraints. We present VirGen2, a compact model specialized via continual pretraining on a pan-viral corpus. Generations were evaluated across sampling temperatures and prompt lengths against natural sequences, the base pretrained model, and 3-mer and 6-mer Markov baselines. Validation combined generation diagnostics, BLASTn similarity against NCBI Viral RefSeq, viral signature detection with geNomad, open reading frame assessment, mutation profiling, and structural evaluation with AlphaFold2. VirGen2 generated substantially more biologically coherent sequences than the baselines, particularly at low temperature. Furthermore, a k-mer containment audit indicated that the model internalized viral syntax rather than memorizing its training data. In a focused candidate case study, several outputs showed viral-like signatures, intact coding regions, and predicted protein structures consistent with known viral proteins. These findings suggest that domain-adapted compact gLMs can learn biologically meaningful viral constraints, while highlighting open limitations in controllability, novelty assessment, and experimental validation.
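The abstract does not define the k-mer containment audit, so the following is only a minimal sketch of one common reading: the fraction of a generated sequence's distinct k-mers that also occur in the training corpus. The function names, the toy sequences, and the choice of k = 5 are illustrative assumptions, not details taken from the paper.

```python
def kmers(seq: str, k: int) -> set[str]:
    """Return the set of distinct k-mers in a nucleotide sequence."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def containment(generated: str, reference_kmers: set[str], k: int) -> float:
    """Fraction of the generated sequence's distinct k-mers found in the reference set.

    Values near 1.0 suggest near-verbatim reuse of reference material;
    lower values suggest the sequence is not a copy. (Assumed definition.)
    """
    gen = kmers(generated, k)
    if not gen:
        return 0.0  # sequence shorter than k: nothing to compare
    return len(gen & reference_kmers) / len(gen)


# Toy stand-in for a training corpus (hypothetical sequences).
training = ["ATGGCGTACGTTAGC", "TTGACCGATAGGCTA"]
k = 5  # illustrative value only

ref = set()
for s in training:
    ref |= kmers(s, k)

copied = "GGCGTACGTTA"      # exact substring of the first training sequence
novel = "ATGCCCAAATTTGGG"   # shares no 5-mer with the toy corpus

print(containment(copied, ref, k))
print(containment(novel, ref, k))
```

In practice such an audit would run over the full corpus with a larger k and report the distribution of containment scores across many generations, not individual values.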
Use this identifier to cite or link to this document: https://hdl.handle.net/11579/230403