Biological Plausibility Assessment of Viral Sequences Generated by a Genomic Language Model

Pablo Arozarena Donelli; Rancati, Simone; Nicora, Giovanna; Bellazzi, Riccardo; Parimbelli, Enea; Portinale, Luigi

doi:10.1007/978-3-032-30813-9_47

Generative genomic language models (gLMs) can produce nucleotide sequences that appear statistically plausible while violating essential biological constraints. We present VirGen2, a compact model specialized via continual pretraining on a pan-viral corpus. Generations were evaluated across sampling temperatures and prompt lengths against natural sequences, the base pretrained model, and 3-mer and 6-mer Markov baselines. Validation combined generation diagnostics, BLASTn similarity against NCBI Viral RefSeq, viral signature detection with geNomad, open reading frame assessment, mutation profiling, and structural evaluation with AlphaFold2. VirGen2 generated substantially more biologically coherent sequences than the baselines, particularly at low temperature. Furthermore, a k-mer containment audit demonstrated that the model had internalized viral syntax rather than memorized its training data. In a focused candidate case study, several outputs showed viral-like signatures, intact coding regions, and predicted protein structures consistent with known viral proteins. These findings suggest that domain-adapted compact gLMs can learn biologically meaningful viral constraints, while highlighting open limitations in controllability, novelty assessment, and experimental validation.
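To make the idea of a k-mer containment audit concrete, the following is a minimal sketch of how containment of a generated sequence in a reference set could be computed. It is not the authors' procedure: the function names `kmers` and `containment`, the default `k = 31`, and the use of exact (non-canonical, forward-strand only) k-mers are all assumptions made for illustration. A containment near 1 would suggest the generated sequence largely reuses substrings from the references, while a low value would suggest it does not.

```python
from typing import Iterable, Set


def kmers(seq: str, k: int) -> Set[str]:
    """Return the set of distinct length-k substrings of seq (uppercased)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def containment(query: str, references: Iterable[str], k: int = 31) -> float:
    """Fraction of the query's distinct k-mers that occur in any reference.

    Hypothetical helper: k = 31 is an illustrative default, not a value
    taken from the paper. Reverse complements are not considered.
    """
    ref_kmers: Set[str] = set()
    for ref in references:
        ref_kmers |= kmers(ref, k)

    query_kmers = kmers(query, k)
    if not query_kmers:
        # Query shorter than k: no k-mers to compare.
        return 0.0
    return len(query_kmers & ref_kmers) / len(query_kmers)
```

For large corpora, an exact in-memory set like this would likely be replaced by a sketching or indexed approach; the version above only shows the quantity being measured.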