Exploring the Evolutionary Landscape of AI-Generated Viral Sequences: a Case Study on HIV-1
Riccardo Bellazzi;Enea Parimbelli;Luigi Portinale
2026-01-01
Abstract
Generative genomic language models (gLMs) have shown considerable promise in capturing the statistical regularities of DNA, yet a persistent challenge remains: distinguishing statistically plausible "biological hallucinations" from truly functional synthetic sequences. In this work, we evaluate the potential biological viability of HIV-1 sequences generated by a compact 17M-parameter, decoder-only transformer trained on a pan-viral corpus of ~400,000 complete genomes. Focusing on a stable low-temperature setting, we analyzed 142 AI-generated sequences (approximately 2.5 kb long) spanning the HIV-1 gag/pol region. Our validation pipeline employed BLASTn homology searches, mutational profiling against the HXB2 reference, and subtyping/recombination tools. Of the 108 sequences covering the full gag coding DNA sequence (CDS), 69 (63.9%) were free of premature stop codons, indicating a high success rate in generating potentially functional open reading frames (ORFs). Mutational analysis revealed 28–48 mutations per sequence with a balanced synonymous/non-synonymous ratio. Notably, Shannon entropy calculated across the gag subregions (p17, p24, p7, p6) mirrored natural conservation patterns, with the capsid (p24) exhibiting the lowest variability (0.13) and p6 the highest (0.33). Although training data imbalance led to a predominance of Subtype B, the model also generated a recombinant variant of Subtype A. AlphaFold2 structural predictions yielded pLDDT scores comparable to natural references, consistent with the generated proteins maintaining biophysical stability. Finally, we benchmarked the model against the non-adapted base Mistral model, the state-of-the-art gLM Evo2, and a stochastic 6-mer Markov model, and ran a memorization audit to verify output novelty.
These findings demonstrate that compact gLMs adapted to viral genomics can internalize intricate biological constraints and conservation signatures, providing a resource-efficient pathway for designing functional viral components for vaccine and gene therapy research.

| File | Size | Format | |
|---|---|---|---|
| 112_Exploring_the_Evolutionary.pdf (Preprint; license not specified; available to authorized users) | 5.32 MB | Adobe PDF | |
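
Two of the sequence-level checks named in the abstract, the premature-stop-codon screen and the Shannon entropy of a gag subregion, can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the function names, the toy sequences, the use of nucleotide columns, the base-2 logarithm, and the averaging of per-column entropies into one value per region are all assumptions made for illustration, since the abstract does not specify them.

```python
import math
from collections import Counter

# Standard stop codons in DNA notation.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def has_premature_stop(cds: str) -> bool:
    """Return True if an in-frame stop codon occurs before the final codon.

    Assumes `cds` starts in frame; trailing bases that do not fill a
    complete codon are ignored.
    """
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    # The last codon is allowed to be the natural stop, so it is skipped.
    return any(codon in STOP_CODONS for codon in codons[:-1])


def column_entropy(column: str) -> float:
    """Shannon entropy (in bits) of the symbols in one alignment column."""
    counts = Counter(column)
    total = len(column)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def mean_entropy(aligned: list[str]) -> float:
    """Average per-column entropy over equal-length, non-empty aligned sequences."""
    entropies = [column_entropy("".join(col)) for col in zip(*aligned)]
    return sum(entropies) / len(entropies)


if __name__ == "__main__":
    # Hypothetical toy inputs, not data from the paper.
    print(has_premature_stop("ATGAAATAA"))
    print(has_premature_stop("ATGTAGAAATAA"))
    print(mean_entropy(["ACGT", "ACGT", "ACGA", "ACGC"]))
```

Under these assumptions, a fully conserved region scores 0, and larger values indicate more variable positions, which is the sense in which the abstract contrasts p24 with p6.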


