
Enhancing Medical Image Report Generation through Standard Language Models: Leveraging the Power of LLMs in Healthcare

Giorgio Leonardi;Luigi Portinale;Andrea Santomauro
2023-01-01

Abstract

In recent years, Artificial Intelligence has undergone a deep transformation, driven primarily by advances in deep learning architectures. Among these, the Transformer architecture has emerged as a pivotal milestone, revolutionizing natural language processing as well as many other tasks and domains. The Transformer's ability to capture contextual dependencies across sequences, paired with its parallelizable design, makes it exceptionally versatile. This versatility is fundamental in the healthcare field, where the ability to integrate and process data from various modalities, such as medical images, clinical notes and patient records, is of paramount importance to enable AI models to provide more informed answers. This complexity raises the demand for models that can integrate information from multiple modalities such as text, images and audio: multimodal transformers, sophisticated architectures able to process and fuse information across different modalities. Furthermore, an important goal in the healthcare domain is to focus on pre-trained models, given the scarcity of large datasets in this field and the need to minimise computational resources, since healthcare organizations are typically not equipped with high-performance computing devices. This paper presents a methodology for harnessing pre-trained large language models based on the transformer architecture to facilitate the integration of different data sources, with a specific focus on the fusion of radiological images and textual reports. The ensuing approach involves fine-tuning pre-existing textual models, enabling their seamless extension to diverse domains.
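The abstract describes fusing radiological images with textual reports through a pre-trained text model. The paper itself details the method; as a purely illustrative sketch (not the authors' implementation), one common fusion pattern is to project image features into the language model's token-embedding space with a small trainable linear layer, so image "tokens" and report tokens form a single sequence the pretrained model can attend over. All dimensions and names below are assumptions for illustration.

```python
import numpy as np

# Hypothetical sketch: a frozen pretrained text model consumes token
# embeddings of dimension D_TXT; a visual encoder yields image features
# of dimension D_IMG. A small trainable linear projection maps image
# features into the text embedding space (the projection is what gets
# fine-tuned, keeping compute needs modest).
D_IMG, D_TXT = 512, 768                      # assumed feature sizes
rng = np.random.default_rng(0)

W = rng.normal(scale=0.02, size=(D_IMG, D_TXT))  # trainable projection
b = np.zeros(D_TXT)

def project_image_features(img_feats):
    """Map (n_patches, D_IMG) image features into the text embedding space."""
    return img_feats @ W + b

def fuse(img_feats, text_embeds):
    """Prepend projected image 'tokens' to the report's token embeddings."""
    return np.concatenate([project_image_features(img_feats), text_embeds], axis=0)

img = rng.normal(size=(49, D_IMG))   # e.g. a 7x7 patch grid from a vision encoder
txt = rng.normal(size=(32, D_TXT))   # 32 report tokens, already embedded
seq = fuse(img, txt)
print(seq.shape)                     # (81, 768): one sequence for the pretrained LM
```

The resulting fused sequence can then be fed to the pretrained language model, with only the lightweight projection (and optionally a few model layers) updated during fine-tuning.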
Files in this product:
Final(Proceedings).pdf — Description: Paper; Type: Editorial Version (PDF); License: not specified; Size: 1.19 MB; Format: Adobe PDF (available to authorized users only)

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11579/166802