Comparing different positional encodings for the interpretation of medical images

Andrea Santomauro; Giorgio Leonardi; Luigi Portinale
2025-01-01

Abstract

Traditional positional encoding (PE) methods in Vision Transformers (ViT) focus primarily on spatial information, but they may not adequately capture the complex geometric patterns intrinsic to medical images. To address this limitation, we have previously proposed a similarity-based positional encoding that combines convolution operations with standard cosine similarity between image patches. In the present work we compare similarity-based PE with two traditional alternatives in ViT, namely standard learned PE and rotary PE. The goal is to show that, in addition to providing better classification accuracy on 2D images in different medical domains, the attention maps generated by similarity-based PE appear to be more meaningful than those generated by the alternative encodings, focusing on the medically relevant parts of the images. Finally, we also show the benefits of the proposed approach in dealing with 3D medical images, again in terms of classification performance. We validate our method on a set of six medical imaging datasets from MedMNIST, a collection of benchmark datasets covering medical images of various kinds, such as X-rays, histological samples, and dermoscopic, ultrasound and microscope images.
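
As a rough illustration of one ingredient named in the abstract, the cosine similarity between image patches, the following minimal sketch computes a patch-to-patch similarity matrix for a toy 2D image. It is not the authors' method: the abstract does not say how the convolution operations and the similarity values are combined into a positional encoding, so the convolution step is omitted here. The function names (`split_into_patches`, `cosine_similarity`, `similarity_matrix`) and the toy image are hypothetical and chosen only for this example.

```python
import math


def split_into_patches(image, patch):
    """Split a 2D list (H x W) into flattened, non-overlapping patch vectors."""
    h, w = len(image), len(image[0])
    patches = []
    for r in range(0, h - patch + 1, patch):
        for c in range(0, w - patch + 1, patch):
            patches.append(
                [image[r + i][c + j] for i in range(patch) for j in range(patch)]
            )
    return patches


def cosine_similarity(u, v, eps=1e-12):
    """Cosine of the angle between two vectors; eps guards against zero norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / max(norm_u * norm_v, eps)


def similarity_matrix(patches):
    """Pairwise cosine similarity between all patch vectors."""
    return [[cosine_similarity(p, q) for q in patches] for p in patches]


# Toy 4x4 "image" (assumed data), split into four 2x2 patches.
image = [
    [1, 2, 0, 0],
    [3, 4, 0, 1],
    [1, 2, 5, 5],
    [3, 4, 5, 5],
]
patches = split_into_patches(image, 2)
sim = similarity_matrix(patches)
```

In this toy case the first and third patches have identical content, so their similarity is close to 1 even though they sit at different positions. This is the kind of content-dependent signal that a purely spatial encoding does not carry.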
Use this identifier to cite or link to this document: https://hdl.handle.net/11579/217422