Comparing different positional encodings for the interpretation of medical images
Andrea Santomauro; Giorgio Leonardi; Luigi Portinale
2025-01-01
Abstract
Traditional positional encoding (PE) methods in Vision Transformers (ViT) focus primarily on spatial information, but they may not adequately capture the complex geometric patterns intrinsic to medical images. To address this limitation, we have previously proposed a similarity-based positional encoding that combines convolution operations with standard cosine similarity between image patches. In the present work we compare similarity-based PE with two traditional alternatives in ViT, namely standard learned PE and rotary PE. The goal is to show that, in addition to providing better classification accuracy on 2D images in different medical domains, similarity-based PE generates attention maps that appear more meaningful than those generated by the alternative encodings, focusing on the medically relevant parts of the images. Finally, we also show the benefits of the proposed approach in dealing with 3D medical images, again in terms of classification performance. We validate our method on a set of six medical imaging datasets from MedMNIST, a collection of benchmark datasets covering various kinds of medical images, such as X-rays, histological samples, dermoscopic, ultrasound, and microscope images.


