Comparing different positional encodings for the interpretation of medical images

Santomauro, Andrea; Leonardi, Giorgio; Portinale, Luigi

Traditional positional encoding (PE) methods in Vision Transformers (ViT) focus primarily on spatial information, but they may not adequately capture the complex geometric patterns intrinsic to medical images. To address this limitation, we have previously proposed a similarity-based positional encoding combining convolution operations and standard cosine similarity between image patches. In the present work, we compare similarity-based PE with two traditional alternatives in ViT: standard learned PE and rotary PE. The goal is to show that, in addition to providing better classification accuracy on 2D images in different medical domains, similarity-based PE generates attention maps that appear to be more meaningful than those generated by the alternative encodings, focusing on the medically relevant parts of the images. Finally, we also show the benefits of the proposed approach in dealing with 3D medical images, again in terms of classification performance. We validate our method on a set of six medical imaging datasets from MedMNIST, a collection of benchmark datasets covering medical images of various kinds, such as X-rays, histological samples, and dermoscopic, ultrasound, and microscope images.