Transformers for image processing (ViTs): Part One

Mitterand Ekole
2 min read · Feb 23, 2023
Vision Transformer Model (ViT)

This is a simple explanation of what ViTs are. In Part Two, we will take a more technical approach and walk through a code implementation of Vision Transformers (ViTs).

Vision Transformers (ViTs) are models for image processing that use transformer-like architectures. Transformers were originally designed for natural language processing tasks, where they learn the relationships between pairs of input tokens.

In computer vision, ViTs split an image into fixed-size patches, linearly embed each patch, add position embeddings, and feed the resulting sequence of vectors into a standard transformer encoder, as the sketch below illustrates.
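To make this concrete, here is a minimal PyTorch sketch of the patching and embedding step. The 224×224 input size, 16×16 patches, and 768-dimensional embedding are assumptions borrowed from the standard ViT-Base configuration, not details from this article:

```python
import torch
import torch.nn as nn

image_size, patch_size, embed_dim = 224, 16, 768   # ViT-Base-style settings (assumed)
num_patches = (image_size // patch_size) ** 2      # 14 * 14 = 196 patches

# A convolution whose kernel and stride both equal the patch size cuts the
# image into non-overlapping patches and linearly embeds each one in one step.
patch_embed = nn.Conv2d(3, embed_dim, kernel_size=patch_size, stride=patch_size)

# One learnable position embedding per patch.
pos_embed = nn.Parameter(torch.zeros(1, num_patches, embed_dim))

x = torch.randn(1, 3, image_size, image_size)      # dummy batch with one RGB image
patches = patch_embed(x)                           # (1, 768, 14, 14)
tokens = patches.flatten(2).transpose(1, 2)        # (1, 196, 768) patch sequence
tokens = tokens + pos_embed                        # inject position information
```

The Conv2d here is just an implementation shortcut: with kernel and stride equal to the patch size, it is exactly a per-patch linear projection.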

To perform image classification, say, ViTs prepend an extra learnable “classification token” to the sequence; the encoder’s output at that position is then used to predict the class, as in the sketch below.
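Continuing the sketch above, the snippet below prepends a learnable classification token and reads the prediction from its output. The 12-layer, 12-head encoder and the 1000-class head are again assumed ViT-Base/ImageNet-style settings; a full implementation would also give the classification token its own position embedding, omitted here for brevity:

```python
import torch
import torch.nn as nn

embed_dim, num_patches = 768, 196
tokens = torch.randn(1, num_patches, embed_dim)    # embedded patches from the previous sketch

# Learnable classification token, prepended to the patch sequence.
cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
tokens = torch.cat([cls_token.expand(tokens.shape[0], -1, -1), tokens], dim=1)  # (1, 197, 768)

# A standard transformer encoder processes the whole sequence
# (12 layers / 12 heads, mirroring ViT-Base).
encoder_layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=12, batch_first=True)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=12)

# Only the classification token's final representation feeds the linear head.
head = nn.Linear(embed_dim, 1000)                  # e.g. 1000 ImageNet classes
logits = head(encoder(tokens)[:, 0])               # shape (1, 1000)
```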

ViTs are often compared with other image processing models such as CNNs; how they compare depends on several factors:

  • The size and complexity of the ViT model. Larger ViT models tend to perform better than smaller ones, but they also require more computational resources and data.
  • The pre-training dataset and model. ViTs benefit significantly from large-scale pre-training on datasets such as ImageNet-21K or JFT-300M. However, such datasets (and the models pre-trained on them) may not be easily accessible, and the learned representations may not generalize to every task or domain.
  • The specific task or domain. ViTs may have advantages over CNNs for tasks that require a global or semantic understanding of images, such as generative modeling or multi-modal tasks. However, they may face challenges in tasks or domains that require fine-grained spatial information, such as localization or segmentation.

ViTs are fairly new, and although they already perform well at some tasks, they still have limitations, including:

  • They often require a large number of tokens to obtain reasonable results, and because self-attention scales quadratically with sequence length, this drives up computational cost and memory usage.
  • They may lose some spatial information when splitting an image into patches, which can hurt tasks that require fine-grained localization or segmentation.
  • They may depend on large-scale pre-training datasets and models, which can limit their accessibility and generalization.

Overall, ViTs are a promising alternative to CNNs for image processing, but they are not necessarily better or worse. They have different strengths and weaknesses that depend on various factors, so it is important to evaluate each architecture based on its suitability for the task at hand.
