Image-Specific Adaptation of Transformer Encoders for Compute-Efficient Segmentation (2026)
by Manyi Yao, Abhishek Aich, Yumin Suh, Amit Roy-Chowdhury, Christian Shelton, and Manmohan Chandraker
Abstract:
Vision transformer-based models bring significant improvements to image segmentation tasks. Although these architectures offer powerful capabilities irrespective of the specific segmentation task, their use of computational resources can be taxing on deployed devices. One way to overcome this challenge is to adapt the computation level to the specific needs of the input image rather than follow the current one-size-fits-all approach. To this end, we introduce ECO-M2F, or EffiCient TransfOrmer Encoders for Mask2Former-style models. Noting that the encoder module of M2F-style models incurs resource-intensive computations, ECO-M2F provides a strategy to self-select the number of hidden layers in the encoder, conditioned on the input image. To enable this self-selection ability and balance performance against computational efficiency, we present a three-step recipe. The first step is to train the parent architecture to enable early exiting from the encoder. The second step is to create a derived dataset of the ideal number of encoder layers required for each training example. The third step is to use this derived dataset to train a gating network that predicts the number of encoder layers to be used, conditioned on the input image. Additionally, to change the compute-accuracy trade-off, only steps two and three need to be repeated, which significantly reduces retraining time. Experiments on public datasets show that the proposed approach reduces expected encoder computational cost while maintaining performance, adapts to various user compute resources, is flexible in architecture configurations, and can be extended beyond the segmentation task to object detection.
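Illustrative sketch (not from the paper): step two of the recipe, deriving the ideal number of encoder layers per training example, might look like the Python below. It assumes that a per-layer quality score is already available for every example from the early-exit model of step one, and that the ideal depth maximizes quality minus a linear depth penalty. The function name `derive_exit_labels`, the penalty weight `beta`, and the linear utility are all hypothetical choices made for illustration; the abstract does not specify the actual criterion.

```python
from typing import List, Sequence


def derive_exit_labels(quality: Sequence[Sequence[float]], beta: float) -> List[int]:
    """Return, for each example, a 1-based encoder depth to use as a training label.

    quality[i][k] is an assumed quality score for example i when the encoder
    exits after layer k + 1. The utility q - beta * depth is an assumption
    made for this sketch, not the paper's stated objective.
    """
    labels = []
    for per_layer in quality:
        utilities = [q - beta * (k + 1) for k, q in enumerate(per_layer)]
        # max() keeps the first maximum, so ties resolve to the shallowest depth.
        best = max(range(len(utilities)), key=utilities.__getitem__)
        labels.append(best + 1)
    return labels


# Toy scores for two examples and a three-layer encoder (made-up numbers).
toy_quality = [
    [0.50, 0.60, 0.61],  # saturates early: extra layers add little
    [0.20, 0.30, 0.45],  # keeps improving with depth
]

for beta in (0.0, 0.05, 0.2):
    print(beta, derive_exit_labels(toy_quality, beta))
```

Under these assumptions, changing `beta` only changes the derived labels, so a gating network (step three) can be retrained on the new labels without retraining the parent model, which mirrors the claim that only steps two and three need to be repeated.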
Download Information
Manyi Yao, Abhishek Aich, Yumin Suh, Amit Roy-Chowdhury, Christian Shelton, and Manmohan Chandraker (2026). "Image-Specific Adaptation of Transformer Encoders for Compute-Efficient Segmentation." 5th Workshop of WACV 2026 on Image/Video/Audio Quality Assessment in Computer Vision, VLM and Diffusion Model.
Bibtex citation
@inproceedings{Yaoetal26workshop,
author = "Manyi Yao and Abhishek Aich and Yumin Suh and Amit Roy-Chowdhury and Christian Shelton and Manmohan Chandraker",
title = "Image-Specific Adaptation of Transformer Encoders for Compute-Efficient Segmentation",
booktitle = "5th Workshop of WACV 2026 on Image/Video/Audio Quality Assessment in Computer Vision, VLM and Diffusion Model",
year = 2026,
}