Summary of Slicing Vision Transformer For Flexible Inference, by Yitian Zhang et al.


Slicing Vision Transformer for Flexible Inference

by Yitian Zhang, Huseyin Coskun, Xu Ma, Huan Wang, Ke Ma, Chen, Derek Hao Hu, Yun Fu

First submitted to arxiv on: 6 Dec 2024

Categories

  • Main: Computer Vision and Pattern Recognition (cs.CV)
  • Secondary: Machine Learning (cs.LG)



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
The high difficulty version is the paper's original abstract; read it on arXiv.

Medium Difficulty Summary (written by GrooveSquid.com, original content)
This paper proposes a framework called Scala that enables a single Vision Transformer (ViT) network to represent multiple smaller ViTs, allowing flexible inference. This is achieved by activating several subnets during training, introducing isolated activation to disentangle the smallest subnet from the others, and applying scale coordination so that each subnet receives a simplified learning objective. Comprehensive experiments on different tasks show that Scala learns slimmable representations without modifying the original ViT structure and matches the performance of training each subnet separately. The approach achieves an average improvement of 1.6% on ImageNet-1K with fewer parameters than prior art.

Low Difficulty Summary (written by GrooveSquid.com, original content)
This paper tries to make a big computer vision model smaller and more flexible so it can work well in situations where resources are limited. The researchers found that small versions of the model are essentially parts of the bigger model, so they developed a way to use one model to do the job of many. They tested their approach on different tasks and showed that it works as well as training separate models for each size. The new approach is faster and uses fewer computing resources than the old way.

Keywords

» Artificial intelligence  » Inference  » Vision transformer  » ViT