
Summary of CHOSEN: Compilation to Hardware Optimization Stack for Efficient Vision Transformer Inference, by Mohammad Erfan Sadeghi et al.


CHOSEN: Compilation to Hardware Optimization Stack for Efficient Vision Transformer Inference

by Mohammad Erfan Sadeghi, Arash Fayyazi, Suhas Somashekar, Massoud Pedram

First submitted to arXiv on: 17 Jul 2024

Categories

  • Main: Computer Vision and Pattern Recognition (cs.CV)
  • Secondary: Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR)



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
Read the original abstract here

Medium Difficulty Summary (written by GrooveSquid.com, original content)
Vision Transformers (ViTs) have revolutionized computer vision by applying the self-attention mechanisms used in NLP to analyze image patches. Despite their advantages, deploying ViTs on FPGAs is challenging because of their non-linear calculations and high computational and memory demands. This paper introduces CHOSEN, a software-hardware co-design framework that addresses these challenges for optimal ViT deployment on FPGAs. The framework features a multi-kernel design, approximations of non-linear functions with minimal accuracy degradation (a toy sketch of this idea follows the summaries below), efficient use of logic blocks, and an innovative compiler that optimizes performance and memory efficiency through a novel algorithmic design-space exploration. Compared to state-of-the-art accelerators, CHOSEN achieves 1.5x and 1.42x throughput improvements on the DeiT-S and DeiT-B models.
Low Difficulty Summary (written by GrooveSquid.com, original content)
ViTs are a new way of doing computer vision that’s like a superpower for images. They work by looking at small pieces of the image and figuring out what they mean, kind of like how we read words to understand sentences. But when we try to use them on special computers called FPGAs, it gets tricky because these computers are really good at doing simple math, but ViTs need to do complicated math too. This paper shows a new way to make ViTs work better on FPGAs by breaking down the math into smaller pieces and using the computer’s strengths to speed things up. It even shows that this method can make ViTs run up to 1.5 times faster than other approaches!
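
One of the hurdles both summaries mention is that non-linear operations, such as the softmax inside self-attention, are expensive on FPGAs, so frameworks like CHOSEN approximate them with hardware-friendly arithmetic. The snippet below is a minimal, hypothetical sketch of that general idea, not the paper's actual kernels: it assumes a base-2 rewrite of the exponential with a linearized fractional term (a common FPGA-friendly trick) and compares it against an exact softmax on toy attention scores.

```python
# Hypothetical sketch only -- this is NOT the paper's actual kernel design.
# It illustrates replacing an exact non-linearity (softmax inside self-attention)
# with an FPGA-friendly approximation: the exponential is rewritten in base 2 and
# the fractional power of two is linearized, so the heavy lifting reduces to
# shifts and adds in fixed-point hardware.
import numpy as np

def softmax_exact(x):
    # Numerically stable reference softmax over the last axis.
    z = x - x.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def softmax_shift_approx(x):
    # e**z == 2**(z * log2(e)); split the exponent into integer and fractional
    # parts, approximate 2**frac by (1 + frac), and realize the integer part as
    # a binary shift (here via ldexp, i.e. (1 + frac) * 2**int_part).
    z = (x - x.max(axis=-1, keepdims=True)) * np.log2(np.e)
    int_part = np.floor(z)
    frac = z - int_part
    e = np.ldexp(1.0 + frac, int_part.astype(int))
    return e / e.sum(axis=-1, keepdims=True)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    scores = rng.standard_normal((4, 16, 16))  # toy attention scores: (batch, queries, keys)
    err = np.abs(softmax_exact(scores) - softmax_shift_approx(scores)).max()
    print(f"max abs deviation from exact softmax: {err:.3e}")
```

The point of the comparison is the trade-off such designs aim for: the approximate version needs only shifts, adds, and one divide per row, at the cost of a small deviation from the exact probabilities.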

Keywords

» Artificial intelligence  » NLP  » Optimization  » Self-attention  » ViT