Summary of OmniBal: Towards Fast Instruct-tuning for Vision-Language Models via Omniverse Computation Balance, by Yongqiang Yao et al.
OmniBal: Towards Fast Instruct-tuning for Vision-Language Models via Omniverse Computation Balance
by Yongqiang Yao, Jingru Tan, Jiahao Hu, Feizhao Zhang, Yazhe Niu, Xin Jin, Bo Li, Ruihao Gong, Pengfei Liu, Dahua Lin, Ningyi Xu
First submitted to arXiv on: 30 Jul 2024
Categories
- Main: Artificial Intelligence (cs.AI)
- Secondary: None
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary |
---|---|---|
High | Paper authors | Read the original abstract on the paper's arXiv page |
Medium | GrooveSquid.com (original content) | Recently developed vision-language instruct-tuning models have made significant progress thanks to their comprehensive understanding of the world. However, large-scale 3D parallel training of these models suffers from an imbalanced computation load across devices, caused by the inherent heterogeneity between the vision and language parts, which hurts distributed training efficiency. To address this, the authors rebalance the computational load from the data, model, and memory perspectives: they group instances into new balanced mini-batches within and across devices, use a search-based method to find a balanced partitioning of the model, and adaptively adjust the re-computation strategy of each partition to make full use of the available memory (a rough code sketch of the data-balancing idea appears after the table). Extensive experiments validate the method, showing roughly a 1.8x speed-up over the open-source training code of InternVL-Chat, and its effectiveness and generalizability are further demonstrated across various models and datasets. |
Low | GrooveSquid.com (original content) | Large vision-language instruct-tuning models have made big progress, but training them can be slow because some devices end up working harder than others. The model really has two different parts, one for pictures and one for words, and they have to work together even though their workloads are not equal. The researchers found that by balancing how much work each device does, they could make training run faster. They did this by grouping similar-sized examples together, using a search method to split the model fairly across devices, and adjusting how often each part re-does some work so that all the available memory gets used. This made training about 1.8 times faster than before! They also tested it with different models and datasets, and it worked well. |
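
To make the data-balancing idea above more concrete, here is a minimal sketch of greedily grouping samples of uneven vision/text length into per-device mini-batches with roughly equal total compute. This is an illustrative assumption, not the paper's actual implementation: the cost model (token counts as a proxy for compute) and the function `balanced_minibatches` are hypothetical.

```python
from typing import List

def balanced_minibatches(
    sample_costs: List[int],        # per-sample compute cost, e.g. vision + text token count (assumed proxy)
    num_devices: int,
    batch_size_per_device: int,
) -> List[List[int]]:
    """Greedily assign sample indices to per-device mini-batches so that
    the total cost on each device is roughly equal (illustrative only)."""
    # Place the most expensive samples first, always onto the currently
    # lightest device that still has room in its mini-batch.
    order = sorted(range(len(sample_costs)), key=lambda i: -sample_costs[i])
    batches = [[] for _ in range(num_devices)]
    loads = [0] * num_devices
    for idx in order:
        candidates = [d for d in range(num_devices)
                      if len(batches[d]) < batch_size_per_device]
        if not candidates:
            break  # remaining samples would go into the next global batch
        d = min(candidates, key=lambda d: loads[d])
        batches[d].append(idx)
        loads[d] += sample_costs[idx]
    return batches

# Example: 8 samples with uneven vision/text lengths, 2 devices, 4 samples each.
costs = [1200, 300, 800, 950, 200, 700, 400, 650]
print(balanced_minibatches(costs, num_devices=2, batch_size_per_device=4))
```

A greedy longest-first assignment like this is a common heuristic for evening out workloads across devices; the paper's full method additionally balances the model partitioning and the re-computation strategy, which are not shown in this sketch.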