Summary of "Inducing High Energy-Latency of Large Vision-Language Models with Verbose Images," by Kuofeng Gao et al.
Inducing High Energy-Latency of Large Vision-Language Models with Verbose Images
by Kuofeng Gao, Yang Bai, Jindong Gu, Shu-Tao Xia, Philip Torr, Zhifeng Li, Wei Liu
First submitted to arXiv on: 20 Jan 2024
Categories
- Main: Computer Vision and Pattern Recognition (cs.CV)
- Secondary: Cryptography and Security (cs.CR)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary
---|---|---
High | Paper authors | Read the original abstract here
Medium | GrooveSquid.com (original content) | The paper explores the attack surface of large vision-language models (VLMs) such as GPT-4, focusing on manipulating energy consumption and latency during inference. The researchers propose a novel approach using “verbose images” to induce VLMs to generate longer sequences, which leads to significant increases in energy-latency cost. To achieve this, they design three loss objectives: delaying the occurrence of the end-of-sequence (EOS) token, increasing uncertainty over each generated token, and promoting token diversity within the sequence. They also propose a temporal weight adjustment algorithm to balance these losses. Experimental results demonstrate that verbose images can increase the generated sequence length by 7.87× on MS-COCO and 8.56× on ImageNet.
Low | GrooveSquid.com (original content) | The paper investigates ways to make large vision-language models (VLMs) like GPT-4 costlier to run. One attack makes them use more energy and take longer to respond, which could overwhelm a service if many people use it at the same time. The researchers propose “verbose images,” crafted inputs that make VLMs generate longer texts and thus consume more energy. To do this, they design three objectives: delay the point where the model is supposed to stop writing, make each word more uncertain, and make the words differ from one another. They also develop a way to balance these objectives so they work well together. When tested on two large datasets, verbose images increased the generated text length nearly eightfold.
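The three loss objectives described in the summaries can be sketched as differentiable terms over a model's output logits and token representations. This is a minimal illustration based only on the summary's description, not the paper's exact formulation: the function names, the cosine-similarity form of the diversity term, and the fixed weights are all assumptions (the paper balances the terms with a temporal weight adjustment algorithm rather than constant weights).

```python
# Hypothetical sketch of the three verbose-image loss objectives,
# inferred from the summary (not the paper's exact equations).
import torch
import torch.nn.functional as F

def eos_delay_loss(logits: torch.Tensor, eos_token_id: int) -> torch.Tensor:
    # Penalize probability mass on the end-of-sequence token at every
    # decoding step, encouraging the model to keep generating.
    probs = F.softmax(logits, dim=-1)            # (seq_len, vocab)
    return probs[:, eos_token_id].mean()

def uncertainty_loss(logits: torch.Tensor) -> torch.Tensor:
    # Negative mean per-token entropy: minimizing this loss maximizes
    # uncertainty over each generated token.
    log_probs = F.log_softmax(logits, dim=-1)
    entropy = -(log_probs.exp() * log_probs).sum(dim=-1)  # (seq_len,)
    return -entropy.mean()

def diversity_loss(hidden_states: torch.Tensor) -> torch.Tensor:
    # Penalize pairwise cosine similarity between token representations,
    # pushing generated tokens apart from one another.
    h = F.normalize(hidden_states, dim=-1)       # (seq_len, dim)
    sim = h @ h.T                                # (seq_len, seq_len)
    off_diag = sim - torch.eye(h.size(0))        # zero out self-similarity
    return off_diag.mean()

def total_loss(logits, hidden_states, eos_token_id,
               weights=(1.0, 1.0, 1.0)) -> torch.Tensor:
    # Constant weights here; the paper instead balances the three terms
    # with a temporal weight adjustment schedule.
    w1, w2, w3 = weights
    return (w1 * eos_delay_loss(logits, eos_token_id)
            + w2 * uncertainty_loss(logits)
            + w3 * diversity_loss(hidden_states))
```

In the attack setting, this combined loss would be minimized with respect to an adversarial perturbation on the input image (e.g., by projected gradient descent), while the VLM's weights stay frozen.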
Keywords
» Artificial intelligence » GPT » Inference » Token