Let’s Go Shopping (LGS) – Web-Scale Image-Text Dataset for Visual Concept Understanding
by Yatong Bai, Utsav Garg, Apaar Shanker, Haoming Zhang, Samyak Parajuli, Erhan Bas, Isidora Filipovic, Amelia N. Chu, Eugenia D Fomitcheva, Elliot Branson, Aerin Kim, Somayeh Sojoudi, Kyunghyun Cho
First submitted to arXiv on: 9 Jan 2024
Categories
- Main: Computer Vision and Pattern Recognition (cs.CV)
- Secondary: Artificial Intelligence (cs.AI)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. The summaries below all cover the same paper, each written at a different level of difficulty. The medium- and low-difficulty versions are original summaries written by GrooveSquid.com, while the high-difficulty version is the paper’s original abstract. Feel free to read whichever version suits you best!
Summary difficulty | Written by | Summary
---|---|---
High | Paper authors | The paper’s original abstract, available on arXiv
Medium | GrooveSquid.com (original content) | This paper addresses the challenge of collecting large-scale annotated datasets for neural-network applications such as image classification and captioning. Existing annotation pipelines are time-consuming and limited, leaving researchers and practitioners with only a small number of datasets to choose from. To overcome this, the authors propose commercial shopping websites as a data source that meets three criteria: cleanliness, informativeness, and fluency. The resulting dataset, Let’s Go Shopping (LGS), contains 15 million image-caption pairs from publicly available e-commerce websites. Compared with existing general-domain datasets, LGS images focus on the foreground object and have less complex backgrounds. The authors show that classifiers trained on existing benchmark datasets do not generalize well to e-commerce data, whereas self-supervised visual feature extractors transfer better (a rough sketch of such an experiment appears after this table). Finally, the high-quality e-commerce-focused images and the bimodal nature of LGS make it well suited to vision-language tasks such as image captioning and text-to-image generation.
Low | GrooveSquid.com (original content) | This paper tries to solve a big problem in computer science. Right now, it’s hard to find enough pictures and words that go together so we can teach computers to recognize things or describe what they see. The usual way to collect these pairs is slow and doesn’t work very well. So the authors came up with a new idea: use pictures from shopping websites! They built a big dataset called LGS, which has 15 million picture-word pairs that are great for training machines to recognize objects or describe what they see. The cool thing about this dataset is that it’s really high-quality, and each picture comes with lots of words, so we can teach computers to write more detailed descriptions.
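The medium-difficulty summary mentions a transfer experiment: classifiers trained on standard benchmarks generalize poorly to e-commerce images, while frozen self-supervised features fare better. Below is a minimal, hypothetical sketch of such a linear-probe evaluation in PyTorch. The `lgs_subset/train` directory, the ResNet-50 backbone, and all hyperparameters are illustrative assumptions; the summary does not specify the paper’s actual models, data layout, or training protocol.

```python
# Hypothetical linear-probe transfer experiment: freeze a pretrained
# backbone and fit only a linear classifier on e-commerce-style images.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, models, transforms

device = "cuda" if torch.cuda.is_available() else "cpu"

# Stand-in backbone; the paper evaluates self-supervised extractors,
# whose exact identities are not given in this summary.
backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
backbone.fc = nn.Identity()              # expose 2048-d pooled features
backbone.eval().to(device)
for p in backbone.parameters():
    p.requires_grad = False              # backbone stays frozen

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

# Hypothetical folder of labeled e-commerce images, one class per subdir.
train_set = datasets.ImageFolder("lgs_subset/train", transform=preprocess)
loader = DataLoader(train_set, batch_size=64, shuffle=True, num_workers=4)

probe = nn.Linear(2048, len(train_set.classes)).to(device)
optimizer = torch.optim.AdamW(probe.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

for images, labels in loader:            # one epoch of linear probing
    images, labels = images.to(device), labels.to(device)
    with torch.no_grad():
        feats = backbone(images)         # frozen feature extraction
    loss = criterion(probe(feats), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Freezing the backbone isolates the quality of the pretrained features themselves, which is the comparison the summary describes: if the linear probe classifies well, the features already capture e-commerce visual concepts.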
Keywords
» Artificial intelligence » Image captioning » Image classification » Image generation » Neural network » Self-supervised