Summary of BigDocs: An Open Dataset for Training Multimodal Models on Document and Code Tasks, by Juan Rodriguez et al.
BigDocs: An Open Dataset for Training Multimodal Models on Document and Code Tasks
by Juan Rodriguez, Xiangru Jian, Siba Smarak Panigrahi, Tianyu Zhang, Aarash Feizi, Abhay Puri, Akshay Kalkunte, François Savard, Ahmed Masry, Shravan Nayak, Rabiul Awal, Mahsa Massoud, Amirhossein Abaskohi, Zichao Li, Suyuchen Wang, Pierre-André Noël, Mats Leon Richter, Saverio Vadacchino, Shubham Agarwal, Sanket Biswas, Sara Shanian, Ying Zhang, Noah Bolger, Kurt MacDonald, Simon Fauvel, Sathwik Tejaswi, Srinivas Sunkara, Joao Monteiro, Krishnamurthy DJ Dvijotham, Torsten Scholak, Nicolas Chapados, Sepideh Kharagani, Sean Hughes, M. Özsu, Siva Reddy, Marco Pedersoli, Yoshua Bengio, Christopher Pal, Issam Laradji, Spandana Gella, Perouz Taslakian, David Vazquez, Sai Rajeswar
First submitted to arXiv on: 5 Dec 2024
Categories
- Main: Machine Learning (cs.LG)
- Secondary: Computation and Language (cs.CL)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary
---|---|---
High | Paper authors | Read the original abstract here
Medium | GrooveSquid.com (original content) | This paper explores the potential of multimodal AI to enhance document-understanding tasks such as processing receipts, understanding workflows, extracting data from documents, and summarizing reports. Code generation tasks that require long structured outputs can also benefit from multimodality. However, commercial applications are often limited by restricted access to training data and licensing issues. To address these limitations, the authors introduce BigDocs-7.5M, a high-quality, open-access dataset comprising 7.5 million multimodal documents across 30 tasks, curated with an efficient process that emphasizes accountability, responsibility, and transparency. The authors also present BigDocs-Bench, a benchmark suite with 10 novel tasks reflecting real-world use cases such as GUI reasoning and code generation from images. Experiments show that training on BigDocs-Bench improves average performance by up to 25.8% over GPT-4o on document reasoning and structured output tasks, and human evaluators prefer outputs from models trained on BigDocs. This suggests that BigDocs can help academics and the open-source community use and improve AI tools for multimodal capabilities and document reasoning.
Low | GrooveSquid.com (original content) | This paper is about using artificial intelligence (AI) to understand and process documents such as receipts and reports. The problem is that this technology is not available to everyone, because the data needed to train these AI models is hard to obtain. To solve this, the authors created a large open dataset called BigDocs-7.5M, with millions of documents that anyone can use. They also built a test suite called BigDocs-Bench that measures how well AI models perform on real-life document tasks. The results show that models trained on their data are better at understanding and processing documents, which could help people in many different fields.
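
To give a concrete sense of how an open multimodal document dataset like BigDocs-7.5M might be consumed, here is a minimal sketch using the Hugging Face `datasets` library. The dataset identifier, split name, and field layout in this snippet are assumptions for illustration only and do not reflect the official release; consult the paper's project page for the actual access details.

```python
from datasets import load_dataset  # Hugging Face datasets library

# Hypothetical dataset ID; the real BigDocs release may use a different
# identifier, splits, and per-example schema.
ds = load_dataset("bigdocs/bigdocs-7.5m", split="train", streaming=True)

# Inspect a few examples. A multimodal document sample would typically pair
# a document image with a task instruction and a structured target
# (e.g., extracted fields, HTML, or LaTeX).
for example in ds.take(3):
    print(example.keys())
```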
Keywords
» Artificial intelligence » GPT