Summary of BigDocs: An Open Dataset for Training Multimodal Models on Document and Code Tasks, by Juan Rodriguez et al.
BigDocs: An Open Dataset for Training Multimodal Models on Document and Code Tasks
by Juan Rodriguez, Xiangru Jian, Siba Smarak Panigrahi, Tianyu Zhang, Aarash Feizi, Abhay Puri, Akshay Kalkunte, François Savard, Ahmed Masry, Shravan Nayak, Rabiul Awal, Mahsa Massoud, Amirhossein Abaskohi, Zichao Li, Suyuchen Wang, Pierre-André Noël, Mats Leon Richter, Saverio Vadacchino, Shubham Agarwal, Sanket Biswas, Sara Shanian, Ying Zhang, Noah Bolger, Kurt MacDonald, Simon Fauvel, Sathwik Tejaswi, Srinivas Sunkara, Joao Monteiro, Krishnamurthy DJ Dvijotham, Torsten Scholak, Nicolas Chapados, Sepideh Kharagani, Sean Hughes, M. Özsu, Siva Reddy, Marco Pedersoli, Yoshua Bengio, Christopher Pal, Issam Laradji, Spandana Gella, Perouz Taslakian, David Vazquez, Sai Rajeswar
First submitted to arXiv on: 5 Dec 2024
Categories
- Main: Machine Learning (cs.LG)
- Secondary: Computation and Language (cs.CL)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary
---|---|---
High | Paper authors | Read the original abstract here
Medium | GrooveSquid.com (original content) | This paper explores the potential of multimodal AI to enhance document-understanding tasks such as processing receipts, understanding workflows, extracting data from documents, and summarizing reports. Code generation tasks that require long structured outputs can also benefit from multimodality. However, commercial applications are often limited by restricted access to training data and licensing issues. To address these limitations, the authors introduce BigDocs-7.5M, a high-quality, open-access dataset comprising 7.5 million multimodal documents across 30 tasks, curated with an efficient process that emphasizes accountability, responsibility, and transparency. The authors also present BigDocs-Bench, a benchmark suite with 10 novel tasks reflecting real-world use cases such as GUI reasoning and code generation from images. Experiments show that training on BigDocs-Bench improves average performance by up to 25.8% over GPT-4o on document reasoning and structured output tasks, and human evaluators prefer outputs from models trained on BigDocs. This suggests that BigDocs can help academics and the open-source community use and improve AI tools for multimodal capabilities and document reasoning.
Low | GrooveSquid.com (original content) | This paper is about using artificial intelligence (AI) to understand and process documents such as receipts and reports. The problem is that this technology is not available to everyone, because the data needed to train these AI models is hard to obtain. To solve this, the authors created a large open dataset called BigDocs-7.5M, with millions of documents that anyone can use. They also built a test suite called BigDocs-Bench that measures how well AI models perform on real-life document tasks. The results show that models trained on their data are better at understanding and processing documents, which could help people in many different fields.
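
To give a concrete sense of how an open multimodal document dataset like BigDocs-7.5M might be consumed, here is a minimal sketch using the Hugging Face `datasets` library. The dataset identifier, split name, and field layout in this snippet are assumptions for illustration only and do not reflect the official release; consult the paper's project page for the actual access details.

```python
from datasets import load_dataset  # Hugging Face datasets library

# Hypothetical dataset ID; the real BigDocs release may use a different
# identifier, splits, and per-example schema.
ds = load_dataset("bigdocs/bigdocs-7.5m", split="train", streaming=True)

# Inspect a few examples. A multimodal document sample would typically pair
# a document image with a task instruction and a structured target
# (e.g., extracted fields, HTML, or LaTeX).
for example in ds.take(3):
    print(example.keys())
```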
Keywords
» Artificial intelligence » GPT