

Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

by Gemini Team Google, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, Soroosh Mariooryad, Yifan Ding, Xinyang Geng, Fred Alcober, Roy Frostig, Mark Omernick, Lexi Walker, Cosmin Paduraru, Christina Sorokin, Andrea Tacchetti, Colin Gaffney, Samira Daruki, Olcan Sercinoglu, Zach Gleicher, Juliette Love, Paul Voigtlaender, Rohan Jain, Gabriela Surita, Kareem Mohamed, Rory Blevins, Junwhan Ahn, Tao Zhu, Kornraphop Kawintiranon, Orhan Firat, Yiming Gu, Yujing Zhang, Matthew Rahtz, Manaal Faruqui, Natalie Clay, Justin Gilmer, JD Co-Reyes, Ivo Penchev, Rui Zhu, Nobuyuki Morioka, Kevin Hui, Krishna Haridasan, Victor Campos, Mahdis Mahdieh, Mandy Guo, Samer Hassan, Kevin Kilgour, Arpi Vezer, Heng-Tze Cheng, Raoul de Liedekerke, Siddharth Goyal, Paul Barham, DJ Strouse, Seb Noury, Jonas Adler, Mukund Sundararajan, Sharad Vikram, Dmitry Lepikhin, Michela Paganini, Xavier Garcia, Fan Yang, Dasha Valter, Maja Trebacz, Kiran Vodrahalli, Chulayuth Asawaroengchai, Roman Ring, Norbert Kalb, Livio Baldini Soares, Siddhartha Brahma, David Steiner, Tianhe Yu, Fabian Mentzer, Antoine He, Lucas Gonzalez, Bibo Xu, Raphael Lopez Kaufman, Laurent El Shafey, Junhyuk Oh, Tom Hennigan, George van den Driessche, Seth Odoom, Mario Lucic, Becca Roelofs, Sid Lall, Amit Marathe, Betty Chan, Santiago Ontanon, Luheng He, Denis Teplyashin, Jonathan Lai, Phil Crone, Bogdan Damoc, Lewis Ho, Sebastian Riedel, Karel Lenc, Chih-Kuan Yeh

First submitted to arxiv on: 8 Mar 2024

Categories

  • Main: Computation and Language (cs.CL)
  • Secondary: Artificial Intelligence (cs.AI)



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
Read the original abstract with the paper on arXiv.
Medium Difficulty Summary (written by GrooveSquid.com, original content)
The Gemini 1.5 family of models is a significant advance in multimodal processing. The models can recall and reason over fine-grained information from millions of tokens of context across modalities, including multiple long documents and hours of video and audio. The update introduces two models: Gemini 1.5 Pro, which outperforms its predecessor on most capabilities and benchmarks, and Gemini 1.5 Flash, a more efficient variant with minimal regression in quality. These models achieve near-perfect recall on long-context retrieval tasks, improve the state of the art in long-document QA, long-video QA, and long-context automatic speech recognition (ASR), and match or surpass Gemini 1.0 Ultra across a broad set of benchmarks. Probing the limits of Gemini 1.5’s long-context ability, the study finds continued improvement in next-token prediction and near-perfect retrieval (above 99%) up to at least 10M tokens. The paper also highlights real-world applications, such as Gemini 1.5 collaborating with professionals on their tasks, which yielded time savings of 26 to 75% across 10 job categories.
Low Difficulty Summary (written by GrooveSquid.com, original content)
The Gemini 1.5 family of models is a new way for computers to understand and work with lots of information from different sources, like documents and videos. These models are special because they can remember and use this information to help people complete tasks more efficiently. The updates include two new models: Gemini 1.5 Pro, which does better than its previous version on most things, and Gemini 1.5 Flash, a lighter version that is still very good but uses less computer power. These models are really good at finding the right information in big documents and videos, and they even do better than other models at tasks like answering questions about what’s happening in these files.
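Claims of near-perfect recall on long-context retrieval refer to tests in which a small fact (a "needle") is hidden somewhere in a long context and the model is asked to retrieve it; such tests are often called needle-in-a-haystack evaluations. The sketch below illustrates the idea only, under stated assumptions: `build_haystack`, `toy_model`, `truncating_model`, and `retrieval_recall` are hypothetical names, the filler text and needle are made up, and the "models" are simple stand-ins rather than Gemini or the paper's actual evaluation code.

```python
import re


def build_haystack(num_filler: int, needle: str, depth: float) -> str:
    """Hide `needle` at relative `depth` (0.0 = start, 1.0 = end) among filler sentences."""
    filler = [f"Filler sentence {i} about nothing in particular." for i in range(num_filler)]
    filler.insert(round(depth * num_filler), needle)
    return " ".join(filler)


def toy_model(context: str, question: str) -> str:
    """Hypothetical stand-in for a model call: finds the needle by pattern matching."""
    match = re.search(r"secret number is (\d+)", context)
    return match.group(1) if match else ""


def truncating_model(context: str, question: str) -> str:
    """Hypothetical model that only 'sees' the last 200 characters of its context."""
    return toy_model(context[-200:], question)


def retrieval_recall(model, lengths, depths, secret: str = "42") -> float:
    """Fraction of (length, depth) trials in which the model's answer contains the secret."""
    needle = f"The secret number is {secret}."
    question = "What is the secret number?"
    hits = 0
    trials = 0
    for n in lengths:          # vary how much filler surrounds the needle
        for d in depths:       # vary where the needle is placed
            answer = model(build_haystack(n, needle, d), question)
            hits += int(secret in answer)
            trials += 1
    return hits / trials


lengths = [10, 100, 1000]
depths = [0.0, 0.5, 1.0]
print(retrieval_recall(toy_model, lengths, depths))         # prints 1.0
print(retrieval_recall(truncating_model, lengths, depths))  # about 0.33 in this setup
```

The perfect stand-in retrieves every needle, while the truncating one only succeeds when the needle sits near the end of the context, which is the kind of failure these tests are designed to expose. Real evaluations of this sort use an actual model, far longer contexts, and, in the case of Gemini 1.5, video and audio as well as text.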

Keywords

* Artificial intelligence  * Gemini  * Recall  * Regression  * Token