
Summary of Revealing Vision-Language Integration in the Brain with Multimodal Networks, by Vighnesh Subramaniam et al.


Revealing Vision-Language Integration in the Brain with Multimodal Networks

by Vighnesh Subramaniam, Colin Conwell, Christopher Wang, Gabriel Kreiman, Boris Katz, Ignacio Cases, Andrei Barbu

First submitted to arXiv on: 20 Jun 2024

Categories

  • Main: Machine Learning (cs.LG)
  • Secondary: Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE); Neurons and Cognition (q-bio.NC)



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. The summaries below all cover the same paper but are written at different levels of difficulty: the medium- and low-difficulty versions are original summaries written by GrooveSquid.com, while the high-difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
Read the original abstract here.

Medium Difficulty Summary (written by GrooveSquid.com, original content)
This research paper proposes a novel approach to investigating multimodal integration in the human brain using deep neural networks (DNNs). The authors use DNNs to predict stereoelectroencephalography (SEEG) recordings taken while subjects watch movies, operationalizing sites of multimodal integration as neural sites where a multimodal DNN predicts the recordings better than unimodal language models, unimodal vision models, or linearly-integrated language-vision models. The paper explores a range of architectures and training techniques for these models, including convolutional networks and transformers as architectures, and cross-attention and contrastive learning as multimodal training methods. The authors first demonstrate that trained vision and language models outperform their randomly initialized counterparts in predicting SEEG signals. They then compare unimodal and multimodal models against each other, finding a sizable number of neural sites (12.94%) where multimodal integration occurs (a rough code sketch of this comparison follows the summaries below). Among the multimodal training techniques assessed, CLIP-style training is found to be best suited for downstream prediction of neural activity at these sites.
Low Difficulty Summary (written by GrooveSquid.com, original content)
This study uses computer programs called deep neural networks to figure out how our brains combine what we see with the language we hear, for example while watching a movie. The researchers recorded brain activity while people watched movies and then used the computer models to predict what was happening in their brains. They found that some parts of the brain are especially good at combining visual information with language, and they identified where those parts are. They also discovered that one type of training for these models works better than others when trying to predict brain activity.
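
The comparison described in the medium-difficulty summary can be pictured with a short sketch. The snippet below is an illustrative assumption, not the authors' code: it uses random placeholder feature matrices and a placeholder SEEG electrode response, a ridge-regression encoding model from scikit-learn, and a simple cross-validated correlation score, and it flags the electrode as a candidate multimodal-integration site when multimodal features out-predict the unimodal and concatenated (linearly-integrated) baselines. The paper's actual feature extraction, regularization, and statistical testing differ.

```python
# Minimal sketch (not the authors' pipeline): test whether an electrode's
# SEEG response is predicted better by multimodal-model features than by
# unimodal or linearly-integrated (concatenated) unimodal features.
# All arrays here are random placeholders standing in for real data.
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
n_timepoints = 1000  # movie time bins aligned to the SEEG recording

# One electrode's response over the movie (placeholder).
y = rng.standard_normal(n_timepoints)

# Hypothetical model activations for the same time bins.
feats = {
    "vision":     rng.standard_normal((n_timepoints, 512)),  # vision-model features
    "language":   rng.standard_normal((n_timepoints, 512)),  # language-model features
    "multimodal": rng.standard_normal((n_timepoints, 512)),  # e.g. CLIP-style features
}
# Linearly-integrated baseline: concatenation of the two unimodal feature spaces.
feats["vision+language"] = np.hstack([feats["vision"], feats["language"]])

def predictivity(X, y):
    """Cross-validated Pearson r between predicted and actual electrode response."""
    model = RidgeCV(alphas=np.logspace(-2, 4, 7))
    y_hat = cross_val_predict(model, X, y, cv=5)
    return np.corrcoef(y_hat, y)[0, 1]

scores = {name: predictivity(X, y) for name, X in feats.items()}
best_unimodal = max(scores["vision"], scores["language"], scores["vision+language"])

# Candidate multimodal-integration site: the multimodal model wins the comparison.
# (The paper additionally applies bootstrap/permutation statistics before counting a site.)
print(scores)
print("candidate multimodal integration site?", scores["multimodal"] > best_unimodal)
```

Run per electrode, a criterion like this yields the fraction of sites where the multimodal model wins, which is the kind of quantity the 12.94% figure in the summary refers to.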

Keywords

  • Artificial intelligence
  • Cross attention