Landmark-Guided Cross-Speaker Lip Reading with Mutual Information Regularization

by Linzhi Wu, Xingyu Zhang, Yakun Zhang, Changyan Zheng, Tiejun Liu, Liang Xie, Ye Yan, Erwei Yin

First submitted to arXiv on: 24 Mar 2024

Categories

  • Main: Artificial Intelligence (cs.AI)
  • Secondary: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)

GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
Read the original abstract here.

Medium Difficulty Summary (written by GrooveSquid.com; original content)
In this paper, the researchers aim to improve deep learning-based lip reading systems by developing a model that accurately recognizes silent speech across different speakers. The challenge lies in inter-speaker variability: a system trained on some speakers may struggle with a new one. To address this, the authors propose a hybrid architecture combining connectionist temporal classification (CTC) and attention mechanisms, and they leverage fine-grained visual clues from lip landmarks instead of the traditional mouth-cropped images. Additionally, they introduce a max-min mutual information regularization approach to capture speaker-insensitive latent representations; a hedged sketch of how such a training objective could be assembled follows this summary. Experimental results on public datasets demonstrate the effectiveness of the proposed method in both intra-speaker and inter-speaker settings.
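To make the moving parts concrete, here is a minimal PyTorch sketch of how a hybrid CTC/attention loss might be combined with a mutual-information penalty between content features and speaker embeddings. Everything below is an illustrative assumption rather than the authors' implementation: the CLUB-style estimator is one common way to obtain a minimizable MI upper bound, and the module names, tensor shapes, and loss weights are invented for the example. The "max" side of the paper's max-min scheme (e.g., maximizing MI with the speech content via a contrastive lower bound) is omitted for brevity.

```python
# Illustrative sketch only: a CLUB-style upper bound on mutual information
# used as a minimization penalty, combined with a hybrid CTC/attention loss.
# Names, shapes, and weights are assumptions, not the paper's released code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CLUBEstimator(nn.Module):
    """Variational upper bound on I(x; y) in the style of CLUB.

    A small network q(y|x) predicts speaker embeddings y from content
    features x; the bound is the gap between matched-pair and
    shuffled-pair log-likelihoods.
    """

    def __init__(self, x_dim: int, y_dim: int, hidden: int = 256):
        super().__init__()
        self.mu = nn.Sequential(
            nn.Linear(x_dim, hidden), nn.ReLU(), nn.Linear(hidden, y_dim)
        )
        self.logvar = nn.Sequential(
            nn.Linear(x_dim, hidden), nn.ReLU(), nn.Linear(hidden, y_dim)
        )

    def forward(self, x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        mu, logvar = self.mu(x), self.logvar(x)
        # log q(y|x) for matched (x, y) pairs (constants cancel below) ...
        pos = -0.5 * (((y - mu) ** 2) / logvar.exp() + logvar).sum(-1)
        # ... and for within-batch shuffled (negative) pairs.
        y_shuf = y[torch.randperm(y.size(0), device=y.device)]
        neg = -0.5 * (((y_shuf - mu) ** 2) / logvar.exp() + logvar).sum(-1)
        return (pos - neg).mean()  # surrogate MI upper bound


def training_loss(
    ctc_log_probs,   # (T, N, vocab) log-softmax outputs for the CTC branch
    attn_logits,     # (N, L, vocab) decoder logits for the attention branch
    targets,         # (N, L) token ids
    input_lengths,   # (N,) frame counts per clip
    target_lengths,  # (N,) token counts per clip
    content_feats,   # (N, D) utterance-level content features
    speaker_embs,    # (N, E) speaker embeddings
    mi_estimator: CLUBEstimator,
    ctc_weight: float = 0.3,  # assumed value
    mi_weight: float = 0.1,   # assumed value
):
    """Hybrid CTC/attention loss plus an MI-minimization penalty."""
    ctc = F.ctc_loss(ctc_log_probs, targets, input_lengths, target_lengths,
                     blank=0, zero_infinity=True)
    attn = F.cross_entropy(attn_logits.transpose(1, 2), targets)
    mi_penalty = mi_estimator(content_feats, speaker_embs)
    return ctc_weight * ctc + (1.0 - ctc_weight) * attn + mi_weight * mi_penalty
```

In a CLUB-style setup the estimator network is trained in alternation with the main model: its parameters are updated to maximize log q(y|x) on matched pairs, while the lip-reading encoder is updated to minimize the bound, so the content features gradually shed speaker information.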
Low Difficulty Summary (written by GrooveSquid.com; original content)
This paper is about making it easier for computers to read lips, that is, to recognize speech just from how someone's mouth moves, without any sound. Computers are already good at this task, but they can struggle when they encounter a new person. The researchers want to solve this problem by building a model that recognizes silent speech even when the speaker changes. The idea is that, by using more detailed visual clues from the lips and ignoring some differences between people's faces, the computer can read lips more accurately. This matters because it could help people who are deaf or hard of hearing communicate more easily.

Keywords

  • Artificial intelligence
  • Attention
  • Deep learning
  • Regularization