Efficient Long Video Tokenization via Coordinate-based Patch Reconstruction

by Huiwon Jang, Sihyun Yu, Jinwoo Shin, Pieter Abbeel, Younggyo Seo

First submitted to arXiv on: 22 Nov 2024

Categories

  • Main: Computer Vision and Pattern Recognition (cs.CV)
  • Secondary: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (the paper's original abstract, by the paper authors)
Read the original abstract here.

Medium Difficulty Summary (original content by GrooveSquid.com)
The paper introduces CoordTok, a video tokenizer that learns to encode a long video into factorized triplane representations and to reconstruct the patches that correspond to randomly sampled (x, y, t) coordinates. Because training only ever reconstructs a sampled subset of patches rather than entire frames, large tokenizer models can be trained directly on long videos without excessive resource requirements. The authors show that CoordTok encodes a 128-frame video at 128×128 resolution into just 1280 tokens, where baselines need 6144 or 8192 tokens for similar reconstruction quality. This efficient tokenization in turn enables memory-efficient training of a diffusion transformer that can generate long videos; a toy sketch of the coordinate-based reconstruction step follows these summaries.

Low Difficulty Summary (original content by GrooveSquid.com)
Imagine trying to summarize a really long movie in just a few words. That is what this paper is about: finding a way to do it efficiently. The authors introduce a tool called CoordTok that helps computers understand and work with long videos by breaking them into smaller parts. This makes it possible to train models on these videos without needing too much memory or compute. Their approach greatly reduces the amount of information needed to describe a long video, which is useful for tasks like generating new videos.
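
To make the mechanism in the medium summary concrete, here is a minimal PyTorch sketch of coordinate-based patch reconstruction from factorized triplane features. It is an illustration under assumptions, not the authors' implementation: the plane pairings ((x, y), (x, t), (y, t)), the names query_triplane and PatchDecoder, and all shapes and hyperparameters are guesses for exposition. In the paper the planes come from a learned video encoder; here random tensors stand in so the script runs end to end.

import torch
import torch.nn as nn
import torch.nn.functional as F

def query_triplane(planes, coords):
    # Bilinearly sample features for normalized (x, y, t) coordinates from
    # three factorized planes and concatenate them per coordinate.
    #   planes: dict with 'xy', 'xt', 'yt' -> (B, C, H, W) feature maps
    #   coords: (B, N, 3) coordinates in [-1, 1]
    #   returns: (B, N, 3 * C)
    x, y, t = coords[..., 0], coords[..., 1], coords[..., 2]
    pairs = {'xy': (x, y), 'xt': (x, t), 'yt': (y, t)}
    feats = []
    for key, (u, v) in pairs.items():
        grid = torch.stack([u, v], dim=-1).unsqueeze(2)           # (B, N, 1, 2)
        f = F.grid_sample(planes[key], grid, align_corners=True)  # (B, C, N, 1)
        feats.append(f.squeeze(-1).transpose(1, 2))               # (B, N, C)
    return torch.cat(feats, dim=-1)

class PatchDecoder(nn.Module):
    # Maps each queried triplane feature to one RGB patch of patch x patch pixels.
    def __init__(self, feat_dim, patch=8):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, 512),
            nn.GELU(),
            nn.Linear(512, 3 * patch * patch),
        )

    def forward(self, feats):  # (B, N, feat_dim) -> (B, N, 3 * patch**2)
        return self.mlp(feats)

if __name__ == "__main__":
    B, C, N = 2, 64, 256  # toy batch size, channel count, sampled patch count
    # Random stand-ins for encoder-produced planes (the paper learns these).
    planes = {k: torch.randn(B, C, 32, 32) for k in ('xy', 'xt', 'yt')}
    coords = torch.rand(B, N, 3) * 2 - 1  # random (x, y, t) in [-1, 1]
    decoder = PatchDecoder(feat_dim=3 * C)
    pred = decoder(query_triplane(planes, coords))
    # In training, targets would be the real video patches cut out at `coords`;
    # random targets here just let the loss run.
    target = torch.randn_like(pred)
    loss = F.mse_loss(pred, target)  # loss touches only the sampled patches
    print(loss.item())

The efficiency claim rests on exactly this structure: each training step decodes only N sampled patches instead of every pixel of the 128-frame video, so the reconstruction cost no longer scales with the full video size.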

Keywords

  • Artificial intelligence
  • Diffusion
  • Tokenization
  • Tokenizer