Efficient Long Video Tokenization via Coordinate-based Patch Reconstruction

by Huiwon Jang, Sihyun Yu, Jinwoo Shin, Pieter Abbeel, Younggyo Seo

First submitted to arXiv on: 22 Nov 2024

Categories

  • Main: Computer Vision and Pattern Recognition (cs.CV)
  • Secondary: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (the paper's original abstract, by the paper authors)
Read the original abstract here.

Medium Difficulty Summary (original content by GrooveSquid.com)
The paper introduces CoordTok, a video tokenizer that learns to encode a long video into factorized triplane representations and to reconstruct the patches that correspond to randomly sampled (x, y, t) coordinates. Because training only ever reconstructs a sampled subset of patches rather than entire frames, large tokenizer models can be trained directly on long videos without excessive resource requirements. The authors show that CoordTok encodes a 128-frame video at 128×128 resolution into just 1280 tokens, where baselines need 6144 or 8192 tokens for similar reconstruction quality. This efficient tokenization in turn enables memory-efficient training of a diffusion transformer that can generate long videos; a toy sketch of the coordinate-based reconstruction step follows these summaries.

Low Difficulty Summary (original content by GrooveSquid.com)
Imagine trying to summarize a really long movie in just a few words. That is what this paper is about: finding a way to do it efficiently. The authors introduce a tool called CoordTok that helps computers understand and work with long videos by breaking them into smaller parts. This makes it possible to train models on these videos without needing too much memory or compute. Their approach greatly reduces the amount of information needed to describe a long video, which is useful for tasks like generating new videos.
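
To make the mechanism in the medium summary concrete, here is a minimal PyTorch sketch of coordinate-based patch reconstruction from factorized triplane features. It is an illustration under assumptions, not the authors' implementation: the plane pairings ((x, y), (x, t), (y, t)), the names query_triplane and PatchDecoder, and all shapes and hyperparameters are guesses for exposition. In the paper the planes come from a learned video encoder; here random tensors stand in so the script runs end to end.

import torch
import torch.nn as nn
import torch.nn.functional as F

def query_triplane(planes, coords):
    # Bilinearly sample features for normalized (x, y, t) coordinates from
    # three factorized planes and concatenate them per coordinate.
    #   planes: dict with 'xy', 'xt', 'yt' -> (B, C, H, W) feature maps
    #   coords: (B, N, 3) coordinates in [-1, 1]
    #   returns: (B, N, 3 * C)
    x, y, t = coords[..., 0], coords[..., 1], coords[..., 2]
    pairs = {'xy': (x, y), 'xt': (x, t), 'yt': (y, t)}
    feats = []
    for key, (u, v) in pairs.items():
        grid = torch.stack([u, v], dim=-1).unsqueeze(2)           # (B, N, 1, 2)
        f = F.grid_sample(planes[key], grid, align_corners=True)  # (B, C, N, 1)
        feats.append(f.squeeze(-1).transpose(1, 2))               # (B, N, C)
    return torch.cat(feats, dim=-1)

class PatchDecoder(nn.Module):
    # Maps each queried triplane feature to one RGB patch of patch x patch pixels.
    def __init__(self, feat_dim, patch=8):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, 512),
            nn.GELU(),
            nn.Linear(512, 3 * patch * patch),
        )

    def forward(self, feats):  # (B, N, feat_dim) -> (B, N, 3 * patch**2)
        return self.mlp(feats)

if __name__ == "__main__":
    B, C, N = 2, 64, 256  # toy batch size, channel count, sampled patch count
    # Random stand-ins for encoder-produced planes (the paper learns these).
    planes = {k: torch.randn(B, C, 32, 32) for k in ('xy', 'xt', 'yt')}
    coords = torch.rand(B, N, 3) * 2 - 1  # random (x, y, t) in [-1, 1]
    decoder = PatchDecoder(feat_dim=3 * C)
    pred = decoder(query_triplane(planes, coords))
    # In training, targets would be the real video patches cut out at `coords`;
    # random targets here just let the loss run.
    target = torch.randn_like(pred)
    loss = F.mse_loss(pred, target)  # loss touches only the sampled patches
    print(loss.item())

The efficiency claim rests on exactly this structure: each training step decodes only N sampled patches instead of every pixel of the 128-frame video, so the reconstruction cost no longer scales with the full video size.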

Keywords

  • Artificial intelligence
  • Diffusion
  • Tokenization
  • Tokenizer