
Summary of HVM-1: Large-scale video models pretrained with nearly 5000 hours of human-like video data, by A. Emin Orhan


HVM-1: Large-scale video models pretrained with nearly 5000 hours of human-like video data

by A. Emin Orhan

First submitted to arXiv on: 25 Jul 2024

Categories

  • Main: Computer Vision and Pattern Recognition (cs.CV)
  • Secondary: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Neurons and Cognition (q-bio.NC)



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
Read the original abstract here

Medium Difficulty Summary (written by GrooveSquid.com, original content)
This paper introduces Human-like Video Models (HVM-1), large-scale video models trained on nearly 5000 hours of curated human-like video data. The HVM-1 models are pretrained with the spatiotemporal masked autoencoder (ST-MAE) algorithm and released in two variants that differ in spatial resolution (a rough sketch of the ST-MAE masking idea follows the summaries below). The authors evaluate these models on downstream few-shot video and image recognition tasks and compare them against a model pretrained with 1330 hours of short action-oriented video clips from YouTube (Kinetics-700). Despite the differences between the pretraining datasets, the HVM-1 models perform competitively against the Kinetics-700 model. The study also shows that HVM-1 models learn more accurate and more robust object representations than models trained on the same data with an image-based MAE algorithm.

Low Difficulty Summary (written by GrooveSquid.com, original content)
This paper introduces a new kind of AI model that understands videos better. The authors trained the model on thousands of hours of video recordings, most of which show people doing everyday things like walking or eating. They then compared it to another model that was trained on shorter, action-oriented videos from YouTube. Surprisingly, their model performs just as well despite the big difference in the videos the two models saw during training. The new model also understands objects in videos more accurately than models that only learn from still images, and it can pick up new tasks from just a few examples.
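
The ST-MAE objective mentioned in the medium summary hides most of the spatiotemporal patches of each video clip and trains the model to reconstruct the hidden content from the few patches left visible. Below is a minimal illustrative sketch, not the authors' code, of how such a random spatiotemporal patch mask could be generated; the patch-grid sizes and the 90% mask ratio are assumed values chosen only for illustration.

    # Illustrative sketch of ST-MAE-style random masking (assumed values, not the authors' code).
    # A video clip is split into spatiotemporal patches; most are hidden, and the
    # autoencoder is trained to reconstruct the hidden patches from the visible ones.
    import numpy as np

    def make_st_mask(num_frames, grid_h, grid_w, mask_ratio=0.9, seed=0):
        """Return a boolean mask over spatiotemporal patches.

        True  -> patch is hidden from the encoder (to be reconstructed)
        False -> patch stays visible to the encoder
        """
        rng = np.random.default_rng(seed)
        num_patches = num_frames * grid_h * grid_w          # total patches in the clip
        num_masked = int(round(mask_ratio * num_patches))   # how many patches to hide
        mask = np.zeros(num_patches, dtype=bool)
        mask[rng.choice(num_patches, size=num_masked, replace=False)] = True
        return mask.reshape(num_frames, grid_h, grid_w)

    # Example: a 16-frame clip whose frames are tiled into a 14x14 patch grid
    # (e.g. 224x224 pixels with 16x16-pixel patches); 90% of the patches are hidden.
    mask = make_st_mask(num_frames=16, grid_h=14, grid_w=14, mask_ratio=0.9)
    print(mask.shape, mask.mean())  # (16, 14, 14), roughly 0.9 of patches masked

In MAE-style pretraining, only the visible patches are passed to the encoder, and a lightweight decoder reconstructs the pixel content of the masked patches; the reconstruction error serves as the training loss.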

Keywords

» Artificial intelligence  » Autoencoder  » Few shot  » Mae  » Pretraining  » Spatiotemporal