A major challenge for modern AI is to learn to understand the world and learn to act largely by observation. This paper explores a self-supervised approach that combines internet-scale video data with a small amount of interaction data (robot trajectories), to develop models capable of understanding, predicting, and planning in the physical world. We first pre-train an action-free joint-embedding-predictive architecture, V-JEPA 2, on a video and image dataset comprising over 1 million hours of internet video. V-JEPA 2 achieves strong performance on motion understanding (77.3 top-1 accuracy on Something-Something v2) and state-of-the-art performance on human action anticipation (39.7 recall-at-5 on Epic-Kitchens-100), surpassing previous task-specific models. Additionally, after aligning V-JEPA 2 with a large language model, we demonstrate state-of-the-art performance on multiple video question-answering tasks at the 8 billion parameter scale (e.g., 84.0 on PerceptionTest, 76.9 on TempCompass). Finally, we show how self-supervised learning can be applied to robotic planning tasks by post-training a latent action-conditioned world model, V-JEPA 2-AC, using less than 62 hours of unlabeled robot videos from the Droid dataset. We deploy V-JEPA 2-AC zero-shot on Franka arms in two different labs and enable picking and placing of objects using planning with image goals. Notably, this is achieved without collecting any data from the robots in these environments, and without any task-specific training or reward. This work demonstrates how self-supervised learning from web-scale data and a small amount of robot interaction data can yield a world model capable of planning in the physical world.
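The last claim (zero-shot pick-and-place by planning toward image goals) is the part most easily made concrete in code. The block below is a minimal sketch of such a planning loop, assuming a CEM-style optimizer over action sequences and an L1 cost in latent space; the encoder, predictor, dimensions, and function names are all stand-ins for illustration, not the V-JEPA 2-AC implementation.

```python
# Hedged sketch: goal-image planning with a latent action-conditioned world model.
# The "encoder" and "predictor" below are random stand-ins (NOT V-JEPA 2-AC weights);
# only the loop structure is meant to illustrate the idea: sample action sequences,
# roll them out in latent space, score against the goal embedding, refine with CEM.
import numpy as np

rng = np.random.default_rng(0)
LATENT_DIM, ACTION_DIM, HORIZON = 32, 7, 5       # 7-DoF arm actions (assumption)

W_enc = rng.normal(size=(LATENT_DIM, LATENT_DIM))            # stand-in encoder
W_pred = rng.normal(size=(LATENT_DIM, LATENT_DIM + ACTION_DIM)) * 0.1

def encode(obs):                  # obs: flattened image features (stand-in)
    return np.tanh(W_enc @ obs)

def predict(z, a):                # one latent rollout step, conditioned on action a
    return np.tanh(W_pred @ np.concatenate([z, a]))

def plan(z0, z_goal, n_samples=256, n_elites=16, n_iters=5):
    """CEM-style planner: minimize the distance between the predicted final
    latent and the goal-image latent over a short action horizon."""
    mu = np.zeros((HORIZON, ACTION_DIM))
    sigma = np.ones((HORIZON, ACTION_DIM))
    for _ in range(n_iters):
        actions = rng.normal(mu, sigma, size=(n_samples, HORIZON, ACTION_DIM))
        costs = []
        for seq in actions:
            z = z0
            for a in seq:
                z = predict(z, a)
            costs.append(np.abs(z - z_goal).sum())    # L1 distance in latent space
        elites = actions[np.argsort(costs)[:n_elites]]
        mu, sigma = elites.mean(axis=0), elites.std(axis=0) + 1e-6
    return mu[0]                  # execute the first action, then replan (MPC style)

z0 = encode(rng.normal(size=LATENT_DIM))
z_goal = encode(rng.normal(size=LATENT_DIM))
print("first planned action:", plan(z0, z_goal))
```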
Notes: Tokens per clip come out to ~2048: with 2 × 16 × 16 tubelets on 16-frame, 256 × 256 clips, each frame yields (256/16) × (256/16) = 256 spatial patches, and the 16 frames collapse to 16/2 = 8 temporal positions, so 8 × 16 × 16 = 2048 tokens per clip. Batch size = 3072. Iterations = 240K ("Our training process begins with a warmup phase where we train on 16-frame, 256 × 256-resolution videos with linear learning rate warmup over 12K iterations, followed by a main training phase with a constant learning rate for 228K iterations."). Rough total pre-training compute via 6ND, assuming ~1e9 parameters: 6 × 1e9 × 240,000 × 3072 × 2048 ≈ 9.06e21 FLOPs.
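The arithmetic above is easy to reproduce; the block below is a minimal sanity check, where the ~1e9 parameter count and the 6ND FLOP approximation are assumptions rather than figures reported in the paper.

```python
# Sanity check of the back-of-the-envelope numbers above.
# Assumptions: 2x16x16 tubelets, 16-frame 256x256 clips, ~1e9 encoder
# parameters, and the standard 6*N*D FLOP estimate for transformer training.
frames, height, width = 16, 256, 256
t_tub, p_h, p_w = 2, 16, 16                      # tubelet size 2 x 16 x 16

tokens_per_clip = (frames // t_tub) * (height // p_h) * (width // p_w)
print(tokens_per_clip)                           # 8 * 16 * 16 = 2048

batch_size, iterations, params = 3072, 240_000, 1e9
total_tokens = iterations * batch_size * tokens_per_clip
flops_6nd = 6 * params * total_tokens
print(f"{total_tokens:.3e} tokens, {flops_6nd:.3e} FLOPs")   # ~1.510e+12 tokens, ~9.060e+21 FLOPs
```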
Size Notes:
- Pre-training data: "In the first stage—pre-training—we use more than 1 million hours of video and 1 million images from diverse sources." "The resulting dataset, which we refer to as VideoMix22M (or VM22M), consists of 22 million samples. Table 1 lists these data sources and their weights."
- Post-training data: "only 62 hours of robot data"
- Tokenization: "we first patchify it as a sequence of tubelets of size 2 × 16 × 16"
- Schedule: "252 thousand iterations" (vs. the 12K warmup + 228K main = 240K breakdown quoted above)
- Total video tokens seen during pre-training, using the 240K-iteration figure and 2048 tokens per clip: 240,000 × 3072 × 2048 ≈ 1.51e12 tokens.
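As a shape-level illustration of the tubelet patchification quoted above, the sketch below splits a 16 × 256 × 256 clip into non-overlapping 2 × 16 × 16 tubelets and counts the resulting tokens; it handles shapes only (no learned projection), and the tensor layout is an assumption for illustration.

```python
# Hedged sketch: 2x16x16 tubelet patchification at the shape level only.
# A 16-frame, 256x256, 3-channel clip is split into non-overlapping tubelets
# of 2 frames x 16 x 16 pixels, each flattened into one token.
import numpy as np

clip = np.zeros((16, 256, 256, 3))               # (frames, H, W, channels)
T, H, W, C = clip.shape
t, p = 2, 16                                     # temporal / spatial tubelet size

tokens = (clip
          .reshape(T // t, t, H // p, p, W // p, p, C)
          .transpose(0, 2, 4, 1, 3, 5, 6)        # gather each tubelet's pixels together
          .reshape((T // t) * (H // p) * (W // p), t * p * p * C))
print(tokens.shape)                              # (2048, 1536): 2048 tokens per clip
```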
Notes: Table 4