We present Step-Video-T2V, a state-of-the-art text-to-video pre-trained model with 30B parameters and the ability to generate videos up to 204 frames in length. A deep compression Variational Autoencoder, Video-VAE, is designed for video generation tasks, achieving 16x16 spatial and 8x temporal compression ratios while maintaining exceptional video reconstruction quality. User prompts are encoded using two bilingual text encoders to handle both English and Chinese. A DiT with 3D full attention is trained using Flow Matching and is employed to denoise input noise into latent frames. A video-based DPO approach, Video-DPO, is applied to reduce artifacts and improve the visual quality of the generated videos. We also detail our training strategies and share key observations and insights. Step-Video-T2V's performance is evaluated on a novel video generation benchmark, Step-Video-T2V-Eval, demonstrating its state-of-the-art text-to-video quality when compared with both open-source and commercial engines. Additionally, we discuss the limitations of the current diffusion-based model paradigm and outline future directions for video foundation models. We make both Step-Video-T2V and Step-Video-T2V-Eval available at this https URL. The online version can be accessed from this https URL as well. Our goal is to accelerate the innovation of video foundation models and empower video content creators.
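To make the stated compression ratios concrete, the following minimal sketch computes the latent grid size implied by 16x16 spatial and 8x temporal compression. The clip dimensions (64 frames at 512x512 pixels) and the function name `latent_grid` are hypothetical, chosen only so that every dimension divides evenly; the sketch says nothing about how Video-VAE handles channel counts, padding, or frame counts that are not multiples of 8.

```python
SPATIAL_FACTOR = 16   # 16x16 spatial compression, as stated in the abstract
TEMPORAL_FACTOR = 8   # 8x temporal compression, as stated in the abstract


def latent_grid(frames: int, height: int, width: int) -> tuple[int, int, int]:
    """Return (latent_frames, latent_height, latent_width) for a clip.

    Illustrative helper, not the model's API. It only accepts sizes that
    divide evenly, because the source does not say how remainders are handled.
    """
    if frames % TEMPORAL_FACTOR or height % SPATIAL_FACTOR or width % SPATIAL_FACTOR:
        raise ValueError("dimensions must be multiples of the compression factors")
    return (
        frames // TEMPORAL_FACTOR,
        height // SPATIAL_FACTOR,
        width // SPATIAL_FACTOR,
    )


# Number of pixel positions represented by one latent position.
POSITIONS_PER_LATENT = SPATIAL_FACTOR * SPATIAL_FACTOR * TEMPORAL_FACTOR

# Hypothetical clip size, for illustration only.
grid = latent_grid(frames=64, height=512, width=512)
print(grid, POSITIONS_PER_LATENT)
```

Under these assumptions, each latent position stands in for 16 * 16 * 8 = 2048 pixel positions, which is what keeps 3D full attention over the latent frames tractable relative to attention over raw pixels.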
Notes: "We have constructed a datacenter comprising thousands of NVIDIA H800 GPUs"; "we have achieved 99% effective GPU training time over more than one month." Estimate: 9.89e14 FLOP/GPU/sec [bf16 assumed] * 720 hours * 3600 sec/hour * 5000 GPUs [assumption -> "Likely" confidence] * 0.32 [reported utilization] = 4.1015808e+24 FLOP (about 4.1e24 FLOP)
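The estimate above can be reproduced with a short script. This is a sketch of the arithmetic in the note only: every input is taken from the note as written, and the GPU count in particular is the note's own assumption rather than a reported figure. The function name `training_compute_flop` is illustrative.

```python
PEAK_FLOP_PER_GPU_SEC = 989e12  # H800 peak throughput, bf16 assumed in the note
TRAINING_HOURS = 720            # roughly one month of training
SECONDS_PER_HOUR = 3600
NUM_GPUS = 5000                 # assumption from the note ("Likely" confidence)
UTILIZATION = 0.32              # utilization figure used in the note


def training_compute_flop(peak_flop_per_sec: float, hours: float,
                          num_gpus: int, utilization: float) -> float:
    """Total training compute = peak rate * wall-clock seconds * GPUs * utilization."""
    return peak_flop_per_sec * hours * SECONDS_PER_HOUR * num_gpus * utilization


total_flop = training_compute_flop(
    PEAK_FLOP_PER_GPU_SEC, TRAINING_HOURS, NUM_GPUS, UTILIZATION
)
print(f"{total_flop:.4e} FLOP")
```

Because the result scales linearly with each factor, halving or doubling the assumed GPU count halves or doubles the estimate, so the GPU-count assumption dominates the uncertainty.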
Size Notes: "We constructed a large-scale video dataset comprising 2B video-text pairs and 3.8B image-text pairs." Training appears to use only a subset of it (see Table 6):
- 3.8B image-text pairs
- 644M low-resolution video-text pairs
- Post-filtering SFT dataset: 30M high-quality video-text pairs
Notes: 30B