- SOTA Performance: Wan2.1 consistently outperforms existing open-source models and state-of-the-art commercial solutions across multiple benchmarks.
- Multiple Tasks: Wan2.1 excels in Text-to-Video, Image-to-Video, Video Editing, Text-to-Image, and Video-to-Audio, advancing the field of video generation.
- Visual Text Generation: Wan2.1 is the first video model capable of generating both Chinese and English text, featuring robust text generation that enhances its practical applications.
- Powerful Video VAE: Wan-VAE delivers exceptional efficiency and performance, encoding and decoding 1080P videos of any length while preserving temporal information, making it an ideal foundation for video and image generation.
Notes: "Through extensive experimentation, the model is validated at scale, reaching 14 billion parameters. Subsequently, Wan has seen large-scale data comprising billions of images and videos, amounting to O(1) trillions of tokens in total." So likely between 1T and 10T tokens. Assume 3T. Transformer architecture, so 6ND should be a decent approximation. 6ND = 6 * 14e9 * 3e12 ~= 2.5e+23 FLOP
Size Notes: "Wan has seen large-scale data comprising billions of images and videos, amounting to O(1) trillions of tokens in total." with "Likely" confidence, assuming ~3 trillion
Notes: 14B parameters.