Whisper is a pre-trained model for automatic speech recognition (ASR) and speech translation. Trained on 680k hours of labelled data, Whisper models demonstrate a strong ability to generalise to many datasets and domains without the need for fine-tuning. Whisper was proposed in the paper Robust Speech Recognition via Large-Scale Weak Supervision by Alec Radford et al. from OpenAI. The original code repository can be found at https://github.com/openai/whisper. Compared to the Whisper large model, the large-v2 model is trained for 2.5x more epochs with added regularization for improved performance.
FLOPs: 1.1e+23
Notes: "Compared to the Whisper large model, the large-v2 model is trained for 2.5x more epochs with added regularization for improved performance." We (roughly) estimated Whisper v1 as 4.65e22 FLOPs; 2.5x that is 1.1625e23, or ~1.1e23.
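The estimate above can be sketched as a two-line calculation (the v1 figure of 4.65e22 FLOPs is the rough estimate stated in the notes, and the linear scaling with epoch count is an assumption):

```python
# Rough FLOPs estimate for Whisper large-v2.
# Assumptions: Whisper v1 (large) cost ~4.65e22 training FLOPs, and
# "2.5x more epochs" scales training compute linearly.
whisper_v1_flops = 4.65e22
epoch_multiplier = 2.5
whisper_v2_flops = whisper_v1_flops * epoch_multiplier
print(f"{whisper_v2_flops:.3e}")  # 1.162e+23, rounded in the entry to ~1.1e+23
```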
Training Code Accessibility: Apache 2.0 for weights; code for v1 is MIT: https://github.com/openai/whisper
Size Notes: "When scaled to 680,000 hours of multilingual and multitask supervision, the resulting models generalize well to standard benchmarks and are often competitive with prior fully supervised results but in a zeroshot transfer setting without the need for any finetuning." 13,680 words/h (estimate) * 680,000h = 9,302,400,000 words
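The dataset-size estimate in words can be checked directly (the 13,680 words-per-hour figure is the rough speech-rate estimate stated in the notes, not a value from the paper):

```python
# Training-data size in words.
# Assumption: ~13,680 spoken words per hour of audio (rough estimate).
words_per_hour = 13_680
hours_of_audio = 680_000
total_words = words_per_hour * hours_of_audio
print(f"{total_words:,}")  # 9,302,400,000
```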
Parameters: 1,550,000,000
Notes: 1550M