Whisper is a pre-trained model for automatic speech recognition (ASR) and speech translation. Trained on 680k hours of labelled data, Whisper models demonstrate a strong ability to generalise to many datasets and domains without the need for fine-tuning. Whisper was proposed in the paper Robust Speech Recognition via Large-Scale Weak Supervision by Alec Radford et al. from OpenAI. The original code repository can be found here. Compared to the Whisper large model, the large-v2 model is trained for 2.5x more epochs with added regularization for improved performance.
Notes: "Compared to the Whisper large model, the large-v2 model is trained for 2.5x more epochs with added regularization for improved performance." We (roughly) estimated Whisper v1 training compute as 4.65e22 FLOP. 2.5x that is 1.16e23, or ~1.1e23 FLOP.
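The scaling arithmetic above can be checked with a short snippet; the 4.65e22 FLOP figure is the notes' own rough estimate for Whisper v1, not a published number:

```python
# Rough training-compute estimate for Whisper large-v2,
# using the notes' assumed values.
whisper_v1_flop = 4.65e22   # rough estimate of Whisper v1 training compute (FLOP)
epoch_multiplier = 2.5      # large-v2 was trained for 2.5x more epochs

large_v2_flop = whisper_v1_flop * epoch_multiplier
print(f"{large_v2_flop:.2e}")  # 1.16e+23
```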
Size Notes: "When scaled to 680,000 hours of multilingual and multitask supervision, the resulting models generalize well to standard benchmarks and are often competitive with prior fully supervised results but in a zero-shot transfer setting without the need for any fine-tuning." 13,680 words/h (estimate) * 680,000h = 9,302,400,000 words
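The word-count estimate is a simple product; the 13,680 words/h speech-rate figure is the notes' own assumption:

```python
# Dataset-size estimate for Whisper's training data,
# using the notes' assumed speech rate.
words_per_hour = 13_680   # rough estimate of spoken words per hour of audio
hours = 680_000           # hours of labelled audio in Whisper's training set

total_words = words_per_hour * hours
print(f"{total_words:,}")  # 9,302,400,000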
Parameter Notes: 1550M parameters