Whisper is a pre-trained model for automatic speech recognition (ASR) and speech translation. Trained on 680k hours of labelled data, Whisper models demonstrate a strong ability to generalise to many datasets and domains without the need for fine-tuning. Whisper was proposed in the paper Robust Speech Recognition via Large-Scale Weak Supervision by Alec Radford et al. from OpenAI. The original code repository can be found here.
Notes: This could be derived from Whisper v1, which according to the paper was trained on 680k hours of audio for between 2 and 3 epochs. Whisper v3 was trained on 5 million hours for 2 epochs, so it saw roughly 5-7x as many audio-hours, and it has the same architecture. We have an estimate of 4.65e22 FLOP for Whisper v1. Assume Whisper v1 was trained for 2.5 epochs, or 2.5 * 680k = 1.7M hours seen. Whisper v3 saw 2 * 5M = 10M hours. Scaling compute linearly with hours seen: 10/1.7 * 4.65e22 ~= 2.7e23 FLOP.
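The scaling above can be checked with a short Python sketch. It assumes, as the notes do, that compute scales linearly with audio-hours seen when the architecture is unchanged. The constant and function names are illustrative, and the 2-epoch and 3-epoch cases are added only to show how sensitive the result is to the assumed 2.5 epochs for v1.

```python
# Scale the Whisper v1 compute estimate by the ratio of audio-hours seen.
# Assumption: same architecture, so compute is proportional to hours seen.

V1_COMPUTE = 4.65e22        # existing estimate for Whisper v1 (from the notes)
V1_DATASET_HOURS = 680_000  # v1 dataset size in hours
V3_DATASET_HOURS = 5_000_000
V3_EPOCHS = 2


def scaled_compute(v1_epochs: float) -> float:
    """Estimate v3 compute for a given assumption about v1's epoch count."""
    v1_hours_seen = V1_DATASET_HOURS * v1_epochs
    v3_hours_seen = V3_DATASET_HOURS * V3_EPOCHS
    return V1_COMPUTE * v3_hours_seen / v1_hours_seen


central = scaled_compute(2.5)  # midpoint assumption used in the notes
upper = scaled_compute(2.0)    # fewer v1 epochs -> larger ratio
lower = scaled_compute(3.0)    # more v1 epochs -> smaller ratio

print(f"central: {central:.2e}")
print(f"range:   {lower:.2e} to {upper:.2e}")
```

Under these assumptions the central value comes out near 2.7e23, and the 2-3 epoch uncertainty for v1 moves it between roughly 2.3e23 and 3.4e23.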
Size Notes: English audio is roughly 228 words per minute (wpm): https://docs.google.com/document/d/1G3vvQkn4x_W71MKg0GmHVtzfd9m0y3_Ofcoew0v902Q/edit#heading=h.sxcem9l5k3ce The dataset is multilingual, and other languages seem to have lower wpm. So using 200 wpm, we have 200 wpm * 60 min/hour * 5 million hours = 60,000,000,000 (60B) words.
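The same word-count arithmetic as a minimal Python sketch. The variable names are illustrative, and the 228 wpm line is included only for comparison with the English-only rate quoted above.

```python
# Convert hours of audio to an approximate word count.
MINUTES_PER_HOUR = 60
DATASET_HOURS = 5_000_000


def words_in_dataset(words_per_minute: int) -> int:
    """Approximate total words for a given speaking rate."""
    return words_per_minute * MINUTES_PER_HOUR * DATASET_HOURS


multilingual_words = words_in_dataset(200)  # rounded-down multilingual rate
english_rate_words = words_in_dataset(228)  # English-only rate, for comparison

print(f"{multilingual_words:,}")
print(f"{english_rate_words:,}")
```

Using the English-only rate instead would raise the estimate to about 68B words, so the choice of 200 wpm lowers the figure by roughly 12%.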