We show for the first time that learning powerful representations from speech audio alone followed by fine-tuning on transcribed speech can outperform the best semi-supervised methods while being conceptually simpler. wav2vec 2.0 masks the speech input in the latent space and solves a contrastive task defined over a quantization of the latent representations which are jointly learned. Experiments using all labeled data of Librispeech achieve 1.8/3.3 WER on the clean/other test sets. When lowering the amount of labeled data to one hour, wav2vec 2.0 outperforms the previous state of the art on the 100 hour subset while using 100 times less labeled data. Using just ten minutes of labeled data and pre-training on 53k hours of unlabeled data still achieves 4.8/8.2 WER. This demonstrates the feasibility of speech recognition with limited amounts of labeled data.
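For reference, a minimal PyTorch sketch of the contrastive term described in the abstract (not the authors' implementation; the tensor shapes, distractor handling, and default temperature are illustrative assumptions):

    import torch
    import torch.nn.functional as F

    def contrastive_loss(context, target, distractors, temperature=0.1):
        # context:     (D,) context-network output at one masked time step
        # target:      (D,) quantized latent for that time step (the positive)
        # distractors: (K, D) quantized latents sampled from other masked steps
        # temperature: softmax temperature (kappa in the paper; 0.1 is an assumed default)
        candidates = torch.cat([target.unsqueeze(0), distractors], dim=0)     # (K+1, D)
        sims = F.cosine_similarity(context.unsqueeze(0), candidates, dim=-1)  # (K+1,)
        logits = (sims / temperature).unsqueeze(0)                            # (1, K+1)
        # the positive sits at index 0, so cross-entropy is -log softmax(logits)[0]
        return F.cross_entropy(logits, torch.zeros(1, dtype=torch.long))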
FLOPs: 3.87e+21
Notes: From surveying the authors: "We trained the base model on 64 V100 GPUs for 400k updates; this takes about 3 days to complete. The large model is trained on 128 V100 GPUs for 1 million updates, and this takes about 7 days to complete."
V100 GPU peak: 125 TFLOP/s (https://www.nvidia.com/en-gb/data-center/tesla-v100/). Utilization is assumed to be 40%, the default for non-language domains (https://epoch.ai/blog/estimating-training-compute).
128 GPUs * 40% * 125 TFLOP/s * 7 days * 24 h/day * 3,600 s/h ≈ 3.870720e+21 FLOP
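The estimate can be reproduced directly (a sketch of the arithmetic above for the LARGE model, not an official figure):

    gpus = 128                  # V100 GPUs used for the LARGE model
    peak_flops = 125e12         # V100 peak throughput, FLOP/s
    utilization = 0.40          # assumed utilization
    seconds = 7 * 24 * 3600     # ~7 days of training
    total = gpus * utilization * peak_flops * seconds
    print(f"{total:.6e}")       # 3.870720e+21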
Training Code Accessibility: https://github.com/facebookresearch/fairseq/blob/1bba712622b8ae4efb3eb793a8a40da386fe11d0/examples/wav2vec/README.md
fairseq(-py) is MIT-licensed, and the license applies to the pre-trained models as well. The repository contains model weights along with pre-training and fine-tuning code.
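As a hedged sketch (not the README's exact snippet), a released checkpoint can typically be loaded through fairseq's checkpoint utilities; the path below is a placeholder, and the return signature varies across fairseq versions, so treat the linked README as authoritative:

    import fairseq

    ckpt = "/path/to/wav2vec2_checkpoint.pt"  # placeholder; download links are in the README
    models, cfg, task = fairseq.checkpoint_utils.load_model_ensemble_and_task([ckpt])
    model = models[0].eval()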
Hardware: NVIDIA Tesla V100 DGXS 32 GB
Size Notes: p. 4, Section 4.1: "As unlabeled data we consider the Librispeech corpus [40] without transcriptions containing 960 hours of audio (LS-960) or the audio data from LibriVox (LV-60k). For the latter we follow the preprocessing of [27] resulting in 53.2k hours of audio." 53.2k hours * 13,680 words/hour = 727,776,000 words.
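The word-count conversion above, as a sketch (the 13,680 words/hour rate is this estimate's assumption, not a figure from the paper):

    hours = 53_200                  # LV-60k audio after preprocessing
    words_per_hour = 13_680         # assumed average speech rate
    print(hours * words_per_hour)   # 727776000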
Parameters: 317,000,000
Notes: Section 5.1: "We consider two model sizes: BASE (95m parameters) and LARGE (317m parameters)"