We show for the first time that learning powerful representations from speech audio alone followed by fine-tuning on transcribed speech can outperform the best semi-supervised methods while being conceptually simpler. wav2vec 2.0 masks the speech input in the latent space and solves a contrastive task defined over a quantization of the latent representations which are jointly learned. Experiments using all labeled data of Librispeech achieve 1.8/3.3 WER on the clean/other test sets. When lowering the amount of labeled data to one hour, wav2vec 2.0 outperforms the previous state of the art on the 100 hour subset while using 100 times less labeled data. Using just ten minutes of labeled data and pre-training on 53k hours of unlabeled data still achieves 4.8/8.2 WER. This demonstrates the feasibility of speech recognition with limited amounts of labeled data.
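For reference, a minimal PyTorch sketch of the contrastive term described in the abstract (not the authors' implementation; the tensor shapes, distractor handling, and default temperature are illustrative assumptions):

    import torch
    import torch.nn.functional as F

    def contrastive_loss(context, target, distractors, temperature=0.1):
        # context:     (D,) context-network output at one masked time step
        # target:      (D,) quantized latent for that time step (the positive)
        # distractors: (K, D) quantized latents sampled from other masked steps
        # temperature: softmax temperature (kappa in the paper; 0.1 is an assumed default)
        candidates = torch.cat([target.unsqueeze(0), distractors], dim=0)     # (K+1, D)
        sims = F.cosine_similarity(context.unsqueeze(0), candidates, dim=-1)  # (K+1,)
        logits = (sims / temperature).unsqueeze(0)                            # (1, K+1)
        # the positive sits at index 0, so cross-entropy is -log softmax(logits)[0]
        return F.cross_entropy(logits, torch.zeros(1, dtype=torch.long))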
FLOPs: 3.87e+21
Notes: From surveying the authors: "We trained the base model on 64 V100 GPUs for 400k updates; this takes about 3 days to complete. The large model is trained on 128 V100 GPUs for 1 million updates, and this takes about 7 days to complete."
V100 GPU peak: 125 TFLOP/s (https://www.nvidia.com/en-gb/data-center/tesla-v100/). Utilization is assumed to be 40%, the default for non-language domains (https://epoch.ai/blog/estimating-training-compute).
128 GPUs * 40% * 125 TFLOP/s * 7 days * 24 h/day * 3,600 s/h ≈ 3.870720e+21 FLOP
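The estimate can be reproduced directly (a sketch of the arithmetic above for the LARGE model, not an official figure):

    gpus = 128                  # V100 GPUs used for the LARGE model
    peak_flops = 125e12         # V100 peak throughput, FLOP/s
    utilization = 0.40          # assumed utilization
    seconds = 7 * 24 * 3600     # ~7 days of training
    total = gpus * utilization * peak_flops * seconds
    print(f"{total:.6e}")       # 3.870720e+21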
Training Code Accessibility: https://github.com/facebookresearch/fairseq/blob/1bba712622b8ae4efb3eb793a8a40da386fe11d0/examples/wav2vec/README.md
fairseq(-py) is MIT-licensed, and the license applies to the pre-trained models as well. The repository contains model weights along with pre-training and fine-tuning code.
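As a hedged sketch (not the README's exact snippet), a released checkpoint can typically be loaded through fairseq's checkpoint utilities; the path below is a placeholder, and the return signature varies across fairseq versions, so treat the linked README as authoritative:

    import fairseq

    ckpt = "/path/to/wav2vec2_checkpoint.pt"  # placeholder; download links are in the README
    models, cfg, task = fairseq.checkpoint_utils.load_model_ensemble_and_task([ckpt])
    model = models[0].eval()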
Hardware: NVIDIA Tesla V100 DGXS 32 GB
Size Notes: p. 4, Section 4.1: "As unlabeled data we consider the Librispeech corpus [40] without transcriptions containing 960 hours of audio (LS-960) or the audio data from LibriVox (LV-60k). For the latter we follow the preprocessing of [27] resulting in 53.2k hours of audio." 53.2k hours * 13,680 words/hour = 727,776,000 words.
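The word-count conversion above, as a sketch (the 13,680 words/hour rate is this estimate's assumption, not a figure from the paper):

    hours = 53_200                  # LV-60k audio after preprocessing
    words_per_hour = 13_680         # assumed average speech rate
    print(hours * words_per_hour)   # 727776000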
Parameters: 317,000,000
Notes: Section 5.1: "We consider two model sizes: BASE (95m parameters) and LARGE (317m parameters)"