We propose Speculative Decoding (SpecDec) to formally study, for the first time, the idea of exploiting speculative execution to accelerate autoregressive (AR) decoding. Speculative Decoding has two innovations: Spec-Drafter -- an independent model specially optimized for efficient and accurate drafting -- and Spec-Verification -- a reliable method for efficiently verifying the drafted tokens in the decoding paradigm. Experimental results on various seq2seq tasks, including machine translation and abstractive summarization, show that our approach can achieve around speedup for the popular Transformer architectures with generation quality comparable to beam search decoding, challenging the impression that the draft-then-verify paradigm introduces only speedup. In addition to the remarkable speedup, we also demonstrate three additional advantages of SpecDec, revealing its practical value for accelerating generative models in real-world applications. Our models and code are available at this https URL.
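The draft-then-verify loop described in the abstract can be illustrated with a toy sketch. This is not the paper's Spec-Drafter or Spec-Verification: it assumes a simple exact-match rule against the target model's greedy choice, and it verifies drafted tokens one by one where a real system would score the whole block in a single parallel forward pass. All names (`draft_then_verify`, `toy_target`, `toy_drafter`, `block_size`) are hypothetical.

```python
from typing import Callable, List

Token = int
NextToken = Callable[[List[Token]], Token]


def greedy_decode(target_next: NextToken, prefix: List[Token],
                  max_new_tokens: int) -> List[Token]:
    """Plain autoregressive greedy decoding, used as the reference."""
    out = list(prefix)
    for _ in range(max_new_tokens):
        out.append(target_next(out))
    return out


def draft_then_verify(target_next: NextToken, draft_next: NextToken,
                      prefix: List[Token], block_size: int,
                      max_new_tokens: int) -> List[Token]:
    """Toy draft-then-verify loop with exact-match verification (an assumption)."""
    out = list(prefix)
    limit = len(prefix) + max_new_tokens
    while len(out) < limit:
        # Draft: the cheap model proposes up to block_size tokens.
        k = min(block_size, limit - len(out))
        drafted: List[Token] = []
        for _ in range(k):
            drafted.append(draft_next(out + drafted))
        # Verify: keep drafted tokens while they match the target's choice.
        # On the first mismatch, keep the target's token and discard the rest.
        for tok in drafted:
            expected = target_next(out)
            out.append(expected)
            if expected != tok:
                break
    return out


def toy_target(seq: List[Token]) -> Token:
    """Hypothetical target model: counts upward modulo 10."""
    return (seq[-1] + 1) % 10


def toy_drafter(seq: List[Token]) -> Token:
    """Hypothetical drafter: agrees with the target except after token 4."""
    return 0 if seq[-1] == 4 else (seq[-1] + 1) % 10
```

Under this exact-match rule the output is identical to greedy decoding with the target model; any speedup in a real system comes from how many drafted tokens are accepted per verification pass.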
Notes: 6 FLOP/parameter/token * 500,000,000 parameters * 39,321,600,000 tokens = 117,964,800,000,000,000,000 FLOP (about 1.18e20 FLOP)
Size Notes: max tokens 4096, update frequency 4, 8 GPUs, max updates 300K: 4096 * 4 * 8 * 300,000 = 39,321,600,000 tokens
Notes: 0.5B 12-layer encoder + 2-layer decoder, d=512/2048
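The arithmetic in the notes above can be reproduced with a short sketch. The function and parameter names are hypothetical; the 6 FLOP/parameter/token factor and all input values are taken directly from the notes.

```python
def training_tokens(max_tokens: int, update_freq: int,
                    num_gpus: int, max_updates: int) -> int:
    """Tokens seen in training: tokens per update times number of updates."""
    return max_tokens * update_freq * num_gpus * max_updates


def training_flop(num_params: int, num_tokens: int,
                  flop_per_param_per_token: int = 6) -> int:
    """Compute estimate: 6 FLOP per parameter per token, as in the notes."""
    return flop_per_param_per_token * num_params * num_tokens


# Values from the notes: 4096 max tokens, update frequency 4, 8 GPUs, 300K updates.
tokens = training_tokens(max_tokens=4096, update_freq=4,
                         num_gpus=8, max_updates=300_000)
# 0.5B parameters, as stated in the notes.
flop = training_flop(num_params=500_000_000, num_tokens=tokens)
print(f"{tokens:,} tokens, {flop:.4e} FLOP")
```

Python integers are arbitrary precision, so the product is exact rather than a floating-point approximation.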