"Large language models (LLMs) pretrained on vast source code have achieved prominent progress in code intelligence. However, existing code LLMs have two main limitations in terms of architecture and pretraining tasks. First, they often adopt a specific architecture (encoder-only or decoder-only) or rely on a unified encoder-decoder network for different downstream tasks. The former paradigm is limited by inflexibility in applications while in the latter, the model is treated as a single system for all tasks, leading to suboptimal performance on a subset of tasks. Secondly, they often employ a limited set of pretraining objectives which might not be relevant to some downstream tasks and hence result in substantial performance degrade. To address these limitations, we propose ``CodeT5+'', a family of encoder-decoder LLMs for code in which component modules can be flexibly combined to suit a wide range of downstream code tasks. Such flexibility is enabled by our proposed mixture of pretraining objectives to mitigate the pretrain-finetune discrepancy. These objectives cover span denoising, contrastive learning, text-code matching, and causal LM pretraining tasks, on both unimodal and bimodal multilingual code corpora. Furthermore, we propose to initialize CodeT5+ with frozen off-the-shelf LLMs without training from scratch to efficiently scale up our models, and explore instruction-tuning to align with natural language instructions. We extensively evaluate CodeT5+ on over 20 code-related benchmarks in different settings, including zero-shot, finetuning, and instruction-tuning. We observe state-of-the-art (SoTA) model performance on various code-related tasks, such as code generation and completion, math programming, and text-to-code retrieval tasks. Particularly, our instruction-tuned CodeT5+ 16B achieves new SoTA results on HumanEval code generation task against other open code LLMs."
Official research release of the CodeT5 and CodeT5+ models for code understanding and generation from Salesforce Research, introduced in the following papers:
Title: CodeT5+: Open Code Large Language Models for Code Understanding and Generation
Authors: Yue Wang*, Hung Le*, Akhilesh Deepak Gotmare, Nghi D.Q. Bui, Junnan Li, Steven C.H. Hoi (* indicates equal contribution)
Title: CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation
Authors: Yue Wang, Weishi Wang, Shafiq Joty, Steven C.H. Hoi
In practice, CodeT5 and CodeT5+ models can be deployed as an AI-powered coding assistant to boost the productivity of software developers. At Salesforce, we build an AI coding assistant demo using CodeT5 as a VS Code plugin to provide three capabilities:

- Text-to-code generation: generate code based on a natural language description.
- Code autocompletion: complete the body of a function given its signature or target function name.
- Code summarization: generate a natural language summary of a function.
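As a minimal, hedged sketch of the generation capability (not the VS Code plugin itself), the snippet below loads a small CodeT5+ checkpoint from Hugging Face and completes a function stub; it assumes the publicly released Salesforce/codet5p-220m checkpoint and the transformers library.

```python
# Minimal sketch: code completion with a small CodeT5+ seq2seq checkpoint.
# Assumes the Hugging Face checkpoint "Salesforce/codet5p-220m" and the transformers library.
from transformers import AutoTokenizer, T5ForConditionalGeneration

checkpoint = "Salesforce/codet5p-220m"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = T5ForConditionalGeneration.from_pretrained(checkpoint)

# Feed a partial function and let the model propose a continuation.
inputs = tokenizer("def print_hello_world():", return_tensors="pt")
outputs = model.generate(**inputs, max_length=16)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```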
May 2023
CodeT5+ paper and models are released!🔥
paper | code | model | blog
Sep 2022
Our CodeRL paper has been accepted to NeurIPS 2022!
paper | code | blog
July 2022
We release two large-sized CodeT5 checkpoints on HuggingFace, Salesforce/codet5-large and Salesforce/codet5-large-ntp-py, which are introduced in the CodeRL paper.
Oct 2021
We release fine-tuned checkpoints for all the downstream tasks covered in the paper. In addition, we release a CodeT5-base fine-tuned checkpoint (Salesforce/codet5-base-multi-sum) for multilingual code summarization.
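As a hedged usage sketch for that summarization checkpoint (assuming the Salesforce/codet5-base-multi-sum checkpoint on Hugging Face and the transformers library), the snippet below generates a short natural language summary of a Python function:

```python
# Sketch: multilingual code summarization with the fine-tuned CodeT5-base checkpoint.
# Assumes the Hugging Face checkpoint "Salesforce/codet5-base-multi-sum".
from transformers import RobertaTokenizer, T5ForConditionalGeneration

checkpoint = "Salesforce/codet5-base-multi-sum"
tokenizer = RobertaTokenizer.from_pretrained(checkpoint)
model = T5ForConditionalGeneration.from_pretrained(checkpoint)

code = "def greet(name):\n    return 'Hello, ' + name + '!'"
input_ids = tokenizer(code, return_tensors="pt").input_ids
summary_ids = model.generate(input_ids, max_length=20)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```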
Sep 2021
CodeT5 paper accepted to EMNLP 2021 and models are released!
paper | code | model | model card | blog
If you find this code to be useful for your research, please consider citing:
@inproceedings{
wang2021codet5,
title={CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation},
author={Wang, Yue and Wang, Weishi and Joty, Shafiq and Hoi, Steven C. H.},
booktitle={EMNLP},
year={2021},
}
@inproceedings{
le2022coderl,
title={CodeRL: Mastering Code Generation through Pretrained Models and Deep Reinforcement Learning},
author={Le, Hung and Wang, Yue and Gotmare, Akhilesh Deepak and Savarese, Silvio and Hoi, Steven C. H.},
booktitle={NeurIPS},
year={2022}
}
@article{
wang2023codet5plus,
title={CodeT5+: Open Code Large Language Models for Code Understanding and Generation},
author={Wang, Yue and Le, Hung and Gotmare, Akhilesh Deepak and Bui, Nghi D.Q. and Li, Junnan and Hoi, Steven C. H.},
journal={arXiv preprint},
year={2023}
}
The code is released under the BSD-3 License (see LICENSE.txt for details), but we also ask that users respect the following:

This software should not be used to promote or profit from:

- violence, hate, and division,
- environmental destruction,
- abuse of human rights, or
- the destruction of people's physical and mental health.
We encourage users of this software to tell us about the applications in which they are putting it to use by emailing codeT5@salesforce.com, and to use appropriate documentation when developing high-stakes applications of this model.
Please create a GitHub issue if you have any questions, suggestions, requests or bug-reports. We welcome PRs!
Notes:
- Pretraining data size: "We use the CodeT5 tokenizer to tokenize the multilingual dataset, resulting in 51.5B tokens."
- Model sizes: "We implemented a family of CodeT5+ models, with model sizes ranging from 220M to 16B."