codet5-base

Maintained By
Salesforce

CodeT5-base

  • Author: Salesforce
  • License: Apache 2.0
  • Paper: Link to Paper
  • Training Data: CodeSearchNet, plus C/C# functions from BigQuery

What is codet5-base?

CodeT5-base is a unified pre-trained encoder-decoder Transformer model specifically designed for code understanding and generation tasks. Developed by Salesforce, it leverages code semantics through developer-assigned identifiers and employs a novel identifier-aware pre-training approach.

Implementation Details

The model uses a code-specific BPE tokenizer and can be loaded directly with the HuggingFace Transformers library. It was pre-trained on approximately 8.35 million code instances and supports both understanding and generation tasks through a unified text-to-text framework.

  • Built on T5 architecture with code-specific modifications
  • Uses RobertaTokenizer for preprocessing
  • Supports masked span prediction and conditional generation

Core Capabilities

  • Code summarization and generation
  • Code translation between programming languages
  • Code refinement and optimization
  • Defect detection in code
  • Clone detection capabilities
  • Natural language to code generation

Frequently Asked Questions

Q: What makes this model unique?

CodeT5-base's uniqueness lies in its identifier-aware pre-training task and bimodal dual generation capability, allowing it to better understand and process code semantics while maintaining strong natural language alignment.

Q: What are the recommended use cases?

The model excels in various code-related tasks including code summarization, generation, translation, refinement, and defect detection. It's particularly suitable for developers and researchers working on automated code analysis and generation tools.
