ProtGPT2

Maintained by: nferruz

  • Parameters: 738 million
  • Architecture: GPT2 Transformer (36 layers, 1280 dim)
  • License: Apache 2.0
  • Paper: Nature Communications
  • Training Data: UniRef50 (2021_04)

What is ProtGPT2?

ProtGPT2 is a groundbreaking language model designed specifically for protein sequence generation and engineering. Built on the GPT2 architecture, it has been trained to understand and generate protein sequences that maintain natural protein characteristics while exploring novel protein space. The model was trained on the UniRef50 database in a self-supervised manner, learning to predict the next amino acid in sequences without requiring explicit annotations.
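Because the model is a standard causal LM, generation needs no labels or prompts: sampling from an empty context yields de novo sequences. The sketch below assumes the `nferruz/ProtGPT2` checkpoint on the Hugging Face Hub and the `transformers` library; the sampling parameters are illustrative defaults, not prescriptions.

```python
# Zero-shot sequence generation with ProtGPT2 (sketch; sampling
# parameters are illustrative and may need tuning for your task).

def clean_sequence(generated: str) -> str:
    """ProtGPT2 emits FASTA-style text: an <|endoftext|> separator and a
    newline every 60 residues. Strip both to get a plain amino-acid string."""
    return generated.replace("<|endoftext|>", "").replace("\n", "").strip()

def generate_sequences(n: int = 5, max_length: int = 100) -> list[str]:
    # Imported here so the helper above works without transformers installed.
    from transformers import pipeline

    protgpt2 = pipeline("text-generation", model="nferruz/ProtGPT2")
    outputs = protgpt2(
        "<|endoftext|>",            # empty prompt -> unconditional sampling
        max_length=max_length,
        do_sample=True,
        top_k=950,                  # broad top-k keeps amino-acid diversity
        repetition_penalty=1.2,
        num_return_sequences=n,
        eos_token_id=0,
    )
    return [clean_sequence(o["generated_text"]) for o in outputs]

# The post-processing step can be checked without downloading the model:
raw = "<|endoftext|>MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ\nAPILSRVGDGTQDNLSGAEK"
print(clean_sequence(raw))  # prints one contiguous residue string
```

The heavy pipeline call is kept inside `generate_sequences` so the lightweight post-processing helper can be used and tested on its own.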

Implementation Details

The model stacks 36 transformer layers with an embedding dimension of 1280, for a total of 738 million parameters. Training was conducted on 128 NVIDIA A100 GPUs for 50 epochs, with a block size of 512 and a batch size of 1024, using the Adam optimizer.

  • Specialized tokenization for protein sequences
  • Causal language modeling objective
  • Zero-shot generation capabilities
  • Fine-tuning support for specific protein families
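
Fine-tuning on a specific protein family can reuse the stock causal language modeling script from the `transformers` examples. The invocation below is a sketch: the file names, output directory, and learning rate are placeholders to adapt, and the training files are assumed to hold one family's sequences in the same FASTA-like format used for pretraining.

```shell
# Sketch: fine-tune ProtGPT2 on a custom protein family using the
# run_clm.py example script from the transformers repository.
# training.txt / validation.txt are placeholder files.
python run_clm.py \
    --model_name_or_path nferruz/ProtGPT2 \
    --train_file training.txt \
    --validation_file validation.txt \
    --tokenizer_name nferruz/ProtGPT2 \
    --do_train --do_eval \
    --learning_rate 1e-06 \
    --output_dir protgpt2-finetuned
```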

Core Capabilities

  • De novo protein sequence generation
  • Protein engineering and optimization
  • Conservation of natural protein features
  • Perplexity-based sequence quality assessment
  • Integration with popular protein analysis pipelines
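
Perplexity-based quality assessment works because ProtGPT2 is a causal language model: sequences the model finds more "natural" get lower perplexity. A minimal sketch, assuming the checkpoint is loaded via `AutoModelForCausalLM` with PyTorch; the perplexity arithmetic itself is model-independent.

```python
import math

def perplexity_from_mean_nll(mean_nll: float) -> float:
    """Perplexity is the exponential of the mean negative
    log-likelihood per token."""
    return math.exp(mean_nll)

def score_sequence(model, tokenizer, sequence: str) -> float:
    """Perplexity of one sequence under a causal LM such as ProtGPT2.
    Lower means the model considers the sequence more natural."""
    import torch  # imported here so the pure-math helper stands alone

    ids = tokenizer(sequence, return_tensors="pt").input_ids
    with torch.no_grad():
        # With labels=input_ids, Hugging Face causal LMs return the
        # mean cross-entropy (NLL) over predicted tokens as .loss.
        loss = model(input_ids=ids, labels=ids).loss
    return perplexity_from_mean_nll(loss.item())

# Sanity checks on the arithmetic alone:
print(perplexity_from_mean_nll(0.0))             # 1.0 (perfectly predicted)
print(perplexity_from_mean_nll(math.log(25.0)))  # ~25: uniform over 25 tokens
```

Ranking a batch of generated candidates by `score_sequence` and keeping the lowest-perplexity ones is a common filtering step before downstream structural validation.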

Frequently Asked Questions

Q: What makes this model unique?

ProtGPT2 stands out for its ability to generate protein sequences that maintain natural protein characteristics while exploring previously unseen regions of the protein space. It combines the power of transformer architecture with specialized training on protein sequences, making it a powerful tool for protein engineering and design.

Q: What are the recommended use cases?

The model excels in de novo protein design, protein engineering, and generating protein sequence variants. It can be used either in zero-shot mode for general protein generation or fine-tuned on specific protein families for targeted design tasks. The model's perplexity scores can be used to evaluate the quality of generated sequences.
