Model Card for ProtGPT3-10B

Model Details

Model Description

ProtGPT3-10B is a single-sequence autoregressive protein language model for protein sequence generation. It is part of the ProtGPT3 family, an open-source suite of promptable and aligned protein language models ranging from 112M to 10B parameters. ProtGPT3 models use a causal Mixtral-style Mixture-of-Experts architecture and are trained for causal language modeling on protein sequences.

The single-sequence ProtGPT3 models can generate proteins in either the N-to-C or the C-to-N direction, selected with special directional tokens. The model is intended for unconditional or prefix-conditioned protein sequence generation and can be used as a base model for downstream protein design workflows.
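One way to read the two generation directions is as two autoregressive factorizations of the same sequence. The formulation below is a sketch rather than the authors' stated objective: the symbols x_1, ..., x_L (residues numbered from the N-terminus), d (the directional token), and θ (model parameters) are introduced here for illustration, and the exact conditioning used in training is an assumption.

```latex
% N-to-C: each residue is predicted from the residues before it
p_\theta(x \mid d = \text{N-to-C}) = \prod_{i=1}^{L} p_\theta\left(x_i \mid d,\, x_1, \dots, x_{i-1}\right)

% C-to-N: each residue is predicted from the residues after it
p_\theta(x \mid d = \text{C-to-N}) = \prod_{i=1}^{L} p_\theta\left(x_i \mid d,\, x_{i+1}, \dots, x_L\right)
```

Under this reading, a prefix prompt fixes the first few N-terminal residues in the N-to-C case and the last few C-terminal residues in the C-to-N case.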

Uses

Direct Use

ProtGPT3-10B can be used for autoregressive generation of protein sequences. Users can generate sequences unconditionally or condition generation on an amino-acid prefix.

Downstream Use

The model may be fine-tuned or incorporated into protein design workflows, including family-specific generation, protein variant generation, and computational screening pipelines.

Out-of-Scope Use

The model should not be used as the sole basis for experimental, clinical, environmental, or safety-critical decisions. Generated proteins require downstream computational and experimental validation. The model is not guaranteed to generate functional, soluble, safe, or synthesizable proteins.

Bias, Risks, and Limitations

ProtGPT3-10B learns from public protein sequence datasets and may reproduce biases present in those datasets. Generated sequences may be low-complexity, nonfunctional, unstable, insoluble, or biologically implausible. Protein generation models may also present dual-use risks if used irresponsibly.

Recommendations

Users should apply appropriate computational filters, expert review, and experimental validation before using generated sequences. Users should also consider responsible-use practices for generative protein design.
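As a starting point for such computational filters, the sketch below rejects sequences that are too short or long, contain non-canonical residues, or look low-complexity. The function names and all thresholds (`min_len`, `max_len`, `min_entropy`, `max_run`) are illustrative assumptions, not values recommended by the ProtGPT3 authors, and should be tuned for the protein family of interest.

```python
import math
from collections import Counter

# The 20 canonical amino acids; ambiguity codes such as X or B are rejected here.
CANONICAL_AA = set("ACDEFGHIKLMNPQRSTVWY")

def shannon_entropy(seq: str) -> float:
    """Shannon entropy (bits) of the residue composition of seq."""
    counts = Counter(seq)
    n = len(seq)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def longest_run(seq: str) -> int:
    """Length of the longest stretch of a single repeated residue."""
    best = run = 0
    prev = None
    for aa in seq:
        run = run + 1 if aa == prev else 1
        best = max(best, run)
        prev = aa
    return best

def passes_basic_filters(seq, min_len=50, max_len=1024, min_entropy=3.0, max_run=8):
    """Return True if seq passes the length, alphabet, and complexity checks.

    All thresholds are hypothetical defaults chosen for illustration.
    """
    if not (min_len <= len(seq) <= max_len):
        return False
    if not set(seq) <= CANONICAL_AA:
        return False
    if shannon_entropy(seq) < min_entropy:
        return False
    return longest_run(seq) <= max_run
```

Filters like these only remove obviously degenerate outputs; they say nothing about folding, function, or safety, which still require structure prediction, expert review, and experiments.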

How to Get Started with the Model

Install dependencies:

pip install transformers accelerate torch

Load the model and tokenizer:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "protgpt3/ProtGPT3-10B"  # Replace with the final checkpoint name

# Load tokenizer for generation: prepend BOS, do not append EOS
tokenizer = AutoTokenizer.from_pretrained(
    model_id,
    trust_remote_code=True,
    add_bos_token=True,
    add_eos_token=False,
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

model.eval()

Generate a protein sequence

import torch

prompt = ""  # Empty for unconditional generation; optionally a directional token plus an amino-acid prefix

inputs = tokenizer(prompt, return_tensors="pt", padding_side="left").to(model.device)

with torch.no_grad():
    output_ids = model.generate(
        inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        max_new_tokens=512,
        do_sample=True,
        temperature=0.8,
        top_p=0.9,
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.pad_token_id,
    )

sequence = tokenizer.decode(output_ids[0], skip_special_tokens=True)
# The output includes the directional token "1" (N-to-C) or "2" (C-to-N),
# which shows the direction in which the sequence was generated
print(sequence)

Generate from an amino-acid prefix

import torch

# Forward N-to-C generation uses the special token "1";
# use "2" instead of "1" for reverse C-to-N generation
prefix = "1MKT"

inputs = tokenizer(prefix, return_tensors="pt").to(model.device)

with torch.no_grad():
    output_ids = model.generate(
        inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        max_new_tokens=256,
        do_sample=True,
        temperature=0.8,
        top_p=0.9,
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.eos_token_id,
    )

sequence = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print(sequence)

Batch generation

import torch

prompts = [
    "",
    "1MKT", # N-to-C generation
    "2MAV", # C-to-N generation
]

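# Decoder-only models need left padding so that every prompt ends where generation starts
inputs = tokenizer(
    prompts,
    return_tensors="pt",
    padding=True,
    padding_side="left",
).to(model.device)

with torch.no_grad():
    output_ids = model.generate(
        inputs["input_ids"],
        attention_mask=inputs["attention_mask"],  # ignore padding positions
        max_new_tokens=256,
        do_sample=True,
        temperature=0.8,
        top_p=0.9,
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.pad_token_id,
    )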

sequences = tokenizer.batch_decode(output_ids, skip_special_tokens=True)

for sequence in sequences:
    print(sequence)
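Because decoded outputs carry a leading directional token, a small post-processing step can separate the direction from the residues. The helper below is a sketch: the names `split_direction` and `to_n_to_c` are hypothetical, the "1"/"2" mapping is taken from the comments in the examples above, and the reversal of C-to-N outputs rests on the assumption that such sequences are emitted last residue first. Check this against the released tokenizer and code before relying on it.

```python
# Mapping taken from the comments in the examples above
DIRECTION_TOKENS = {"1": "N-to-C", "2": "C-to-N"}

def split_direction(decoded: str):
    """Return (direction, residues) for one decoded string.

    direction is None when no leading directional token is present.
    """
    # Some tokenizers insert spaces between tokens when decoding; drop them.
    decoded = decoded.replace(" ", "").strip()
    if decoded and decoded[0] in DIRECTION_TOKENS:
        return DIRECTION_TOKENS[decoded[0]], decoded[1:]
    return None, decoded

def to_n_to_c(decoded: str) -> str:
    """Return the residues in conventional N-to-C order."""
    direction, residues = split_direction(decoded)
    # ASSUMPTION: C-to-N outputs are written C-terminus first, so reverse them.
    return residues[::-1] if direction == "C-to-N" else residues
```

Applied to the batch above, `[to_n_to_c(s) for s in sequences]` would give every sequence in the same orientation before filtering or structure prediction.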

Model Architecture and Objective

ProtGPT3-10B is a decoder-only causal language model using a Mixtral-style sparse Mixture-of-Experts architecture. It was trained with a causal language modeling objective on protein sequences.
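In a Mixtral-style sparse Mixture-of-Experts layer, a router scores every expert for each token, only the top-k experts are run, and their outputs are mixed with softmax weights computed over the selected scores. The pure-Python sketch below illustrates that routing step for a single token. The function names are hypothetical, `k=2` follows the original Mixtral design, and the number of experts and the value of k in ProtGPT3 are not stated in this card.

```python
import math

def top_k_route(router_logits, k=2):
    """Return [(expert_index, weight), ...] for one token.

    Weights are a softmax over the k largest router logits and sum to 1.
    """
    # Indices of the k highest-scoring experts
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    # Softmax restricted to the selected logits (shifted by the max for stability)
    m = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - m) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

def moe_forward(x, router_logits, experts, k=2):
    """Weighted sum of the outputs of the selected experts for one token vector x."""
    out = [0.0] * len(x)
    for idx, weight in top_k_route(router_logits, k):
        y = experts[idx](x)  # only the chosen experts are evaluated
        out = [o + weight * yi for o, yi in zip(out, y)]
    return out

# Toy usage: four "experts" that simply scale their input
experts = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0, 4.0)]
mixed = moe_forward([1.0, -1.0], [0.0, 2.0, 1.0, -1.0], experts)
```

Because only k experts run per token, the compute per token is much smaller than the total parameter count suggests, which is the main appeal of sparse MoE models.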

Software

Training used FlashAttention-2, online mini-batch packing, Liger Kernel, and DeepSpeed.
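Mini-batch packing groups variable-length sequences so that each packed batch stays within a token budget and wastes little space on padding. The card does not describe the packing algorithm that was used, so the sketch below shows one generic option, greedy first-fit, applied to sequences as they arrive. The function name `pack_first_fit` and the `max_tokens` budget are illustrative assumptions.

```python
def pack_first_fit(lengths, max_tokens):
    """Group sequence indices into packs whose total length is <= max_tokens.

    Sequences are processed in arrival order (online) and placed into the
    first existing pack with enough free space, or into a new pack otherwise.
    """
    packs = []      # packs[p] holds the indices assigned to pack p
    remaining = []  # remaining[p] is the free token budget of pack p
    for idx, n in enumerate(lengths):
        if n > max_tokens:
            raise ValueError(f"sequence {idx} has {n} tokens, above the budget of {max_tokens}")
        for p, free in enumerate(remaining):
            if n <= free:
                packs[p].append(idx)
                remaining[p] -= n
                break
        else:
            # No existing pack had room: open a new one
            packs.append([idx])
            remaining.append(max_tokens - n)
    return packs

packs = pack_first_fit([300, 700, 200, 500, 512], max_tokens=1024)
```

In an actual training loop the sequences in one pack are concatenated, and attention is restricted so that tokens do not attend across sequence boundaries; variable-length attention kernels such as those in FlashAttention-2 are commonly used for that purpose.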

Citation

BibTeX:

@article{protgpt3,
  title={ProtGPT3: an Open-source family of Promptable and Aligned Protein Language Models},
  author={Anonymous Authors},
  year={2026}
}

More Information

All models and code are released through the Hugging Face ecosystem and an accompanying code repository.

Safetensors: model size 10B params, tensor type BF16