# AlloGen Inference Guide
This guide covers how to score binder designs and apply guidance with the bundled Q_θ checkpoint. Training is not part of the public release; only inference and guidance are included.
> **Env var.** Throughout this doc, `${ALLOGEN_ROOT}` is the path to the cloned repo. Either `cd` into it and use relative paths, or `export ALLOGEN_ROOT=/path/to/AlloGen`.
> **Python.** Use the environment built from `environment.yml` / `requirements.txt`. Every script prepends `code/` to `sys.path` through a `_CODE_DIR` boot block, so it runs from any working directory.
---
## 1. Checkpoint
The Phase 2 weights (`checkpoints/Q_theta_phase2.pt`) are the model for the **v4-S2 target-swap split** used in the paper. The Phase 1 weights (`Q_theta_phase1.pt`) are the intermediate DockQ-regression checkpoint.
Pull via Git LFS:
```bash
git lfs install
git lfs pull
```
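If `git lfs pull` was skipped, the checkpoint on disk is a small text pointer rather than real weights, and loading it fails. The check below relies on the fact that Git LFS pointer files begin with the line `version https://git-lfs.github.com/spec/v1`. The function name `is_lfs_pointer` is illustrative and not part of AlloGen.

```python
from pathlib import Path

# First bytes of every Git LFS pointer file.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def is_lfs_pointer(path) -> bool:
    """Return True if `path` is an un-pulled Git LFS pointer, not real data."""
    with open(path, "rb") as fh:
        head = fh.read(len(LFS_POINTER_PREFIX))
    return head == LFS_POINTER_PREFIX


checkpoint = Path("checkpoints/Q_theta_phase2.pt")
if not checkpoint.exists():
    print(f"{checkpoint}: missing (run from the repo root?)")
elif is_lfs_pointer(checkpoint):
    print(f"{checkpoint}: still an LFS pointer, run `git lfs pull`")
else:
    size_mb = checkpoint.stat().st_size / 1e6
    print(f"{checkpoint}: looks like real weights ({size_mb:.1f} MB)")
```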
---
## 2. Score binders
### 2a. Python API
```python
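import sys

# Make the bundled `code/` package importable (path is relative to the repo root).
sys.path.insert(0, 'code')

from models.differentiable_features import DifferentiableQTheta

# Load the frozen Phase 2 checkpoint onto the first GPU.
scorer = DifferentiableQTheta(
    checkpoint='checkpoints/Q_theta_phase2.pt',
    device='cuda:0',
)

# Register both receptor conformations.
scorer.load_receptor(
    holo_path='holo.pdb', rec_chain='A',
    apo_path='apo.pdb', apo_chain='A',
)

# Score the same design against each state; S is the holo-minus-apo difference.
q_holo = scorer.score('design.pdb', binder_chain='B', state='holo')
q_apo = scorer.score('design.pdb', binder_chain='B', state='apo')
print(f'S = {q_holo - q_apo:.3f}')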
```
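To compare several designs, the two `score` calls above can be wrapped in a loop. This sketch assumes only the `score(path, binder_chain=..., state=...)` call shown above, and that its return value converts to `float`. `rank_by_selectivity` is a hypothetical helper, not part of the AlloGen API. `_StubScorer` returns made-up numbers so the snippet runs without a GPU or checkpoint.

```python
def rank_by_selectivity(scorer, design_paths, binder_chain="B"):
    """Return (path, S) pairs sorted by S = q_holo - q_apo, largest first."""
    rows = []
    for path in design_paths:
        q_holo = float(scorer.score(path, binder_chain=binder_chain, state="holo"))
        q_apo = float(scorer.score(path, binder_chain=binder_chain, state="apo"))
        rows.append((path, q_holo - q_apo))
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows


class _StubScorer:
    """Stand-in for DifferentiableQTheta; the scores below are invented."""

    _table = {
        ("a.pdb", "holo"): 0.9, ("a.pdb", "apo"): 0.2,
        ("b.pdb", "holo"): 0.5, ("b.pdb", "apo"): 0.6,
    }

    def score(self, path, binder_chain, state):
        return self._table[(path, state)]


ranking = rank_by_selectivity(_StubScorer(), ["b.pdb", "a.pdb"])
for path, s in ranking:
    print(f"{path}\tS = {s:.3f}")
```

With the real scorer, pass the `scorer` object from the example above (after `load_receptor`) in place of `_StubScorer()`.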
### 2b. CLI on the bundled sample
```bash
python code/scripts/evaluate.py \
    --target cam \
    --checkpoint checkpoints/Q_theta_phase2.pt \
    --data_dir data/sample/ \
    --outdir /tmp/cam_inference \
    --no_wandb
```
This scores every binder in `data/sample/cam/test.pkl` and writes `tables/eval_cam_test.json`, which reports Spearman ρ, AUC, and the selectivity gap.
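The JSON report can be inspected without knowing its exact schema. The sketch below makes two assumptions: that `tables/` is created under the `--outdir` given above, and that the file holds (possibly nested) key/value pairs. It prints whatever keys are present rather than assuming metric names.

```python
import json
from pathlib import Path


def flatten(obj, prefix=""):
    """Yield (dotted_key, value) pairs from nested dicts."""
    if isinstance(obj, dict):
        for key, value in obj.items():
            yield from flatten(value, f"{prefix}{key}.")
    else:
        yield prefix.rstrip("."), obj


# Assumed location: <outdir>/tables/eval_cam_test.json
report = Path("/tmp/cam_inference/tables/eval_cam_test.json")
if report.exists():
    with open(report) as fh:
        metrics = json.load(fh)
    for key, value in flatten(metrics):
        print(f"{key}: {value}")
else:
    print(f"{report} not found; run the evaluation command first")
```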
---
## 3. Guidance methods (PXDesign)
The shipped guidance code wraps **PXDesign** as the prior and uses Q_θ as the gradient / classifier signal.
| Script | Method |
|---|---|
| `code/scripts/pxdesign_guidance/langevin_pxdesign.py` | Post-hoc Langevin refinement |
| `code/scripts/pxdesign_guidance/smc_pxdesign.py` | Sequential Monte Carlo |
| `code/scripts/pxdesign_guidance/tds_pxdesign.py` | Twisted Diffusion Sampler |
| `code/scripts/pxdesign_guidance/guided_pxdesign.py` | Classifier guidance |
| `code/scripts/pxdesign_guidance/iterative_refinement.py` | Iterative refinement loop |
| `code/scripts/pxdesign_guidance/qtheta_pxdesign.py` | Q_θ wrapper used by the above |
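To give a feel for what using a score as a gradient signal means, here is a toy, one-dimensional Langevin update driven by the gradient of a stand-in score function. It is a generic illustration of the technique, not the code in `langevin_pxdesign.py`. The quadratic `toy_score`, the step size, and the temperature are all invented for the example.

```python
import math
import random


def toy_score(x: float) -> float:
    """Stand-in for a differentiable score; peaks at x = 3."""
    return -(x - 3.0) ** 2


def toy_score_grad(x: float) -> float:
    """Analytic gradient of toy_score."""
    return -2.0 * (x - 3.0)


def langevin_refine(x0, grad_fn, steps=200, step_size=0.05,
                    temperature=0.01, seed=42):
    """Run x <- x + eta * grad S(x) + sqrt(2 * eta * T) * noise for `steps` steps."""
    rng = random.Random(seed)
    noise_scale = math.sqrt(2.0 * step_size * temperature)
    x = x0
    for _ in range(steps):
        x = x + step_size * grad_fn(x) + noise_scale * rng.gauss(0.0, 1.0)
    return x


x_start = 0.0
x_refined = langevin_refine(x_start, toy_score_grad)
print(f"score before: {toy_score(x_start):.3f}, after: {toy_score(x_refined):.3f}")
```

The gradient term pulls the sample toward higher score while the noise term keeps it from collapsing onto a single point. In the shipped scripts, the role of `toy_score` is played by Q_θ and the sample is a binder structure rather than a scalar.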
Common flags:
- `--checkpoint checkpoints/Q_theta_phase2.pt`
- `--holo_pdb your_holo.pdb` / `--apo_pdb your_apo.pdb`
- `--output_dir designs/`
- `--device cuda:0`
- `--seed 42`
Method-specific arguments (steps, batch sizes, guidance scales) are in each script's `argparse` block.
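A small launcher can keep the common flags consistent across methods. The sketch below uses only the flags listed above. `build_guidance_command` is a hypothetical helper, and anything method-specific has to be passed through `extra_args` after checking the script's own `argparse` block.

```python
import shlex
import sys


def build_guidance_command(script, holo_pdb, apo_pdb, output_dir,
                           checkpoint="checkpoints/Q_theta_phase2.pt",
                           device="cuda:0", seed=42, extra_args=()):
    """Assemble an argv list for one guidance script using the common flags."""
    return [
        sys.executable, script,
        "--checkpoint", checkpoint,
        "--holo_pdb", holo_pdb,
        "--apo_pdb", apo_pdb,
        "--output_dir", output_dir,
        "--device", device,
        "--seed", str(seed),
        *extra_args,
    ]


cmd = build_guidance_command(
    "code/scripts/pxdesign_guidance/langevin_pxdesign.py",
    holo_pdb="your_holo.pdb",
    apo_pdb="your_apo.pdb",
    output_dir="designs/",
)
print(shlex.join(cmd))
# To launch for real: import subprocess; subprocess.run(cmd, check=True)
```

Swapping the `script` argument for another row of the table reuses the same receptor, checkpoint, and seed settings.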
To plug Q_θ into RFdiffusion, Proteina-ComplexA, or any other backbone prior, see `code/scripts/README.md`.
---
## 4. Bundled sample data
`data/sample/cam/test.pkl` is the held-out test split for calmodulin (CaM), small enough to run on a laptop CPU in under a minute. **It is the only data shipped in the repo.** To score your own targets, use the Python API in §2a, which takes raw PDB files as input.
---
## 5. Training reproduction
Training data, training scripts, and per-target processed graphs are NOT shipped in this public release. The paper's main result (Phase 2 on the **v4-S2 target-swap** split) is provided as a frozen checkpoint at `checkpoints/Q_theta_phase2.pt`. Retraining requires the full pipeline, which must be requested separately.
---
## 6. Citation
```bibtex
@inproceedings{cao2026allogen,
  title     = {AlloGen: State-Selective Scoring for Allosteric Binder Design},
  author    = {Cao, Hanqun and others},
  booktitle = {Advances in Neural Information Processing Systems (NeurIPS)},
  year      = {2026}
}
```