	# AlloGen Inference Guide

	This guide shows how to score binder designs and apply guidance with the bundled Q_θ checkpoint. The public release covers inference and guidance only; training is not included.

	> Env var. Throughout this doc, `${ALLOGEN_ROOT}` is the path to the cloned repo. Either `cd` into it and use relative paths, or `export ALLOGEN_ROOT=/path/to/AlloGen`.

	> Python. Use the env from `environment.yml` / `requirements.txt`. All scripts insert `code/` into `sys.path` via a `_CODE_DIR` boot block, so they work from any CWD.
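
When calling the code from your own Python scripts, the `${ALLOGEN_ROOT}` convention can be mirrored with a small path helper. This is a minimal sketch: `repo_path` is a hypothetical name, not part of the repo, and it assumes that falling back to the current directory is acceptable when the variable is unset.

```python
import os
from pathlib import Path


def repo_path(*parts: str) -> Path:
    """Join `parts` onto $ALLOGEN_ROOT, or onto the CWD if it is unset."""
    root = Path(os.environ.get("ALLOGEN_ROOT", "."))
    return root.joinpath(*parts)


# Example: locate the bundled Phase 2 checkpoint.
checkpoint = repo_path("checkpoints", "Q_theta_phase2.pt")
print(checkpoint)
```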

	---

	## 1. Checkpoint

	The Phase 2 weights `checkpoints/Q_theta_phase2.pt` are the v4-S2 target-swap split model used in the paper. Phase 1 (`Q_theta_phase1.pt`) is the DockQ regression intermediate.

	Pull via Git LFS:

	```bash
	git lfs install
	git lfs pull
	```
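
If `git lfs pull` was skipped, the `.pt` file on disk is a small text pointer rather than the real weights, and loading it will fail. The sketch below checks for that case by looking for the header line that Git LFS pointer files start with. `checkpoint_status` and `is_lfs_pointer` are hypothetical helper names, not part of the repo.

```python
from pathlib import Path

# Git LFS pointer files are small text stubs that begin with this line.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def is_lfs_pointer(path) -> bool:
    """Return True if `path` starts with the Git LFS pointer header."""
    with open(path, "rb") as fh:
        head = fh.read(len(LFS_POINTER_PREFIX))
    return head == LFS_POINTER_PREFIX


def checkpoint_status(path) -> str:
    """Classify a checkpoint path as 'missing', 'lfs-pointer', or 'ok'."""
    p = Path(path)
    if not p.is_file():
        return "missing"
    if is_lfs_pointer(p):
        return "lfs-pointer"  # run `git lfs pull` to fetch the real weights
    return "ok"


print(checkpoint_status("checkpoints/Q_theta_phase2.pt"))
```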

	---

	## 2. Score binders

	### 2a. Python API

	```python
	import sys
	sys.path.insert(0, 'code')  # relative path: run this from the repo root

	from models.differentiable_features import DifferentiableQTheta

	# Load the Phase 2 scorer.
	scorer = DifferentiableQTheta(
	    checkpoint='checkpoints/Q_theta_phase2.pt',
	    device='cuda:0',
	)

	# Register the holo and apo receptor structures.
	scorer.load_receptor(
	    holo_path='holo.pdb', rec_chain='A',
	    apo_path='apo.pdb', apo_chain='A',
	)

	# Score the same design against each state; S is the holo-minus-apo difference.
	q_holo = scorer.score('design.pdb', binder_chain='B', state='holo')
	q_apo = scorer.score('design.pdb', binder_chain='B', state='apo')
	print(f'S = {q_holo - q_apo:.3f}')
	```
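
To rank several designs by the same selectivity score S = Q_holo − Q_apo, the loop can be written against any scoring callable. This is a sketch under assumptions: `rank_by_selectivity` and `score_fn` are hypothetical names, `score_fn(path, state)` stands in for a call such as `scorer.score(path, binder_chain='B', state=state)`, and the return value is assumed to be convertible with `float()`. The numbers in the demo are placeholders, not real Q_θ outputs.

```python
from typing import Callable, Iterable, List, Tuple


def rank_by_selectivity(
    design_paths: Iterable[str],
    score_fn: Callable[[str, str], float],
) -> List[Tuple[str, float]]:
    """Return (path, S) pairs sorted by S = Q_holo - Q_apo, largest first."""
    ranked = []
    for path in design_paths:
        q_holo = float(score_fn(path, "holo"))
        q_apo = float(score_fn(path, "apo"))
        ranked.append((path, q_holo - q_apo))
    ranked.sort(key=lambda item: item[1], reverse=True)
    return ranked


# Demo with a stub scorer; replace with a wrapper around the real scorer.
_placeholder = {
    ("a.pdb", "holo"): 0.9, ("a.pdb", "apo"): 0.2,
    ("b.pdb", "holo"): 0.6, ("b.pdb", "apo"): 0.5,
    ("c.pdb", "holo"): 0.8, ("c.pdb", "apo"): 0.4,
}
for path, s in rank_by_selectivity(
    ["a.pdb", "b.pdb", "c.pdb"], lambda p, st: _placeholder[(p, st)]
):
    print(f"{path}\tS = {s:.3f}")
```

With the scorer from the snippet above, the callable would be something like `lambda p, st: scorer.score(p, binder_chain='B', state=st)`.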

	### 2b. CLI on the bundled sample

	```bash
	python code/scripts/evaluate.py \
	    --target cam \
	    --checkpoint checkpoints/Q_theta_phase2.pt \
	    --data_dir data/sample/ \
	    --outdir /tmp/cam_inference \
	    --no_wandb
	```

	This scores every binder in `data/sample/cam/test.pkl` and writes `tables/eval_cam_test.json`, which reports Spearman ρ, AUC, and the selectivity gap.
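
For readers who want to sanity-check these metrics on their own scores, here is a sketch of the textbook definitions of Spearman ρ (Pearson correlation of average ranks) and ROC AUC (rank-sum form). It is not the code in `evaluate.py`: the function names are illustrative, this guide does not state which quantities the script correlates or how it labels binders, and the selectivity gap is not reproduced because its definition is not given here. The demo values are made up.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Pearson correlation of the ranks; undefined if either input is constant."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5


def roc_auc(scores, labels):
    """Probability that a positive outranks a negative (ties count half)."""
    ranks = average_ranks(scores)
    pos_ranks = [r for r, lab in zip(ranks, labels) if lab]
    n_pos = len(pos_ranks)
    n_neg = len(labels) - n_pos
    return (sum(pos_ranks) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


# Made-up predictions, reference values, and binary labels.
pred = [0.1, 0.4, 0.35, 0.8]
ref = [0.2, 0.3, 0.5, 0.9]
labels = [0, 0, 1, 1]
print(f"rho = {spearman_rho(pred, ref):.3f}, AUC = {roc_auc(pred, labels):.3f}")
```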

	---

	## 3. Guidance methods (PXDesign)

	The shipped guidance code wraps PXDesign as the prior and uses Q_θ as the gradient / classifier signal.

	| Script | Method |
	|---|---|
	| `code/scripts/pxdesign_guidance/langevin_pxdesign.py` | Post-hoc Langevin refinement |
	| `code/scripts/pxdesign_guidance/smc_pxdesign.py` | Sequential Monte Carlo |
	| `code/scripts/pxdesign_guidance/tds_pxdesign.py` | Twisted Diffusion Sampler |
	| `code/scripts/pxdesign_guidance/guided_pxdesign.py` | Classifier guidance |
	| `code/scripts/pxdesign_guidance/iterative_refinement.py` | Iterative refinement loop |
	| `code/scripts/pxdesign_guidance/qtheta_pxdesign.py` | Q_θ wrapper used by the above |

	Common flags:

	- `--checkpoint checkpoints/Q_theta_phase2.pt`
	- `--holo_pdb your_holo.pdb` / `--apo_pdb your_apo.pdb`
	- `--output_dir designs/`
	- `--device cuda:0`
	- `--seed 42`

	Method-specific arguments (steps, batch sizes, guidance scales) are in each script's `argparse` block.
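
To launch one of these scripts from Python, the common flags listed above can be assembled into an argument list. This is a sketch: `build_guidance_cmd` is a hypothetical helper, only the flags named in this section are used, and method-specific arguments go in through `extra` after checking the script's `argparse` block.

```python
import shlex
import sys


def build_guidance_cmd(
    script,
    holo_pdb,
    apo_pdb,
    output_dir="designs/",
    checkpoint="checkpoints/Q_theta_phase2.pt",
    device="cuda:0",
    seed=42,
    extra=(),
):
    """Build an argv list for a guidance script using the common flags."""
    return [
        sys.executable, script,
        "--checkpoint", checkpoint,
        "--holo_pdb", holo_pdb,
        "--apo_pdb", apo_pdb,
        "--output_dir", output_dir,
        "--device", device,
        "--seed", str(seed),
        *extra,  # method-specific flags, taken from the script's argparse block
    ]


cmd = build_guidance_cmd(
    "code/scripts/pxdesign_guidance/smc_pxdesign.py",
    holo_pdb="your_holo.pdb",
    apo_pdb="your_apo.pdb",
)
print(shlex.join(cmd))  # inspect the command before running it
```

Running it is then a matter of `subprocess.run(cmd, check=True)` from the repo root.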

	To plug Q_θ into RFdiffusion, Proteina-ComplexA, or any other backbone prior, see `code/scripts/README.md`.

	---

	## 4. Bundled sample data

	`data/sample/cam/test.pkl` is the held-out test split for Calmodulin (CaM), small enough to run on a laptop CPU in under a minute. It is the only data shipped in the repo. To score your own targets, use the Python API in §2a, which takes raw PDBs as input.

	---

	## 5. Training reproduction

	Training data, training scripts, and per-target processed graphs are NOT shipped in this public release. The paper's main result (Phase 2 on the v4-S2 target-swap split) is provided as a frozen checkpoint at `checkpoints/Q_theta_phase2.pt`. Retraining requires the full pipeline, which must be requested separately.

	---

	## 6. Citation

	```bibtex
	@inproceedings{cao2026allogen,
	  title = {AlloGen: State-Selective Scoring for Allosteric Binder Design},
	  author = {Cao, Hanqun and others},
	  booktitle = {Advances in Neural Information Processing Systems (NeurIPS)},
	  year = {2026}
	}
	```