How to use Zigeng/DMax-Math-16B with Transformers:
# Use a pipeline as a high-level helper
from transformers import pipeline
pipe = pipeline("text-generation", model="Zigeng/DMax-Math-16B", trust_remote_code=True)
messages = [
{"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("Zigeng/DMax-Math-16B", trust_remote_code=True, dtype="auto")

How to use Zigeng/DMax-Math-16B with vLLM:
# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "Zigeng/DMax-Math-16B"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
-H "Content-Type: application/json" \
--data '{
"model": "Zigeng/DMax-Math-16B",
"messages": [
{
"role": "user",
"content": "What is the capital of France?"
}
]
}'
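The same OpenAI-compatible endpoint can also be called from Python. Below is a minimal sketch using only the standard library; it assumes the vLLM server above is running locally on port 8000, so the actual request is left commented out and the snippet only builds the payload:

```python
import json
import urllib.request

# Build the same request body as the curl example above.
payload = {
    "model": "Zigeng/DMax-Math-16B",
    "messages": [
        {"role": "user", "content": "What is the capital of France?"}
    ],
}
body = json.dumps(payload).encode("utf-8")
req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=body,
    headers={"Content-Type": "application/json"},
)
# Uncomment once the server is running:
# with urllib.request.urlopen(req) as resp:
#     reply = json.loads(resp.read())
#     print(reply["choices"][0]["message"]["content"])
```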
How to use Zigeng/DMax-Math-16B with SGLang:
# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
--model-path "Zigeng/DMax-Math-16B" \
--host 0.0.0.0 \
--port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
-H "Content-Type: application/json" \
--data '{
"model": "Zigeng/DMax-Math-16B",
"messages": [
{
"role": "user",
"content": "What is the capital of France?"
}
]
}'

# Or run the SGLang server via Docker:
docker run --gpus all \
--shm-size 32g \
-p 30000:30000 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--env "HF_TOKEN=<secret>" \
--ipc=host \
lmsysorg/sglang:latest \
python3 -m sglang.launch_server \
--model-path "Zigeng/DMax-Math-16B" \
--host 0.0.0.0 \
--port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
-H "Content-Type: application/json" \
--data '{
"model": "Zigeng/DMax-Math-16B",
"messages": [
{
"role": "user",
"content": "What is the capital of France?"
}
]
}'

How to use Zigeng/DMax-Math-16B with Docker Model Runner:
docker model run hf.co/Zigeng/DMax-Math-16B
This repository contains the weights for DMax-Math-16B, presented in the paper DMax: Aggressive Parallel Decoding for dLLMs.
DMax is a new paradigm for efficient diffusion language models (dLLMs) that mitigates error accumulation in parallel decoding, enabling aggressive decoding parallelism while preserving generation quality.
| Model | Description | Source Model | Link |
|---|---|---|---|
| 🤖 DMax-Math-16B | Highly parallel dLLM for math and reasoning. | LLaDA-2.0-mini | HF |
| 🤖 DMax-Coder-16B | Highly parallel dLLM for code generation. | LLaDA-2.0-mini | HF |
| Dataset | Description | Link |
|---|---|---|
| 📊 DMax-Math-Training-Data | Math trajectories generated by LLaDA-2.0-mini | HF |
| 📊 DMax-Code-Training-Data | Code trajectories generated by LLaDA-2.0-mini | HF |
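For intuition, the threshold-gated parallel decoding described above can be sketched as follows. This is an illustrative toy, not the paper's implementation: `fake_predict`, `VOCAB`, and the random confidence values are stand-ins for a real forward pass that scores every masked position at once; only positions whose confidence clears the threshold are committed, and the rest stay masked for the next step.

```python
import random

random.seed(0)
MASK = None  # placeholder for a masked position
VOCAB = ["the", "robe", "takes", "3", "bolts", "."]

def fake_predict(seq):
    """Stand-in for one model forward pass: propose a (token, confidence)
    pair for every currently masked position."""
    return {i: (random.choice(VOCAB), random.random())
            for i, t in enumerate(seq) if t is MASK}

def parallel_decode(length=8, threshold=0.5):
    seq = [MASK] * length
    nfe = 0  # number of forward passes (function evaluations)
    while MASK in seq:
        nfe += 1
        for i, (tok, conf) in fake_predict(seq).items():
            if conf >= threshold:  # commit only high-confidence positions
                seq[i] = tok
    return seq, nfe

tokens, nfe = parallel_decode()
print("decoded", len(tokens), "tokens in", nfe, "forward passes")
```

Raising the threshold trades parallelism for caution: fewer tokens are committed per pass, but each committed token is more reliable.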
import torch
from transformers import AutoModelForCausalLM
from transformers import AutoTokenizer
model = AutoModelForCausalLM.from_pretrained(
"Zigeng/DMax-Math-16B", trust_remote_code=True, device_map="cuda:0"
)
model = model.to(torch.bfloat16)
model.eval()
tokenizer = AutoTokenizer.from_pretrained("Zigeng/DMax-Math-16B", trust_remote_code=True)
prompt = "A robe takes 2 bolts of blue fiber and half that much white fiber. How many bolts in total does it take?" + "\nLet's think step by step\n"
input_ids = tokenizer.apply_chat_template(
[{"role": "user", "content": prompt}],
add_generation_prompt=True,
tokenize=True,
return_tensors="pt",
)
nfe, generated_tokens = model.generate_spd(
inputs=input_ids,
gen_length=2048,
block_length=32,
threshold=0.5,
)
generated_answer = tokenizer.decode(
generated_tokens[0],
skip_special_tokens=True,
)
print(generated_answer)
print("nfe:", nfe, "token length:", len(generated_tokens[0]))
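The `generate_spd` call returns both the decoded tokens and `nfe`, which presumably counts model forward passes (function evaluations). The ratio of generated tokens to NFE then gives the average decoding parallelism. A quick sketch with hypothetical numbers (not measured results):

```python
# Hypothetical numbers for illustration only.
gen_length = 512   # tokens produced
nfe = 64           # forward passes reported by generate_spd
parallelism = gen_length / nfe  # average tokens committed per forward pass
print(f"average parallelism: {parallelism:.1f} tokens per step")  # 8.0 tokens per step
```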
@article{chen2026dmax,
title={DMax: Aggressive Parallel Decoding for dLLMs},
author={Chen, Zigeng and Fang, Gongfan and Ma, Xinyin and Yu, Ruonan and Wang, Xinchao},
journal={arXiv preprint arXiv:2604.08302},
year={2026}
}
Base model
inclusionAI/LLaDA2.0-mini