# GLM-5 REAP-50% Dynamic IQ2_M GGUF (2.72 BPW, ~121 GB)
Expert-pruned GLM-5 (744B -> ~372B parameters, 256 -> 128 routed experts), quantized to dynamic IQ2_M (2.72 BPW) with importance-matrix (imatrix) calibration. Fits in ~125-130 GB of VRAM with room for KV cache.
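As a rough sanity check on the size (an approximation only; the mixed per-tensor types listed below make the average BPW inexact):

```bash
# file size ≈ total params × bits-per-weight / 8
echo "372 * 10^9 * 2.72 / 8 / 10^9" | bc -l   # ≈ 126 GB, in line with the ~121 GB file
```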
## Benchmark Results (Pilot, 10 samples/category)
| Category | Q3_K_M (170 GB) | UD-IQ2_M (121 GB) | UD-IQ2_XXS (97 GB) |
|---|---|---|---|
| Math (GSM8K) | 8/10 (80%) | 6/10 (60%) | 2/10 (20%) |
| Reasoning (BBH) | 8/10 (80%) | 7/10 (70%) | 4/10 (40%) |
| Coding (HumanEval) | 9/10 (90%) | 8/10 (80%) | 7/10 (70%) |
| Agentic (SWE-bench) | 10/10 (100%) | 10/10 (100%) | 10/10 (100%) |
| Terminal-bench | 9/10 (90%) | 9/10 (90%) | 10/10 (100%) |
| Overall | 44/50 (88%) | 40/50 (80%) | 33/50 (66%) |
Fidelity to Q3_K_M: 91% (40/44), i.e. this quant retains 40 of the Q3_K_M baseline's 44 passes. Math and reasoning are the categories most affected by quantization.
## Model Details
| Property | Value |
|---|---|
| Base model | zai-org/GLM-5 (744B, 256 routed experts) |
| Pruning | REAP saliency pruning, 50% expert removal (256 -> 128 experts) |
| Quantization | Dynamic IQ2_M with imatrix (2.72 BPW) |
| Size | ~121 GB |
| Architecture | GlmMoeDsaForCausalLM (MLA + MoE + DSA) |
| Context | 202,752 tokens |
| Active params | ~20B per token (8 of 128 experts) |
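To confirm the pruned expert count and context length recorded in the file header, you can dump the GGUF metadata. The sketch below uses the `gguf-dump` script from the `gguf` Python package; exact metadata key names vary by architecture, so the grep pattern is a loose filter rather than a guaranteed key:

```bash
pip install gguf
# print header metadata and keep lines mentioning experts or context length
gguf-dump ./model/GLM-5-REAP-50pct-UD-IQ2_M.gguf | grep -iE "expert|context"
```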
## Dynamic Quantization Strategy
| Component | Quant Type | Rationale |
|---|---|---|
| `output.weight` | Q5_K | Critical for logit quality |
| `token_embd.weight` | Q4_K | Important for input representation |
| Dense FFN (first 3 layers) | Q5_K | No MoE; all traffic flows through |
| MLA attention projections | Q4_K | Coherence-critical |
| MLA key projection (`attn_k_b`) | Q5_K | Higher precision for keys |
| DSA indexer (`indexer.*`) | Q5_K | Critical for attention routing |
| Shared experts (`ffn_*_shexp`) | Q4_K-Q5_K | Always active |
| Last MoE layer (`blk.77`) | Q3_K | Final representation |
| Routed MoE experts (bulk) | IQ2_M | Main savings; imatrix-calibrated |
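For reference, the general shape of this workflow in llama.cpp looks like the sketch below. This is not the exact recipe used for this repo: the calibration corpus, the file names, and the tensor-name patterns are illustrative assumptions, and per-tensor `--tensor-type` overrides require a recent llama.cpp build (check `llama-quantize --help` on your version).

```bash
# 1) Collect importance statistics over a calibration corpus
#    (calibration.txt is a hypothetical file name)
./llama-imatrix -m glm5-reap-bf16.gguf -f calibration.txt -o imatrix.dat

# 2) Quantize to IQ2_M with imatrix guidance and per-tensor overrides
#    (regex patterns below are illustrative, not the exact recipe)
./llama-quantize \
  --imatrix imatrix.dat \
  --output-tensor-type q5_k \
  --token-embedding-type q4_k \
  --tensor-type "indexer=q5_k" \
  --tensor-type "shexp=q4_k" \
  glm5-reap-bf16.gguf glm5-reap-ud-iq2_m.gguf iq2_m
```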
## Usage
```bash
huggingface-cli download 0xSero/GLM-5-REAP-50pct-UD-IQ2_M-GGUF --local-dir ./model
```

```bash
./llama-server \
  --model ./model/GLM-5-REAP-50pct-UD-IQ2_M.gguf \
  --ctx-size 8192 \
  --n-gpu-layers 99 \
  --port 8080 \
  --reasoning-budget 2048
```
Requires ~125 GB of VRAM (model weights + KV cache at 8K context). Fits on 2x H100 80 GB or 1x B200 192 GB.
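Once the server is up, it exposes an OpenAI-compatible endpoint. A minimal smoke test (port and payload values are assumptions matching the launch command above):

```bash
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Write a one-line Python hello world."}],
    "max_tokens": 128
  }'
```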
## All Variants
| Variant | BPW | Size | Pilot Score | Repo |
|---|---|---|---|---|
| BF16 | 16.00 | 711 GB | N/A | BF16-GGUF |
| Q3_K_M | 3.82 | 170 GB | 88% | Q3_K_M-GGUF |
| UD-IQ2_M (this) | 2.72 | 121 GB | 80% | this repo |
| UD-IQ2_XXS | 2.19 | 97 GB | 66% | UD-IQ2_XXS-GGUF |