# GLM-5 REAP-50% Dynamic IQ2_M GGUF (2.72 BPW, ~121 GB)

Expert-pruned GLM-5 (744B -> ~372B parameters, 256 -> 128 routed experts), quantized to dynamic IQ2_M (2.72 BPW) with importance-matrix (imatrix) calibration. Fits in ~125-130 GB of VRAM with headroom for KV cache.

## Benchmark Results (Pilot, 10 samples/category)

| Category | Q3_K_M (170 GB) | UD-IQ2_M (121 GB) | UD-IQ2_XXS (97 GB) |
|---|---|---|---|
| Math (GSM8K) | 8/10 (80%) | 6/10 (60%) | 2/10 (20%) |
| Reasoning (BBH) | 8/10 (80%) | 7/10 (70%) | 4/10 (40%) |
| Coding (HumanEval) | 9/10 (90%) | 8/10 (80%) | 7/10 (70%) |
| Agentic (SWE-bench) | 10/10 (100%) | 10/10 (100%) | 10/10 (100%) |
| Terminal-bench | 9/10 (90%) | 9/10 (90%) | 10/10 (100%) |
| **Overall** | **44/50 (88%)** | **40/50 (80%)** | **33/50 (66%)** |

Fidelity to Q3_K_M: 91% (this quant passes 40 of the 44 samples Q3_K_M passes). Math and reasoning are the categories most affected by quantization.

## Model Details

| Property | Value |
|---|---|
| Base model | zai-org/GLM-5 (744B, 256 routed experts) |
| Pruning | REAP saliency pruning, 50% expert removal (256 -> 128 experts) |
| Quantization | Dynamic IQ2_M with imatrix (2.72 BPW) |
| Size | ~121 GB |
| Architecture | GlmMoeDsaForCausalLM (MLA + MoE + DSA) |
| Context | 202,752 tokens |
| Active params | ~20B per token (8 of 128 experts) |
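
To sanity-check these values against the file itself, the `gguf-dump` script that ships with llama.cpp's `gguf` Python package can print the header metadata. A minimal sketch; the exact metadata key names for this architecture are an assumption:

```bash
# Install the gguf package (from llama.cpp's gguf-py) and dump header
# metadata only. The expert-count and context-length keys should match
# the table above; key names follow llama.cpp conventions and are not
# verified against this specific architecture.
pip install gguf
gguf-dump --no-tensors ./model/GLM-5-REAP-50pct-UD-IQ2_M.gguf \
  | grep -Ei 'expert|context_length'
```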

## Dynamic Quantization Strategy

| Component | Quant Type | Rationale |
|---|---|---|
| `output.weight` | Q5_K | Critical for logit quality |
| `token_embd.weight` | Q4_K | Important for input representation |
| Dense FFN (first 3 layers) | Q5_K | No MoE; all traffic flows through |
| MLA attention projections | Q4_K | Coherence-critical |
| MLA key projection (`attn_k_b`) | Q5_K | Higher precision for keys |
| DSA indexer (`indexer.*`) | Q5_K | Critical for attention routing |
| Shared experts (`ffn_*_shexp`) | Q4_K-Q5_K | Always active |
| Last MoE layer (`blk.77`) | Q3_K | Final representation |
| Routed MoE experts (bulk) | IQ2_M | Main savings; imatrix-calibrated |
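
A per-tensor mix like this can be reproduced with llama.cpp's `llama-quantize`, which accepts an imatrix plus regex-based `--tensor-type` overrides on top of a default type. The patterns below are an illustrative sketch under those assumptions, not the exact recipe behind this upload:

```bash
# Illustrative sketch, not the exact recipe used for this repo. Assumes
# a recent llama.cpp build whose llama-quantize supports --imatrix and
# --tensor-type overrides (patterns match against tensor names).
./llama-quantize \
    --imatrix imatrix.dat \
    --tensor-type 'output\.weight=Q5_K' \
    --tensor-type 'token_embd\.weight=Q4_K' \
    --tensor-type 'attn_k_b=Q5_K' \
    --tensor-type 'indexer=Q5_K' \
    --tensor-type 'shexp=Q5_K' \
    --tensor-type 'blk\.77\.ffn.*exps=Q3_K' \
    GLM-5-REAP-50pct-BF16.gguf GLM-5-REAP-50pct-UD-IQ2_M.gguf IQ2_M
```

Tensors matching no override fall through to the default type given as the last argument (IQ2_M), which is where the routed experts land.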

## Usage

```bash
huggingface-cli download 0xSero/GLM-5-REAP-50pct-UD-IQ2_M-GGUF --local-dir ./model
```

```bash
./llama-server \
    --model ./model/GLM-5-REAP-50pct-UD-IQ2_M.gguf \
    --ctx-size 8192 \
    --n-gpu-layers 99 \
    --port 8080 \
    --reasoning-budget 2048
```

Requires ~125 GB of VRAM (weights + KV cache at 8K context). Fits on 2x H100 80GB or a single B200 192GB.
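
`llama-server` exposes an OpenAI-compatible HTTP API, so a quick smoke test needs only curl. The sampling parameters here are arbitrary placeholders, not tuned recommendations:

```bash
# Smoke test against llama-server's OpenAI-compatible chat endpoint.
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [{"role": "user", "content": "Write a one-line Python hello world."}],
        "max_tokens": 128,
        "temperature": 0.6
      }'
```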

## All Variants

| Variant | BPW | Size | Pass Rate | Repo |
|---|---|---|---|---|
| BF16 | 16.00 | 711 GB | N/A | BF16-GGUF |
| Q3_K_M | 3.82 | 170 GB | 88% | Q3_K_M-GGUF |
| UD-IQ2_M (this) | 2.72 | 121 GB | 80% | this repo |
| UD-IQ2_XXS | 2.19 | 97 GB | 66% | UD-IQ2_XXS-GGUF |