# GLM-5 REAP-50% Dynamic IQ2_M GGUF (2.72 BPW, ~121 GB)
Expert-pruned GLM-5 (744B -> ~372B parameters, 256 -> 128 routed experts), quantized to dynamic IQ2_M (2.72 BPW) with importance-matrix (imatrix) calibration. Fits in ~125-130 GB of VRAM with room for KV cache.
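As a rough sanity check on the size (an approximation only; the mixed per-tensor types listed below make the average BPW inexact):

```bash
# file size ≈ total params × bits-per-weight / 8
echo "372 * 10^9 * 2.72 / 8 / 10^9" | bc -l   # ≈ 126 GB, in line with the ~121 GB file
```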
## Benchmark Results (Pilot, 10 samples/category)
| Category | Q3_K_M (170 GB) | UD-IQ2_M (121 GB) | UD-IQ2_XXS (97 GB) |
|---|---|---|---|
| Math (GSM8K) | 8/10 (80%) | 6/10 (60%) | 2/10 (20%) |
| Reasoning (BBH) | 8/10 (80%) | 7/10 (70%) | 4/10 (40%) |
| Coding (HumanEval) | 9/10 (90%) | 8/10 (80%) | 7/10 (70%) |
| Agentic (SWE-bench) | 10/10 (100%) | 10/10 (100%) | 10/10 (100%) |
| Terminal-bench | 9/10 (90%) | 9/10 (90%) | 10/10 (100%) |
| Overall | 44/50 (88%) | 40/50 (80%) | 33/50 (66%) |
Fidelity to Q3_K_M: 91% (40/44), i.e. this quant retains 40 of the Q3_K_M baseline's 44 passes. Math and reasoning are the categories most affected by quantization.
## Model Details
| Property | Value |
|---|---|
| Base model | zai-org/GLM-5 (744B, 256 routed experts) |
| Pruning | REAP saliency pruning, 50% expert removal (256 -> 128 experts) |
| Quantization | Dynamic IQ2_M with imatrix (2.72 BPW) |
| Size | ~121 GB |
| Architecture | GlmMoeDsaForCausalLM (MLA + MoE + DSA) |
| Context | 202,752 tokens |
| Active params | ~20B per token (8 of 128 experts) |
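To confirm the pruned expert count and context length recorded in the file header, you can dump the GGUF metadata. The sketch below uses the `gguf-dump` script from the `gguf` Python package; exact metadata key names vary by architecture, so the grep pattern is a loose filter rather than a guaranteed key:

```bash
pip install gguf
# print header metadata and keep lines mentioning experts or context length
gguf-dump ./model/GLM-5-REAP-50pct-UD-IQ2_M.gguf | grep -iE "expert|context"
```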
## Dynamic Quantization Strategy
| Component | Quant Type | Rationale |
|---|---|---|
| `output.weight` | Q5_K | Critical for logit quality |
| `token_embd.weight` | Q4_K | Important for input representation |
| Dense FFN (first 3 layers) | Q5_K | No MoE; all traffic flows through |
| MLA attention projections | Q4_K | Coherence-critical |
| MLA key projection (`attn_k_b`) | Q5_K | Higher precision for keys |
| DSA indexer (`indexer.*`) | Q5_K | Critical for attention routing |
| Shared experts (`ffn_*_shexp`) | Q4_K-Q5_K | Always active |
| Last MoE layer (`blk.77`) | Q3_K | Final representation |
| Routed MoE experts (bulk) | IQ2_M | Main savings; imatrix-calibrated |
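For reference, the general shape of this workflow in llama.cpp looks like the sketch below. This is not the exact recipe used for this repo: the calibration corpus, the file names, and the tensor-name patterns are illustrative assumptions, and per-tensor `--tensor-type` overrides require a recent llama.cpp build (check `llama-quantize --help` on your version).

```bash
# 1) Collect importance statistics over a calibration corpus
#    (calibration.txt is a hypothetical file name)
./llama-imatrix -m glm5-reap-bf16.gguf -f calibration.txt -o imatrix.dat

# 2) Quantize to IQ2_M with imatrix guidance and per-tensor overrides
#    (regex patterns below are illustrative, not the exact recipe)
./llama-quantize \
  --imatrix imatrix.dat \
  --output-tensor-type q5_k \
  --token-embedding-type q4_k \
  --tensor-type "indexer=q5_k" \
  --tensor-type "shexp=q4_k" \
  glm5-reap-bf16.gguf glm5-reap-ud-iq2_m.gguf iq2_m
```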
## Usage
```bash
huggingface-cli download 0xSero/GLM-5-REAP-50pct-UD-IQ2_M-GGUF --local-dir ./model
```

```bash
./llama-server \
  --model ./model/GLM-5-REAP-50pct-UD-IQ2_M.gguf \
  --ctx-size 8192 \
  --n-gpu-layers 99 \
  --port 8080 \
  --reasoning-budget 2048
```
Requires ~125 GB of VRAM (model weights + KV cache at 8K context). Fits on 2x H100 80 GB or 1x B200 192 GB.
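Once the server is up, it exposes an OpenAI-compatible endpoint. A minimal smoke test (port and payload values are assumptions matching the launch command above):

```bash
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Write a one-line Python hello world."}],
    "max_tokens": 128
  }'
```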
## All Variants
| Variant | BPW | Size | Pilot Score | Repo |
|---|---|---|---|---|
| BF16 | 16.00 | 711 GB | N/A | BF16-GGUF |
| Q3_K_M | 3.82 | 170 GB | 88% | Q3_K_M-GGUF |
| UD-IQ2_M (this) | 2.72 | 121 GB | 80% | this repo |
| UD-IQ2_XXS | 2.19 | 97 GB | 66% | UD-IQ2_XXS-GGUF |