# Kimi-K2.5-PRISM-REAP-72
WEIGHTS ARE BROKEN! DO NOT USE THIS FOR PRODUCTION SYSTEMS. I AM KEEPING IT FOR RESEARCH'S SAKE; IT IS AN ~80% PRUNE OF AN INT4-QUANTIZED MODEL.
An 81% REAP expert-pruned version of moonshotai/Kimi-K2.5, further pruned from the PRISM-REAP 192-expert variant. Designed to fit on 8x RTX 3090 (24 GB) consumer GPUs.
| Property | Value |
|---|---|
| Architecture | KimiK25 (DeepSeekV3 backbone, MLA attention) |
| Total Parameters | ~200B (down from ~1T) |
| Active Parameters | ~32B (8 experts per token) |
| Experts per MoE Layer | 72 routed + 1 shared (down from 384 + 1) |
| MoE Layers | 60 (layers 1-60, layer 0 is dense) |
| Hidden Size | 7168 |
| Attention | MLA (kv_lora_rank=512, q_lora_rank=1536) |
| Quantization | W4A16 (group_size=32, symmetric) via compressed-tensors |
| Disk Size | 122 GB (down from 289 GB / 555 GB original) |
| Pruning Method | REAP (Router-weighted Expert Activation Pruning) |
| Vision | Supported (inherited from Kimi-K2.5) |
## Why 72 Experts?
72 was chosen because:
- Divisible by 8: Clean sharding across 8 GPUs for TP/EP
- ~122 GB total: Fits in 8x 24GB with room for KV cache
- ~15 GB/GPU weight footprint with Expert Parallelism, leaving ~7 GB for KV cache and overhead (see the quick check after this list)
- Retains the top 72 most salient experts per layer from the original 384
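A quick back-of-the-envelope check of those numbers (a rough sketch only; the usable KV-cache budget is smaller once the CUDA context, activations, and allocator overhead are subtracted):

```python
# Rough per-GPU budget check for the figures above.
TOTAL_WEIGHTS_GB = 122   # on-disk W4A16 checkpoint size
NUM_GPUS = 8             # 8x RTX 3090
GPU_VRAM_GB = 24

weights_per_gpu = TOTAL_WEIGHTS_GB / NUM_GPUS      # ~15.25 GB with expert parallelism
headroom_per_gpu = GPU_VRAM_GB - weights_per_gpu   # ~8.75 GB shared by KV cache,
                                                   # activations, and runtime overhead
print(f"weights/GPU: {weights_per_gpu:.2f} GB, headroom: {headroom_per_gpu:.2f} GB")
```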
## Performance (8x RTX 3090, 155W, vLLM 0.15.1)
| Metric | Value |
|---|---|
| Single request | 33.4 tok/s |
| 2 concurrent | 52.5 tok/s |
| 4 concurrent | 86.2 tok/s |
| 8 concurrent | 145.5 tok/s |
| TTFT | 0.08s |
| Max context | 57,344 tokens |
| Vision | Working |
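For reference, a minimal sketch of how throughput numbers like these can be measured against the OpenAI-compatible endpoint exposed by the launch command in the next section (the port, prompt, and token counts are assumptions, and the wall-clock tok/s here lumps prefill and decode together):

```python
import asyncio
import time

from openai import AsyncOpenAI  # pip install openai

# Assumes the vLLM server from the launch command below is running on the default port 8000.
client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
MODEL = "0xsero/Kimi-K2.5-PRISM-REAP-72"

async def one_request() -> int:
    resp = await client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": "Explain expert pruning in MoE models."}],
        max_tokens=256,
        temperature=0.6,
    )
    return resp.usage.completion_tokens

async def measure(concurrency: int) -> None:
    start = time.perf_counter()
    tokens = await asyncio.gather(*(one_request() for _ in range(concurrency)))
    elapsed = time.perf_counter() - start
    print(f"{concurrency} concurrent: {sum(tokens) / elapsed:.1f} tok/s")

asyncio.run(measure(4))
```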
## Recommended vLLM Launch (8x RTX 3090)
```bash
VLLM_ATTENTION_BACKEND=TRITON_MLA vllm serve 0xsero/Kimi-K2.5-PRISM-REAP-72 \
  --tensor-parallel-size 8 \
  --enable-expert-parallel \
  --max-model-len 57344 \
  --gpu-memory-utilization 0.95 \
  --max-num-seqs 16 \
  --trust-remote-code \
  --tool-call-parser kimi_k2 \
  --reasoning-parser kimi_k2 \
  --enable-auto-tool-choice \
  --enable-chunked-prefill \
  --enable-prefix-caching
```
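Once the server is up it speaks the standard OpenAI-compatible API, so any OpenAI client works. A minimal sketch (the port and sampling settings here are assumptions):

```python
from openai import OpenAI  # pip install openai

# vLLM exposes an OpenAI-compatible endpoint, by default on port 8000.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="0xsero/Kimi-K2.5-PRISM-REAP-72",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
    max_tokens=256,
    temperature=0.6,
)
print(response.choices[0].message.content)
```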
## Transformers Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load the pruned, W4A16-quantized checkpoint across the available GPUs.
model = AutoModelForCausalLM.from_pretrained(
    "0xsero/Kimi-K2.5-PRISM-REAP-72",
    trust_remote_code=True,
    dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(
    "0xsero/Kimi-K2.5-PRISM-REAP-72",
    trust_remote_code=True,
)

messages = [{"role": "user", "content": "What is the capital of France?"}]
# `thinking=False` is passed through to the Kimi chat template to disable the reasoning channel.
inputs = tokenizer.apply_chat_template(messages, return_tensors="pt", thinking=False)
inputs = inputs.to(model.device)

outputs = model.generate(inputs, max_new_tokens=512, temperature=0.6, do_sample=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
## Pruning Details
This model was created in a two-stage process:
- Stage 1 (Ex0bit): REAP pruning of original 384 experts to 192 experts using saliency scores from 512 calibration samples on allenai/tulu-3-sft-mixture
- Stage 2 (this model): Further pruning from 192 to 72 experts using the same REAP saliency scores, targeting consumer GPU deployment
## Key Technical Details
- Per-layer top-72 selection: The 72 most salient experts are retained independently per layer
- Gate weight slicing: Router gate weights are sliced from `[192, 7168]` to `[72, 7168]`, and `e_score_correction_bias` from `[192]` to `[72]` (sketched after this list)
- Contiguous expert remapping: Expert indices are remapped to 0-71 within each layer
- All non-expert weights preserved: Attention (MLA), shared expert, embeddings, and LM head are unchanged
- Saliency ordering verified: In every layer, `min(retained_saliency) > max(pruned_saliency)` for the selected top 72
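A minimal sketch of what the slicing step looks like for a single MoE layer, assuming per-layer saliency scores are already computed (the tensor names and the `experts` list are illustrative, not the actual conversion script):

```python
import torch

def prune_layer_experts(gate_weight: torch.Tensor,
                        e_score_correction_bias: torch.Tensor,
                        experts: list,
                        saliency: torch.Tensor,
                        keep: int = 72):
    """Keep the `keep` most salient experts of one MoE layer.

    gate_weight: [num_experts, hidden] router projection, e.g. [192, 7168]
    e_score_correction_bias: [num_experts] routing bias, e.g. [192]
    experts: list of per-expert modules (length num_experts)
    saliency: [num_experts] REAP scores for this layer
    """
    # Top-`keep` experts by saliency, then sort so the kept indices stay ascending.
    kept = torch.topk(saliency, keep).indices.sort().values

    # Sanity check mirroring the card: every retained expert outscores every pruned one.
    pruned_mask = torch.ones_like(saliency, dtype=torch.bool)
    pruned_mask[kept] = False
    assert saliency[kept].min() > saliency[pruned_mask].max()

    # Slice router tensors ([192, 7168] -> [72, 7168], [192] -> [72]) and keep the
    # surviving expert modules; their new positions 0..71 are the contiguous remapping.
    new_gate = gate_weight[kept].clone()
    new_bias = e_score_correction_bias[kept].clone()
    new_experts = [experts[i] for i in kept.tolist()]
    return new_gate, new_bias, new_experts
```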
## What is REAP?
REAP (Cerebras Research, 2025) is a one-shot expert pruning method for MoE models:
$$S_j = \frac{1}{|X_j|} \sum_{x \in X_j} g_j(x)\,\lVert f_j(x) \rVert_2$$
Where g_j(x) is the normalized gate weight and ||f_j(x)||_2 is the L2 norm of expert j's output.
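In code, the score can be accumulated over a calibration set roughly as follows (a sketch under the definition above; how gate weights and expert outputs are actually captured from the model is implementation-specific and not shown):

```python
import torch

def reap_saliency(gate_weights: torch.Tensor, expert_outputs: torch.Tensor) -> torch.Tensor:
    """REAP saliency per expert for one MoE layer.

    gate_weights:   [num_tokens, num_experts] normalized router weights; zero
                    where an expert was not routed for that token.
    expert_outputs: [num_tokens, num_experts, hidden] expert outputs f_j(x)
                    (rows for unrouted experts may be zero; their gate weight is zero).
    Returns: [num_experts] scores S_j, averaged over the tokens routed to each expert.
    """
    norms = expert_outputs.norm(dim=-1)                  # ||f_j(x)||_2 -> [tokens, experts]
    weighted = gate_weights * norms                      # g_j(x) * ||f_j(x)||_2
    routed = (gate_weights > 0).sum(dim=0).clamp(min=1)  # |X_j| per expert
    return weighted.sum(dim=0) / routed                  # S_j
```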
## What is PRISM?
The base model was treated with the PRISM-LITE pipeline, softening over-refusal and bias behaviors while preserving model quality.
## Optimization Notes
- Expert Parallelism (EP) is critical on PCIe GPUs -- reduces per-GPU model memory from ~23 GB to ~17 GB
- TRITON_MLA is the only MLA backend available on Ampere (CC 8.6)
- FP8 KV cache is not supported with MLA on Ampere; MLA's built-in KV compression (kv_lora_rank=512) already provides ~14x efficiency vs standard MHA
- Uniform GPU power limits prevent synchronization stalls in TP/EP configurations
## Citation

```bibtex
@article{reap2025,
  title={REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression},
  author={Cerebras Research},
  journal={arXiv preprint arXiv:2510.13999},
  year={2025}
}
```
## Acknowledgments
- moonshotai/Kimi-K2.5 -- base model
- Ex0bit/Kimi-K2.5-PRISM-REAP-530B-A32B -- 192-expert intermediate
- Cerebras REAP -- pruning method
- PRISM -- over-refusal and bias mitigation
## Support
If this work is useful, support Sybil Solutions here: https://donate.sybilsolutions.ai
- X: https://x.com/0xsero
- GitHub: https://github.com/0xsero