Kimi-K2.5-PRISM-REAP-72

WEIGHTS ARE BROKEN! DO NOT USE THIS FOR PRODUCTION SYSTEMS. It is kept for research purposes only: an ~80% prune of an INT4-quantized model.

81% REAP expert-pruned version of moonshotai/Kimi-K2.5, further pruned from the PRISM-REAP 192-expert variant. Designed to fit on 8x RTX 3090 (24GB) consumer GPUs.

| Property | Value |
|----------|-------|
| Architecture | KimiK25 (DeepSeekV3 backbone, MLA attention) |
| Total Parameters | ~200B (down from ~1T) |
| Active Parameters | ~32B (8 experts per token) |
| Experts per MoE Layer | 72 routed + 1 shared (down from 384 + 1) |
| MoE Layers | 60 (layers 1-60; layer 0 is dense) |
| Hidden Size | 7168 |
| Attention | MLA (kv_lora_rank=512, q_lora_rank=1536) |
| Quantization | W4A16 (group_size=32, symmetric) via compressed-tensors |
| Disk Size | 122 GB (down from 289 GB / 555 GB original) |
| Pruning Method | REAP (Router-weighted Expert Activation Pruning) |
| Vision | Supported (inherited from Kimi-K2.5) |

Why 72 Experts?

72 was chosen because:

  • Divisible by 8: Clean sharding across 8 GPUs for TP/EP
  • ~122 GB total: Fits in 8x 24GB with room for KV cache
  • ~15 GB/GPU weight footprint with Expert Parallelism, leaving ~7 GB for KV cache and overhead
  • Retains the top 72 most salient experts per layer from the original 384
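
The per-GPU figures above follow directly from the checkpoint size; a quick sanity check using only the numbers in this card:

```python
# Back-of-envelope check of the 8x 24 GB fit, using the model card's
# figures (not measured values).
disk_size_gb = 122        # W4A16 checkpoint size on disk
num_gpus = 8
vram_per_gpu_gb = 24

weights_per_gpu = disk_size_gb / num_gpus             # 15.25 GB with EP
headroom_per_gpu = vram_per_gpu_gb - weights_per_gpu  # 8.75 GB; KV cache
                                                      # plus runtime overhead
                                                      # must fit here
print(f"weights/GPU: {weights_per_gpu:.2f} GB, headroom: {headroom_per_gpu:.2f} GB")
```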

Performance (8x RTX 3090, 155W, vLLM 0.15.1)

| Metric | Value |
|--------|-------|
| Single request | 33.4 tok/s |
| 2 concurrent | 52.5 tok/s |
| 4 concurrent | 86.2 tok/s |
| 8 concurrent | 145.5 tok/s |
| TTFT (time to first token) | 0.08 s |
| Max context | 57,344 tokens |
| Vision | Working |

Recommended vLLM Launch (8x RTX 3090)

```shell
VLLM_ATTENTION_BACKEND=TRITON_MLA vllm serve 0xsero/Kimi-K2.5-PRISM-REAP-72 \
  --tensor-parallel-size 8 \
  --enable-expert-parallel \
  --max-model-len 57344 \
  --gpu-memory-utilization 0.95 \
  --max-num-seqs 16 \
  --trust-remote-code \
  --tool-call-parser kimi_k2 \
  --reasoning-parser kimi_k2 \
  --enable-auto-tool-choice \
  --enable-chunked-prefill \
  --enable-prefix-caching
```
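
Once the server is up, requests go to its OpenAI-compatible chat endpoint (vLLM's default is `http://localhost:8000/v1/chat/completions`; adjust host/port if you changed them). A minimal sketch of the request body, which works the same with `curl`, `requests`, or the `openai` client pointed at that base URL:

```python
# Build the JSON body for a chat request to the vLLM OpenAI-compatible
# endpoint. Sampling parameters mirror the transformers example below.
import json

payload = {
    "model": "0xsero/Kimi-K2.5-PRISM-REAP-72",
    "messages": [{"role": "user", "content": "What is the capital of France?"}],
    "temperature": 0.6,
    "max_tokens": 512,
}
print(json.dumps(payload, indent=2))
```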
Transformers Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# trust_remote_code is required for the KimiK25 architecture
model = AutoModelForCausalLM.from_pretrained(
    "0xsero/Kimi-K2.5-PRISM-REAP-72",
    trust_remote_code=True,
    dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(
    "0xsero/Kimi-K2.5-PRISM-REAP-72",
    trust_remote_code=True,
)

messages = [{"role": "user", "content": "What is the capital of France?"}]
# thinking=False renders the chat template without the reasoning trace
inputs = tokenizer.apply_chat_template(messages, return_tensors="pt", thinking=False)
inputs = inputs.to(model.device)

outputs = model.generate(inputs, max_new_tokens=512, temperature=0.6, do_sample=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Pruning Details

This model was created in a two-stage process:

  1. Stage 1 (Ex0bit): REAP pruning of original 384 experts to 192 experts using saliency scores from 512 calibration samples on allenai/tulu-3-sft-mixture
  2. Stage 2 (this model): Further pruning from 192 to 72 experts using the same REAP saliency scores, targeting consumer GPU deployment
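
Because both stages reuse the same frozen saliency scores, pruning 384 -> 192 -> 72 keeps exactly the experts a direct top-72 cut would: the top-k of a top-m subset (k <= m) under one fixed ordering is the global top-k. A toy check of that set identity, with random scores standing in for real saliencies:

```python
# Two-stage top-k selection with fixed scores equals direct top-k.
import numpy as np

rng = np.random.default_rng(0)
saliency = rng.random(384)                    # one score per original expert

top192 = np.argsort(saliency)[-192:]          # stage 1 keep-set
stage2 = top192[np.argsort(saliency[top192])[-72:]]  # stage 2: top-72 of those
direct = np.argsort(saliency)[-72:]           # direct top-72 of all 384

print("two-stage == direct:", set(stage2) == set(direct))
```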

Key Technical Details

  • Per-layer top-72 selection: The 72 most salient experts retained independently per layer
  • Gate weight slicing: Router gate weights [192, 7168] sliced to [72, 7168], e_score_correction_bias from [192] to [72]
  • Contiguous expert remapping: Expert indices remapped to 0-71 in each layer
  • All non-expert weights preserved: Attention (MLA), shared expert, embeddings, and LM head unchanged
  • Saliency ordering verified: in every layer, min(retained saliency) > max(pruned saliency), confirming the kept set is exactly the top 72
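
The gate-slicing and remapping steps above can be sketched as follows; the tensor names and random contents are illustrative stand-ins, not the real checkpoint layout:

```python
# Per-layer surgery sketch: slice router weights to the kept experts and
# renumber them contiguously.
import numpy as np

rng = np.random.default_rng(0)
n_old, n_new, hidden = 192, 72, 7168
gate_weight = rng.standard_normal((n_old, hidden))  # router gate [192, 7168]
bias = rng.standard_normal(n_old)                   # e_score_correction_bias [192]
saliency = rng.random(n_old)                        # REAP scores for this layer

keep = np.sort(np.argsort(saliency)[-n_new:])       # top-72 indices, ascending
gate_weight = gate_weight[keep]                     # -> [72, 7168]
bias = bias[keep]                                   # -> [72]
remap = {int(old): new for new, old in enumerate(keep)}  # old index -> 0..71

print(gate_weight.shape, bias.shape)
```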

What is REAP?

REAP (Cerebras Research, 2025) is a one-shot expert pruning method for MoE models. It scores each expert by the average router-weighted magnitude of its output:

$$S_j = \frac{1}{|X_j|} \sum_{x \in X_j} g_j(x)\,\lVert f_j(x) \rVert_2$$

where $g_j(x)$ is the normalized gate weight, $\lVert f_j(x) \rVert_2$ is the L2 norm of expert $j$'s output, and $X_j$ is the set of calibration tokens routed to expert $j$. Experts with the lowest scores $S_j$ are pruned.
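
A minimal sketch of that score for one expert, using toy activations in place of real calibration data:

```python
# REAP saliency for one expert j: mean over its routed tokens of
# gate weight times the L2 norm of the expert's output.
import numpy as np

rng = np.random.default_rng(0)
n_tokens, hidden = 16, 7168
gate = rng.random(n_tokens)                           # g_j(x), normalized router weights
expert_out = rng.standard_normal((n_tokens, hidden))  # f_j(x), expert outputs

s_j = np.mean(gate * np.linalg.norm(expert_out, axis=1))
print(f"S_j = {s_j:.3f}")
```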

What is PRISM?

The base model was treated with the PRISM-LITE pipeline, softening over-refusal and bias behaviors while preserving model quality.

Optimization Notes

  • Expert Parallelism (EP) is critical on PCIe GPUs: it reduces per-GPU model memory from ~23 GB to ~17 GB
  • TRITON_MLA is the only MLA backend available on Ampere (CC 8.6)
  • FP8 KV cache is not supported with MLA on Ampere; MLA's built-in KV compression (kv_lora_rank=512) already provides ~14x efficiency vs standard MHA
  • Uniform GPU power limits prevent synchronization stalls in TP/EP configurations
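
The KV-cache headroom can be estimated from the MLA figures; kv_lora_rank=512 comes from the table above, while the 64-dim RoPE component and the 61-layer total (60 MoE + 1 dense) are assumptions based on the DeepSeek-V3 backbone:

```python
# Rough MLA KV-cache size at the 57,344-token max context, assuming a
# bf16 cache of (kv_lora_rank + rope_dim) elements per token per layer.
kv_lora_rank, rope_dim = 512, 64
layers, max_ctx, bytes_per_elem = 61, 57_344, 2

per_token = (kv_lora_rank + rope_dim) * layers * bytes_per_elem  # bytes/token
total_gb = per_token * max_ctx / 1e9
print(f"{per_token} B/token, ~{total_gb:.1f} GB at full context")
```

A few GB for the full-context cache is what leaves the launch command room to run at 57,344 tokens on 24 GB cards.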

Citation

@article{reap2025,
  title={REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression},
  author={Cerebras Research},
  journal={arXiv preprint arXiv:2510.13999},
  year={2025}
}

Support

If this work is useful, support Sybil Solutions here: https://donate.sybilsolutions.ai
