Kimi-K2.5-PRISM-REAP-72

WEIGHTS ARE BROKEN! DO NOT USE THIS FOR PRODUCTION SYSTEMS. It is kept for research purposes only: an ~80% prune of an INT4-quantized model.

81% REAP expert-pruned version of moonshotai/Kimi-K2.5, further pruned from the PRISM-REAP 192-expert variant. Designed to fit on 8x RTX 3090 (24GB) consumer GPUs.

| Property | Value |
|----------|-------|
| Architecture | KimiK25 (DeepSeekV3 backbone, MLA attention) |
| Total Parameters | ~200B (down from ~1T) |
| Active Parameters | ~32B (8 experts per token) |
| Experts per MoE Layer | 72 routed + 1 shared (down from 384 + 1) |
| MoE Layers | 60 (layers 1-60; layer 0 is dense) |
| Hidden Size | 7168 |
| Attention | MLA (kv_lora_rank=512, q_lora_rank=1536) |
| Quantization | W4A16 (group_size=32, symmetric) via compressed-tensors |
| Disk Size | 122 GB (down from 289 GB / 555 GB original) |
| Pruning Method | REAP (Router-weighted Expert Activation Pruning) |
| Vision | Supported (inherited from Kimi-K2.5) |

Why 72 Experts?

72 was chosen because:

  • Divisible by 8: Clean sharding across 8 GPUs for TP/EP
  • ~122 GB total: Fits in 8x 24GB with room for KV cache
  • ~15 GB/GPU weight footprint with Expert Parallelism, leaving ~7 GB for KV cache and overhead
  • Retains the top 72 most salient experts per layer from the original 384
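
The per-GPU figures above follow directly from the checkpoint size; a quick sanity check using only the numbers in this card:

```python
# Back-of-envelope check of the 8x 24 GB fit, using the model card's
# figures (not measured values).
disk_size_gb = 122        # W4A16 checkpoint size on disk
num_gpus = 8
vram_per_gpu_gb = 24

weights_per_gpu = disk_size_gb / num_gpus             # 15.25 GB with EP
headroom_per_gpu = vram_per_gpu_gb - weights_per_gpu  # 8.75 GB; KV cache
                                                      # plus runtime overhead
                                                      # must fit here
print(f"weights/GPU: {weights_per_gpu:.2f} GB, headroom: {headroom_per_gpu:.2f} GB")
```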

Performance (8x RTX 3090, 155W, vLLM 0.15.1)

| Metric | Value |
|--------|-------|
| Single request | 33.4 tok/s |
| 2 concurrent | 52.5 tok/s |
| 4 concurrent | 86.2 tok/s |
| 8 concurrent | 145.5 tok/s |
| TTFT (time to first token) | 0.08 s |
| Max context | 57,344 tokens |
| Vision | Working |

Recommended vLLM Launch (8x RTX 3090)

```shell
VLLM_ATTENTION_BACKEND=TRITON_MLA vllm serve 0xsero/Kimi-K2.5-PRISM-REAP-72 \
  --tensor-parallel-size 8 \
  --enable-expert-parallel \
  --max-model-len 57344 \
  --gpu-memory-utilization 0.95 \
  --max-num-seqs 16 \
  --trust-remote-code \
  --tool-call-parser kimi_k2 \
  --reasoning-parser kimi_k2 \
  --enable-auto-tool-choice \
  --enable-chunked-prefill \
  --enable-prefix-caching
```
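
Once the server is up, requests go to its OpenAI-compatible chat endpoint (vLLM's default is `http://localhost:8000/v1/chat/completions`; adjust host/port if you changed them). A minimal sketch of the request body, which works the same with `curl`, `requests`, or the `openai` client pointed at that base URL:

```python
# Build the JSON body for a chat request to the vLLM OpenAI-compatible
# endpoint. Sampling parameters mirror the transformers example below.
import json

payload = {
    "model": "0xsero/Kimi-K2.5-PRISM-REAP-72",
    "messages": [{"role": "user", "content": "What is the capital of France?"}],
    "temperature": 0.6,
    "max_tokens": 512,
}
print(json.dumps(payload, indent=2))
```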
Transformers Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# trust_remote_code is required for the KimiK25 architecture
model = AutoModelForCausalLM.from_pretrained(
    "0xsero/Kimi-K2.5-PRISM-REAP-72",
    trust_remote_code=True,
    dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(
    "0xsero/Kimi-K2.5-PRISM-REAP-72",
    trust_remote_code=True,
)

messages = [{"role": "user", "content": "What is the capital of France?"}]
# thinking=False renders the chat template without the reasoning trace
inputs = tokenizer.apply_chat_template(messages, return_tensors="pt", thinking=False)
inputs = inputs.to(model.device)

outputs = model.generate(inputs, max_new_tokens=512, temperature=0.6, do_sample=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Pruning Details

This model was created in a two-stage process:

  1. Stage 1 (Ex0bit): REAP pruning of original 384 experts to 192 experts using saliency scores from 512 calibration samples on allenai/tulu-3-sft-mixture
  2. Stage 2 (this model): Further pruning from 192 to 72 experts using the same REAP saliency scores, targeting consumer GPU deployment
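
Because both stages reuse the same frozen saliency scores, pruning 384 -> 192 -> 72 keeps exactly the experts a direct top-72 cut would: the top-k of a top-m subset (k <= m) under one fixed ordering is the global top-k. A toy check of that set identity, with random scores standing in for real saliencies:

```python
# Two-stage top-k selection with fixed scores equals direct top-k.
import numpy as np

rng = np.random.default_rng(0)
saliency = rng.random(384)                    # one score per original expert

top192 = np.argsort(saliency)[-192:]          # stage 1 keep-set
stage2 = top192[np.argsort(saliency[top192])[-72:]]  # stage 2: top-72 of those
direct = np.argsort(saliency)[-72:]           # direct top-72 of all 384

print("two-stage == direct:", set(stage2) == set(direct))
```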

Key Technical Details

  • Per-layer top-72 selection: The 72 most salient experts retained independently per layer
  • Gate weight slicing: Router gate weights [192, 7168] sliced to [72, 7168], e_score_correction_bias from [192] to [72]
  • Contiguous expert remapping: Expert indices remapped to 0-71 in each layer
  • All non-expert weights preserved: Attention (MLA), shared expert, embeddings, and LM head unchanged
  • Saliency ordering verified: in every layer, min(retained saliency) > max(pruned saliency), confirming the kept set is exactly the top 72
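
The gate-slicing and remapping steps above can be sketched as follows; the tensor names and random contents are illustrative stand-ins, not the real checkpoint layout:

```python
# Per-layer surgery sketch: slice router weights to the kept experts and
# renumber them contiguously.
import numpy as np

rng = np.random.default_rng(0)
n_old, n_new, hidden = 192, 72, 7168
gate_weight = rng.standard_normal((n_old, hidden))  # router gate [192, 7168]
bias = rng.standard_normal(n_old)                   # e_score_correction_bias [192]
saliency = rng.random(n_old)                        # REAP scores for this layer

keep = np.sort(np.argsort(saliency)[-n_new:])       # top-72 indices, ascending
gate_weight = gate_weight[keep]                     # -> [72, 7168]
bias = bias[keep]                                   # -> [72]
remap = {int(old): new for new, old in enumerate(keep)}  # old index -> 0..71

print(gate_weight.shape, bias.shape)
```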

What is REAP?

REAP (Cerebras Research, 2025) is a one-shot expert pruning method for MoE models. It scores each expert by the average router-weighted magnitude of its output:

$$S_j = \frac{1}{|X_j|} \sum_{x \in X_j} g_j(x)\,\lVert f_j(x) \rVert_2$$

where $g_j(x)$ is the normalized gate weight, $\lVert f_j(x) \rVert_2$ is the L2 norm of expert $j$'s output, and $X_j$ is the set of calibration tokens routed to expert $j$. Experts with the lowest scores $S_j$ are pruned.
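
A minimal sketch of that score for one expert, using toy activations in place of real calibration data:

```python
# REAP saliency for one expert j: mean over its routed tokens of
# gate weight times the L2 norm of the expert's output.
import numpy as np

rng = np.random.default_rng(0)
n_tokens, hidden = 16, 7168
gate = rng.random(n_tokens)                           # g_j(x), normalized router weights
expert_out = rng.standard_normal((n_tokens, hidden))  # f_j(x), expert outputs

s_j = np.mean(gate * np.linalg.norm(expert_out, axis=1))
print(f"S_j = {s_j:.3f}")
```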

What is PRISM?

The base model was treated with the PRISM-LITE pipeline, softening over-refusal and bias behaviors while preserving model quality.

Optimization Notes

  • Expert Parallelism (EP) is critical on PCIe GPUs: it reduces per-GPU model memory from ~23 GB to ~17 GB
  • TRITON_MLA is the only MLA backend available on Ampere (CC 8.6)
  • FP8 KV cache is not supported with MLA on Ampere; MLA's built-in KV compression (kv_lora_rank=512) already provides ~14x efficiency vs standard MHA
  • Uniform GPU power limits prevent synchronization stalls in TP/EP configurations
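
The KV-cache headroom can be estimated from the MLA figures; kv_lora_rank=512 comes from the table above, while the 64-dim RoPE component and the 61-layer total (60 MoE + 1 dense) are assumptions based on the DeepSeek-V3 backbone:

```python
# Rough MLA KV-cache size at the 57,344-token max context, assuming a
# bf16 cache of (kv_lora_rank + rope_dim) elements per token per layer.
kv_lora_rank, rope_dim = 512, 64
layers, max_ctx, bytes_per_elem = 61, 57_344, 2

per_token = (kv_lora_rank + rope_dim) * layers * bytes_per_elem  # bytes/token
total_gb = per_token * max_ctx / 1e9
print(f"{per_token} B/token, ~{total_gb:.1f} GB at full context")
```

A few GB for the full-context cache is what leaves the launch command room to run at 57,344 tokens on 24 GB cards.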

Citation

@article{reap2025,
  title={REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression},
  author={Cerebras Research},
  journal={arXiv preprint arXiv:2510.13999},
  year={2025}
}

Support

If this work is useful, support Sybil Solutions here: https://donate.sybilsolutions.ai
