# Qwen3.5 MoE 2.22B (from Qwen3.5-2B)

A Qwen3.5 Mixture-of-Experts model created via dual-source weight transfer: the backbone is copied from the dense Qwen3.5-2B model, and the MoE experts are taken from Qwen3.5-35B-A3B.

## Model Details

| Property | Value |
|---|---|
| Total Parameters | 2,220,761,920 (2.22B) |
| Active Parameters | 1,612,063,552 (1.61B) |
| Architecture | Qwen3.5 Hybrid MoE |
| Experts | 8 routed + 1 shared, top-2 |
| Hidden Size | 2048 |
| Layers | 24 (hybrid: DeltaNet + full attention) |
| Attention | GQA 8Q / 2KV, head_dim=256 |
| Context | 262,144 tokens |
| Vocab | 248,320 |
| Dtype | bfloat16 |
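The attention configuration above (8 query heads sharing 2 KV heads, head_dim 256) can be sketched as a minimal grouped-query attention forward pass. This is an illustrative NumPy sketch only, not the model's actual implementation (which additionally applies RoPE, causal masking, and output projection):

```python
import numpy as np

def gqa_attention(q, k, v):
    """Grouped-query attention: n_q query heads share n_kv KV heads.

    q: (n_q, T, d); k, v: (n_kv, T, d). Illustrative only: no RoPE,
    no causal mask, no output projection.
    """
    n_q, T, d = q.shape
    n_kv = k.shape[0]
    group = n_q // n_kv               # queries per KV head (8 // 2 = 4)
    k = np.repeat(k, group, axis=0)   # broadcast KV heads to match query heads
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v                      # (n_q, T, d)

rng = np.random.default_rng(0)
T, d, n_q, n_kv = 4, 256, 8, 2
out = gqa_attention(rng.standard_normal((n_q, T, d)),
                    rng.standard_normal((n_kv, T, d)),
                    rng.standard_normal((n_kv, T, d)))
print(out.shape)  # (8, 4, 256)
```

Because the 2 KV heads are shared across 4 query heads each, the KV cache is 4x smaller than with full multi-head attention at the same number of query heads.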

## Design

Total MoE FFN parameters are approximately equal to the dense model's FFN parameters. The speed benefit comes from sparsity: only the top-2 routed experts plus the shared expert are active per token (~1/3 of total FFN parameters).

Most weights are pre-trained (backbone from the dense model, experts from 35B-A3B). Only the MoE dimension resize introduces noise, making this model suitable for fine-tuning at nominal cost.
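The sparsity argument can be illustrated with a toy top-2 router. This is a hypothetical sketch (simplified gating: softmax over only the selected experts' logits), not the model's routing code:

```python
import numpy as np

def moe_ffn(x, router_w, routed_experts, shared_expert, top_k=2):
    """Toy top-2 MoE FFN: route each token to top_k routed experts,
    and always apply the shared expert. Illustrative only.
    """
    logits = x @ router_w                           # (T, n_experts)
    top = np.argsort(logits, axis=-1)[:, -top_k:]   # top-2 expert ids per token
    gates = np.take_along_axis(logits, top, axis=-1)
    gates = np.exp(gates) / np.exp(gates).sum(-1, keepdims=True)
    out = shared_expert(x)                          # shared expert always active
    for t in range(x.shape[0]):                     # routed experts: top-2 only
        for slot in range(top_k):
            e = top[t, slot]
            out[t] += gates[t, slot] * routed_experts[e](x[t])
    return out

# 8 routed experts + 1 shared; 3 of 9 equal-size expert FFNs run per
# token, hence ~1/3 of total FFN parameters are active.
rng = np.random.default_rng(0)
hidden, n_experts, T = 16, 8, 4
make_expert = lambda W: (lambda h: h @ W)           # tiny linear "expert"
experts = [make_expert(rng.standard_normal((hidden, hidden)) * 0.1)
           for _ in range(n_experts)]
shared = make_expert(rng.standard_normal((hidden, hidden)) * 0.1)
router = rng.standard_normal((hidden, n_experts))
y = moe_ffn(rng.standard_normal((T, hidden)), router, experts, shared)
print(y.shape)  # (4, 16)
```

With 8 routed experts of equal size plus 1 shared expert, 2 routed + 1 shared = 3 of 9 expert FFNs execute per token, matching the ~1/3 active-FFN figure above.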

## Weight Transfer Sources

| Component | Source | Strategy |
|---|---|---|
| Embeddings, LM head | Qwen/Qwen3.5-2B | Exact copy |
| Attention (Q/K/V/O, norms) | Qwen/Qwen3.5-2B | Exact copy |
| DeltaNet (linear attention) | Qwen/Qwen3.5-2B | Exact copy |
| Vision encoder | Qwen/Qwen3.5-2B | Exact copy |
| Layer norms | Qwen/Qwen3.5-2B | Exact copy |
| Routed experts | Qwen3.5-35B-A3B | Slice 256 → 8, bilinear resize |
| Shared expert | Qwen3.5-35B-A3B | Bilinear resize |
| Router | Qwen3.5-35B-A3B | Slice + resize |
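The expert transfer amounts to keeping a subset of the donor's 256 routed experts and bilinearly resizing each weight matrix to the smaller model's dimensions. Below is a minimal NumPy sketch of such a resize (align-corners style interpolation; shapes are made up for illustration, and this is not the actual transfer script):

```python
import numpy as np

def bilinear_resize(w, new_shape):
    """Resize a 2-D weight matrix by linear interpolation along each
    axis (align-corners style). Illustrative stand-in for the bilinear
    resize used during weight transfer.
    """
    old_r, old_c = w.shape
    new_r, new_c = new_shape
    rows = np.linspace(0.0, old_r - 1, new_r)   # sample positions per axis
    cols = np.linspace(0.0, old_c - 1, new_c)
    tmp = np.stack([np.interp(rows, np.arange(old_r), w[:, j])
                    for j in range(old_c)], axis=1)   # resize along rows
    out = np.stack([np.interp(cols, np.arange(old_c), tmp[i])
                    for i in range(new_r)], axis=0)   # resize along columns
    return out

# Hypothetical donor shapes: slice experts 256 -> 8, then resize each matrix.
donor_experts = [np.random.default_rng(i).standard_normal((6, 10))
                 for i in range(256)]
kept = donor_experts[:8]                              # slice 256 -> 8
resized = [bilinear_resize(w, (4, 8)) for w in kept]
print(resized[0].shape)  # (4, 8)
```

Align-corners interpolation preserves the corner entries of each matrix exactly, so the resized experts stay close to the donor weights rather than being re-initialized, which is why only this step "introduces noise" in the Design section above.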

## License

Apache 2.0 (following source models)
