# Qwen3.5 MoE 2.22B (from Qwen3.5-2B)

A Qwen3.5 Mixture-of-Experts model created via dual-source weight transfer: the backbone is copied from the dense Qwen3.5-2B model, and the MoE experts are taken from Qwen3.5-35B-A3B.

## Model Details

| Property | Value |
|---|---|
| Total Parameters | 2,220,761,920 (2.22B) |
| Active Parameters | 1,612,063,552 (1.61B) |
| Architecture | Qwen3.5 Hybrid MoE |
| Experts | 8 routed + 1 shared, top-2 |
| Hidden Size | 2048 |
| Layers | 24 (hybrid: DeltaNet + full attention) |
| Attention | GQA 8Q / 2KV, head_dim=256 |
| Context | 262,144 tokens |
| Vocab | 248,320 |
| Dtype | bfloat16 |
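The attention configuration above (8 query heads sharing 2 KV heads, head_dim 256) can be sketched as a minimal grouped-query attention forward pass. This is an illustrative NumPy sketch only, not the model's actual implementation (which additionally applies RoPE, causal masking, and output projection):

```python
import numpy as np

def gqa_attention(q, k, v):
    """Grouped-query attention: n_q query heads share n_kv KV heads.

    q: (n_q, T, d); k, v: (n_kv, T, d). Illustrative only: no RoPE,
    no causal mask, no output projection.
    """
    n_q, T, d = q.shape
    n_kv = k.shape[0]
    group = n_q // n_kv               # queries per KV head (8 // 2 = 4)
    k = np.repeat(k, group, axis=0)   # broadcast KV heads to match query heads
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v                      # (n_q, T, d)

rng = np.random.default_rng(0)
T, d, n_q, n_kv = 4, 256, 8, 2
out = gqa_attention(rng.standard_normal((n_q, T, d)),
                    rng.standard_normal((n_kv, T, d)),
                    rng.standard_normal((n_kv, T, d)))
print(out.shape)  # (8, 4, 256)
```

Because the 2 KV heads are shared across 4 query heads each, the KV cache is 4x smaller than with full multi-head attention at the same number of query heads.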

## Design

Total MoE FFN parameters are approximately equal to the dense model's FFN parameters. The speed benefit comes from sparsity: only the top-2 routed experts plus the shared expert are active per token (~1/3 of total FFN parameters).

Most weights are pre-trained (backbone from the dense model, experts from 35B-A3B). Only the MoE dimension resize introduces noise, making this model suitable for fine-tuning at nominal cost.
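The sparsity argument can be illustrated with a toy top-2 router. This is a hypothetical sketch (simplified gating: softmax over only the selected experts' logits), not the model's routing code:

```python
import numpy as np

def moe_ffn(x, router_w, routed_experts, shared_expert, top_k=2):
    """Toy top-2 MoE FFN: route each token to top_k routed experts,
    and always apply the shared expert. Illustrative only.
    """
    logits = x @ router_w                           # (T, n_experts)
    top = np.argsort(logits, axis=-1)[:, -top_k:]   # top-2 expert ids per token
    gates = np.take_along_axis(logits, top, axis=-1)
    gates = np.exp(gates) / np.exp(gates).sum(-1, keepdims=True)
    out = shared_expert(x)                          # shared expert always active
    for t in range(x.shape[0]):                     # routed experts: top-2 only
        for slot in range(top_k):
            e = top[t, slot]
            out[t] += gates[t, slot] * routed_experts[e](x[t])
    return out

# 8 routed experts + 1 shared; 3 of 9 equal-size expert FFNs run per
# token, hence ~1/3 of total FFN parameters are active.
rng = np.random.default_rng(0)
hidden, n_experts, T = 16, 8, 4
make_expert = lambda W: (lambda h: h @ W)           # tiny linear "expert"
experts = [make_expert(rng.standard_normal((hidden, hidden)) * 0.1)
           for _ in range(n_experts)]
shared = make_expert(rng.standard_normal((hidden, hidden)) * 0.1)
router = rng.standard_normal((hidden, n_experts))
y = moe_ffn(rng.standard_normal((T, hidden)), router, experts, shared)
print(y.shape)  # (4, 16)
```

With 8 routed experts of equal size plus 1 shared expert, 2 routed + 1 shared = 3 of 9 expert FFNs execute per token, matching the ~1/3 active-FFN figure above.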

## Weight Transfer Sources

| Component | Source | Strategy |
|---|---|---|
| Embeddings, LM head | Qwen/Qwen3.5-2B | Exact copy |
| Attention (Q/K/V/O, norms) | Qwen/Qwen3.5-2B | Exact copy |
| DeltaNet (linear attention) | Qwen/Qwen3.5-2B | Exact copy |
| Vision encoder | Qwen/Qwen3.5-2B | Exact copy |
| Layer norms | Qwen/Qwen3.5-2B | Exact copy |
| Routed experts | Qwen3.5-35B-A3B | Slice 256 → 8, bilinear resize |
| Shared expert | Qwen3.5-35B-A3B | Bilinear resize |
| Router | Qwen3.5-35B-A3B | Slice + resize |
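The expert transfer amounts to keeping a subset of the donor's 256 routed experts and bilinearly resizing each weight matrix to the smaller model's dimensions. Below is a minimal NumPy sketch of such a resize (align-corners style interpolation; shapes are made up for illustration, and this is not the actual transfer script):

```python
import numpy as np

def bilinear_resize(w, new_shape):
    """Resize a 2-D weight matrix by linear interpolation along each
    axis (align-corners style). Illustrative stand-in for the bilinear
    resize used during weight transfer.
    """
    old_r, old_c = w.shape
    new_r, new_c = new_shape
    rows = np.linspace(0.0, old_r - 1, new_r)   # sample positions per axis
    cols = np.linspace(0.0, old_c - 1, new_c)
    tmp = np.stack([np.interp(rows, np.arange(old_r), w[:, j])
                    for j in range(old_c)], axis=1)   # resize along rows
    out = np.stack([np.interp(cols, np.arange(old_c), tmp[i])
                    for i in range(new_r)], axis=0)   # resize along columns
    return out

# Hypothetical donor shapes: slice experts 256 -> 8, then resize each matrix.
donor_experts = [np.random.default_rng(i).standard_normal((6, 10))
                 for i in range(256)]
kept = donor_experts[:8]                              # slice 256 -> 8
resized = [bilinear_resize(w, (4, 8)) for w in kept]
print(resized[0].shape)  # (4, 8)
```

Align-corners interpolation preserves the corner entries of each matrix exactly, so the resized experts stay close to the donor weights rather than being re-initialized, which is why only this step "introduces noise" in the Design section above.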

## License

Apache 2.0 (following source models)
