# Qwen3.5-397B-A17B — Unsloth Dynamic AWQ (4-bit, mlx-node)
Mixed-precision 4-bit quantization of Qwen/Qwen3.5-397B-A17B for Apple Silicon, using the Unsloth Dynamic quantization strategy via mlx-node.
This model adapts the approach pioneered by @Brooooooklyn for porting Unsloth dynamic quantizations to MLX format. See the Brooooooklyn Qwen3.5 Unsloth MLX collection for the reference implementations that made this possible.
| | Original (BF16) | This Model |
|---|---|---|
| Parameters | 397B total / 17B active | 397B total / 17B active |
| Size | ~794 GB | ~228 GB |
| Format | SafeTensors | SafeTensors (31 shards) |
| Precision | BF16 uniform | Mixed 4/5/6-bit + BF16 |
| Reduction | — | ~71% |
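The size row implies an effective bit width noticeably above a uniform 4-bit pack, which is consistent with the mixed 4/5/6-bit recipe. A quick back-of-envelope check using only the figures in the table above:

```python
# Sanity-check the size and reduction rows of the table above.
bf16_gb = 794.0   # original BF16 checkpoint
quant_gb = 228.0  # this model

reduction = 1.0 - quant_gb / bf16_gb
avg_bits = 16.0 * quant_gb / bf16_gb  # effective bits per weight

print(f"reduction ~{reduction:.0%}, ~{avg_bits:.1f} bits/weight")
# reduction ~71%, ~4.6 bits/weight
```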
## Validation
Structural validation against mlx-community/Qwen3.5-397B-A17B-4bit:
- All 8 linear attention diagnostic groups PASS (cosine ≥ 0.995)
- A_log and dt_bias: cosine = 1.000
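The structural checks above compare tensors from this conversion against the reference mlx-community 4-bit conversion using cosine similarity, with 0.995 as the pass threshold. A minimal sketch of the metric (function name is ours, not from the validation tooling):

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two tensors, flattened, in float64 for stability."""
    a = np.asarray(a, dtype=np.float64).ravel()
    b = np.asarray(b, dtype=np.float64).ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```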
Functional validation:
- ToolCall-15 benchmark: 26/30 (87%) — A:6/6, B:5/6, C:6/6, D:5/6, E:4/6
- Coherent text generation, tool calling, and structured output confirmed
## Quantization Strategy
This model's quantization recipe is based on the Unsloth team's extensive per-layer KL divergence benchmarking of the Qwen3.5 architecture. Their work — published as Unsloth Dynamic 2.0 — is the most thorough public analysis of how Qwen3.5's hybrid GatedDeltaNet + full attention design responds to quantization, and the foundation for every decision in this model.
### Why Qwen3.5 Needs Special Treatment
Qwen3.5 is not a standard transformer. It uses a hybrid architecture: GatedDeltaNet linear attention layers + standard full attention layers (`full_attention_interval=4`). Standard uniform quantization treats all layers equally, but Unsloth's benchmarks revealed that this architecture has fundamentally different sensitivity profiles across layer types:
- Quantizing `ssm_out` (`linear_attn.out_proj`) at Q2_K "does dramatically worse" — KLD spikes far beyond other components
- Attention tensors (`self_attn.*`) are "especially sensitive for hybrid architectures"
- Attention gates (`linear_attn.in_proj_z`) — MXFP4 "performs poorly" on these
- FFN gate/up projections are "generally ok to quantize to 3-bit"
- FFN `down_proj` is "slightly more sensitive" than gate/up — benefits from an extra bit
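A per-layer recipe like this typically boils down to matching tensor names against patterns and assigning each a bit width. The sketch below is illustrative only: the pattern list and bit assignments reflect the sensitivities listed above, not the exact published Unsloth recipe, which is more fine-grained.

```python
import re

# Illustrative per-tensor bit-width table (NOT the exact Unsloth recipe).
# Most-sensitive tensors get more bits; FFN gate/up tolerate fewer.
RECIPE = [
    (r"linear_attn\.out_proj",   6),  # ssm_out: degrades dramatically at low bits
    (r"self_attn\.",             6),  # full-attention tensors: sensitive in hybrids
    (r"linear_attn\.in_proj_z",  6),  # attention gates: avoid aggressive formats
    (r"\.down_proj",             5),  # slightly more sensitive than gate/up
    (r"\.(gate|up)_proj",        4),  # generally fine at low bits
]

def assign_bits(tensor_name: str, default: int = 4) -> int:
    """Return the bit width for a tensor by first matching pattern."""
    for pattern, bits in RECIPE:
        if re.search(pattern, tensor_name):
            return bits
    return default
```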
## GGUF→MLX Conversion: Multi-Level Head Deinterleave Fix
Converting from GGUF to MLX requires fixing the head deinterleave applied by llama.cpp to the linear attention (GatedDeltaNet) tensors. GGUF stores the value-head dimension in a recursively deinterleaved order. For models where n_value_heads / n_key_heads > 2, the deinterleave has depth log₂(n_value_heads / n_key_heads).
For this model: n_value_heads=64, n_key_heads=16 → depth = log₂(64/16) = 2 levels.
The converter applies multi-level reinterleave operations at the correct depth for all linear attention tensors:
- `A_log`, `dt_bias` — 1D per-head tensors
- `in_proj_a`, `in_proj_b`, `in_proj_z`, `out_proj` — 2D weight matrices
- `in_proj_qkv` and `conv1d` — V-portion rows only
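A recursive deinterleave can be inverted by applying the inverse permutation once per level. The numpy sketch below is conceptual: the even/odd grouping convention is assumed for illustration and may not match llama.cpp's exact head ordering, and both function names are ours.

```python
import numpy as np

def deinterleave_once(w, n_heads):
    """One level: heads [0,1,2,3,...] -> [0,2,4,...,1,3,5,...] (assumed convention)."""
    h = w.reshape(n_heads // 2, 2, -1)
    return h.transpose(1, 0, 2).reshape(n_heads, -1)

def reinterleave(w, n_heads, depth):
    """Invert `depth` stacked deinterleave passes.

    For this model: depth = log2(n_value_heads / n_key_heads) = log2(64 / 16) = 2.
    """
    h = w.reshape(n_heads, -1)
    for _ in range(depth):
        h = h.reshape(2, n_heads // 2, -1).transpose(1, 0, 2).reshape(n_heads, -1)
    return h
```

The key property is the round trip: two deinterleave passes followed by `reinterleave(..., depth=2)` recover the original head order exactly.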
## Architecture
Qwen3.5-397B-A17B is a decoder-only MoE transformer with hybrid attention:
| Parameter | Value |
|---|---|
| Hidden size | 4,096 |
| Layers | 60 (45 linear + 15 full attention) |
| Attention heads | 16 |
| KV heads (full attn) | 4 (GQA 4:1) |
| Head dimension | 256 (full attn), 128 (linear attn) |
| Linear value heads | 64 |
| Linear key heads | 16 |
| MoE experts | 512 per layer, 8 active per token |
| Vocab size | 248,320 |
| Max context | 262,144 tokens |
| Total parameters | ~397B |
| Active parameters | ~17B per token |
Layer pattern (repeating): `[linear, linear, linear, full, ...]`
- Linear attention layers use GatedDeltaNet: depthwise Conv1d + gated delta recurrence (state-space model)
- Full attention layers use standard grouped-query attention with KV caching
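The repeating schedule above follows directly from `full_attention_interval=4`: every fourth layer is full attention, the rest are linear. A small sketch generating it (the full layer's offset within each group of four is taken from the pattern shown above):

```python
# Reconstruct the layer schedule: 60 layers, every 4th one full attention.
FULL_ATTENTION_INTERVAL = 4
NUM_LAYERS = 60

layer_types = [
    "full" if (i + 1) % FULL_ATTENTION_INTERVAL == 0 else "linear"
    for i in range(NUM_LAYERS)
]
# Yields 45 linear + 15 full attention layers, matching the table above.
```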
## Usage
```bash
# With vMLX
vmlx serve jackzampolin/Qwen3.5-397B-A17B-unsloth-mlx-4bit \
  --port 8000 --host 0.0.0.0 --max-tokens 8192 \
  --continuous-batching --enable-prefix-cache \
  --enable-auto-tool-choice --tool-call-parser qwen3
```

```bash
# With mlx-lm
python -m mlx_lm.generate \
  --model jackzampolin/Qwen3.5-397B-A17B-unsloth-mlx-4bit \
  --prompt "Hello, world!"
```
Target hardware: Apple M4 Ultra with 512 GB unified memory.
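One reason the 512 GB target is comfortable: only the 15 full-attention layers keep a KV cache, since the linear attention layers carry fixed-size recurrent state instead. A rough worst-case estimate at full context, assuming a BF16 cache (the cache dtype is an assumption, not stated above):

```python
# Worst-case KV cache for the 15 full-attention layers at max context,
# using the architecture table above (GQA: 4 KV heads, head_dim=256).
full_layers = 15
kv_heads = 4
head_dim = 256
context = 262_144
bytes_per_elem = 2  # bfloat16 (assumed cache dtype)

kv_bytes = full_layers * 2 * kv_heads * head_dim * context * bytes_per_elem  # 2 = K and V
print(f"{kv_bytes / 1e9:.1f} GB")  # 16.1 GB
```

So even at the full 262,144-token context, weights plus cache stay around 244 GB, leaving ample headroom on a 512 GB machine.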
## How It Was Made
Converted from unsloth/Qwen3.5-397B-A17B-GGUF (BF16 GGUF with imatrix) using mlx-node's GGUF→MLX conversion pipeline:
```bash
mlx convert \
  --input Qwen3.5-397B-A17B-BF16-00001-of-00017.gguf \
  --output Qwen3.5-397B-A17B-unsloth-mlx-4bit \
  --dtype bfloat16 --quantize \
  --q-bits 4 --q-group-size 64 --q-recipe unsloth \
  --imatrix-path imatrix_unsloth.gguf_file
```
## Acknowledgments
- Brooooooklyn (Qwen3.5 Unsloth MLX collection) — Pioneered the approach of porting Unsloth dynamic quantizations to MLX format. Their reference implementations were used to validate tensor correctness of this conversion.
- Unsloth (GitHub) — The quantization strategy is based on Unsloth's per-layer KL divergence benchmarks and their Dynamic 2.0 methodology. We use their published imatrix GGUF files for AWQ pre-scaling.
- mlx-node — The GGUF→MLX converter with multi-level head deinterleave support for Qwen3.5 linear attention.
- Qwen Team — For the Qwen3.5 model family and hybrid attention architecture.
- Apple MLX — For the Metal-accelerated ML framework.
## License
This model inherits the Apache 2.0 license from the base Qwen3.5-397B-A17B model.