Qwen3.5-397B-A17B — Unsloth Dynamic AWQ (4-bit, mlx-node)

Mixed-precision 4-bit quantization of Qwen/Qwen3.5-397B-A17B for Apple Silicon, using the Unsloth Dynamic quantization strategy via mlx-node.

This model adapts the approach pioneered by @Brooooooklyn for porting Unsloth dynamic quantizations to MLX format. See the Brooooooklyn Qwen3.5 Unsloth MLX collection for the reference implementations that made this possible.

              Original (BF16)            This Model
  Parameters  397B total / 17B active    397B total / 17B active
  Size        ~794 GB                    ~228 GB
  Format      SafeTensors                SafeTensors (31 shards)
  Precision   BF16 uniform               Mixed 4/5/6-bit + BF16
  Reduction                              ~71%
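The ~71% figure is consistent with the sizes in the table above; a quick back-of-the-envelope check:

```python
# Quick consistency check of the sizes in the table above.
total_params = 397e9   # total parameter count
bf16_gb = 794          # BF16 checkpoint size (GB)
quant_gb = 228         # quantized checkpoint size (GB)

# BF16 stores 2 bytes per parameter: 397e9 * 2 / 1e9 = 794 GB
assert round(total_params * 2 / 1e9) == bf16_gb

# Effective average bits per parameter after mixed 4/5/6-bit quantization
avg_bits = quant_gb * 1e9 * 8 / total_params   # ~4.6 bits/param

# Size reduction relative to BF16
reduction = 1 - quant_gb / bf16_gb             # ~0.71 -> ~71%
```

The ~4.6 effective bits per parameter reflects the mixed recipe: most weights at 4-bit, sensitive tensors kept at 5/6-bit or BF16.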

Validation

Structural validation against mlx-community/Qwen3.5-397B-A17B-4bit:

  • All 8 linear attention diagnostic groups PASS (cosine ≥ 0.995)
  • A_log and dt_bias: cosine = 1.000
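The cosine comparison used for these checks can be sketched with NumPy. The tensors below are synthetic stand-ins; the real validation compares this model's dequantized linear-attention tensors against the mlx-community reference:

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two tensors, flattened to 1-D."""
    a = a.ravel().astype(np.float64)
    b = b.ravel().astype(np.float64)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Synthetic stand-ins: a reference tensor and a lightly perturbed copy,
# mimicking the small error a quantize/dequantize round trip introduces.
rng = np.random.default_rng(0)
ref = rng.standard_normal((64, 128))
perturbed = ref + 0.01 * rng.standard_normal(ref.shape)

assert cosine(ref, ref) > 0.999999      # identical tensors score ~1.0
assert cosine(ref, perturbed) >= 0.995  # small noise stays above the threshold
```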

Functional validation:

  • ToolCall-15 benchmark: 26/30 (87%) — A:6/6, B:5/6, C:6/6, D:5/6, E:4/6
  • Coherent text generation, tool calling, and structured output confirmed

Quantization Strategy

This model's quantization recipe is based on the Unsloth team's extensive per-layer KL divergence benchmarking of the Qwen3.5 architecture. Their work — published as Unsloth Dynamic 2.0 — is the most thorough public analysis of how Qwen3.5's hybrid GatedDeltaNet + full attention design responds to quantization, and the foundation for every decision in this model.

Why Qwen3.5 Needs Special Treatment

Qwen3.5 is not a standard transformer. It uses a hybrid architecture: GatedDeltaNet linear attention layers + standard full attention layers (full_attention_interval=4). Standard uniform quantization treats all layers equally, but Unsloth's benchmarks revealed that this architecture has fundamentally different sensitivity profiles across layer types:

  • Quantizing ssm_out (linear_attn.out_proj) at Q2_K "does dramatically worse" — KLD spikes far beyond other components
  • Attention tensors (self_attn.*) are "especially sensitive for hybrid architectures"
  • Attention gates (linear_attn.in_proj_z) — MXFP4 "performs poorly" on these
  • FFN gate/up projections are "generally ok to quantize to 3-bit"
  • FFN down_proj is "slightly more sensitive" than gate/up — benefits from an extra bit
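These findings translate naturally into a tensor-pattern-to-bit-width recipe. The sketch below is illustrative only: the glob patterns and bit choices are assumptions for exposition, not the exact Unsloth Dynamic recipe used for this model.

```python
import fnmatch

# Illustrative recipe: more bits where the KLD benchmarks show high
# sensitivity, fewer where layers tolerate aggressive quantization.
RECIPE = [
    ("*.linear_attn.out_proj*",  6),  # ssm_out: degrades sharply at low bits
    ("*.self_attn.*",            6),  # attention: sensitive in hybrid models
    ("*.linear_attn.in_proj_z*", 6),  # attention gates
    ("*.mlp.*.down_proj*",       5),  # slightly more sensitive than gate/up
    ("*.mlp.*",                  4),  # FFN gate/up: tolerant, default bits
]

def bits_for(tensor_name: str, default: int = 4) -> int:
    """First matching pattern wins; fall back to the default bit width."""
    for pattern, bits in RECIPE:
        if fnmatch.fnmatch(tensor_name, pattern):
            return bits
    return default

print(bits_for("model.layers.3.self_attn.q_proj.weight"))        # 6
print(bits_for("model.layers.0.mlp.experts.7.down_proj.weight")) # 5
```

Ordering the patterns from most to least sensitive means a tensor matching several rules gets the most conservative (highest) bit width.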

GGUF→MLX Conversion: Multi-Level Head Deinterleave Fix

Converting from GGUF to MLX requires undoing the head deinterleave that llama.cpp applies to the linear attention (GatedDeltaNet) tensors. GGUF stores the value-head dimension in a recursively deinterleaved order; for models where n_value_heads / n_key_heads > 2, the deinterleave has depth log₂(n_value_heads / n_key_heads).

For this model: n_value_heads=64, n_key_heads=16 → depth = log₂(64/16) = 2 levels.

The converter applies multi-level reinterleave operations at the correct depth for all linear attention tensors:

  • A_log, dt_bias — 1D per-head tensors
  • in_proj_a, in_proj_b, in_proj_z, out_proj — 2D weight matrices
  • in_proj_qkv and conv1d — V-portion rows only
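A minimal sketch of the depth-2 round trip on the value-head axis. The even/odd split used here is an illustrative formulation of one deinterleave level, not necessarily llama.cpp's exact ordering:

```python
import numpy as np

def deinterleave_once(x: np.ndarray) -> np.ndarray:
    """One deinterleave level: [h0, h1, h2, ...] -> [h0, h2, ..., h1, h3, ...]."""
    return np.concatenate([x[0::2], x[1::2]])

def reinterleave(x: np.ndarray, depth: int) -> np.ndarray:
    """Undo `depth` levels of deinterleave on the leading (head) axis."""
    for _ in range(depth):
        n = x.shape[0]
        out = np.empty_like(x)
        out[0::2] = x[: n // 2]   # first half returns to even positions
        out[1::2] = x[n // 2 :]   # second half returns to odd positions
        x = out
    return x

# depth = log2(n_value_heads / n_key_heads) = log2(64 / 16) = 2
heads = np.arange(64)                                 # stand-in for 64 value heads
stored = deinterleave_once(deinterleave_once(heads))  # GGUF-style ordering
restored = reinterleave(stored, depth=2)
assert np.array_equal(restored, heads)                # round trip recovers the order
```

For the 2D weight matrices the same permutation is applied to the rows belonging to value heads; for in_proj_qkv and conv1d, only the V-portion rows are touched.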

Architecture

Qwen3.5-397B-A17B is a decoder-only MoE transformer with hybrid attention:

  Parameter             Value
  Hidden size           4,096
  Layers                60 (45 linear + 15 full attention)
  Attention heads       16
  KV heads (full attn)  4 (GQA 4:1)
  Head dimension        256 (full attn), 128 (linear attn)
  Linear value heads    64
  Linear key heads      16
  MoE experts           512 per layer, 8 active per token
  Vocab size            248,320
  Max context           262,144 tokens
  Total parameters      ~397B
  Active parameters     ~17B per token

Layer pattern (repeating): [linear, linear, linear, full, ...]

  • Linear attention layers use GatedDeltaNet: depthwise Conv1d + gated delta recurrence (state-space model)
  • Full attention layers use standard grouped-query attention with KV caching
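Given full_attention_interval=4, the per-layer attention type follows directly. A small sketch, assuming the full-attention layer sits last in each group of four (consistent with the repeating pattern shown above):

```python
FULL_ATTENTION_INTERVAL = 4   # from the model config
NUM_LAYERS = 60

# Repeating pattern: three linear-attention layers, then one full-attention layer
layer_types = [
    "full" if (i + 1) % FULL_ATTENTION_INTERVAL == 0 else "linear"
    for i in range(NUM_LAYERS)
]

assert layer_types[:4] == ["linear", "linear", "linear", "full"]
assert layer_types.count("linear") == 45 and layer_types.count("full") == 15
```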

Usage

# With vMLX
vmlx serve jackzampolin/Qwen3.5-397B-A17B-unsloth-mlx-4bit \
  --port 8000 --host 0.0.0.0 --max-tokens 8192 \
  --continuous-batching --enable-prefix-cache \
  --enable-auto-tool-choice --tool-call-parser qwen3

# With mlx-lm
python -m mlx_lm.generate \
  --model jackzampolin/Qwen3.5-397B-A17B-unsloth-mlx-4bit \
  --prompt "Hello, world!"

Target hardware: Apple M4 Ultra with 512 GB unified memory.

How It Was Made

Converted from unsloth/Qwen3.5-397B-A17B-GGUF (BF16 GGUF with imatrix) using mlx-node's GGUF→MLX conversion pipeline:

mlx convert \
  --input Qwen3.5-397B-A17B-BF16-00001-of-00017.gguf \
  --output Qwen3.5-397B-A17B-unsloth-mlx-4bit \
  --dtype bfloat16 --quantize \
  --q-bits 4 --q-group-size 64 --q-recipe unsloth \
  --imatrix-path imatrix_unsloth.gguf_file

Acknowledgments

  • Brooooooklyn (Qwen3.5 Unsloth MLX collection) — Pioneered the approach of porting Unsloth dynamic quantizations to MLX format. Their reference implementations were used to validate tensor correctness of this conversion.
  • Unsloth (GitHub) — The quantization strategy is based on Unsloth's per-layer KL divergence benchmarks and their Dynamic 2.0 methodology. We use their published imatrix GGUF files for AWQ pre-scaling.
  • mlx-node — The GGUF→MLX converter with multi-level head deinterleave support for Qwen3.5 linear attention.
  • Qwen Team — For the Qwen3.5 model family and hybrid attention architecture.
  • Apple MLX — For the Metal-accelerated ML framework.

License

This model inherits the Apache 2.0 license from the base Qwen3.5-397B-A17B model.
