# Qwen3.5-397B-A17B — Unsloth Dynamic AWQ (4-bit, mlx-node)
Mixed-precision 4-bit quantization of Qwen/Qwen3.5-397B-A17B for Apple Silicon, using the Unsloth Dynamic quantization strategy via mlx-node.
This model adapts the approach pioneered by @Brooooooklyn for porting Unsloth dynamic quantizations to MLX format. See the Brooooooklyn Qwen3.5 Unsloth MLX collection for the reference implementations that made this possible.
| | Original (BF16) | This Model |
|---|---|---|
| Parameters | 397B total / 17B active | 397B total / 17B active |
| Size | ~794 GB | ~228 GB |
| Format | SafeTensors | SafeTensors (31 shards) |
| Precision | BF16 uniform | Mixed 4/5/6-bit + BF16 |
| Reduction | — | ~71% |
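The size row implies an effective bit width noticeably above a uniform 4-bit pack, which is consistent with the mixed 4/5/6-bit recipe. A quick back-of-envelope check using only the figures in the table above:

```python
# Sanity-check the size and reduction rows of the table above.
bf16_gb = 794.0   # original BF16 checkpoint
quant_gb = 228.0  # this model

reduction = 1.0 - quant_gb / bf16_gb
avg_bits = 16.0 * quant_gb / bf16_gb  # effective bits per weight

print(f"reduction ~{reduction:.0%}, ~{avg_bits:.1f} bits/weight")
# reduction ~71%, ~4.6 bits/weight
```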
## Validation
Structural validation against mlx-community/Qwen3.5-397B-A17B-4bit:
- All 8 linear attention diagnostic groups PASS (cosine ≥ 0.995)
- A_log and dt_bias: cosine = 1.000
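The structural checks above compare tensors from this conversion against the reference mlx-community 4-bit conversion using cosine similarity, with 0.995 as the pass threshold. A minimal sketch of the metric (function name is ours, not from the validation tooling):

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two tensors, flattened, in float64 for stability."""
    a = np.asarray(a, dtype=np.float64).ravel()
    b = np.asarray(b, dtype=np.float64).ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```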
Functional validation:
- ToolCall-15 benchmark: 26/30 (87%) — A:6/6, B:5/6, C:6/6, D:5/6, E:4/6
- Coherent text generation, tool calling, and structured output confirmed
## Quantization Strategy
This model's quantization recipe is based on the Unsloth team's extensive per-layer KL divergence benchmarking of the Qwen3.5 architecture. Their work — published as Unsloth Dynamic 2.0 — is the most thorough public analysis of how Qwen3.5's hybrid GatedDeltaNet + full attention design responds to quantization, and the foundation for every decision in this model.
### Why Qwen3.5 Needs Special Treatment
Qwen3.5 is not a standard transformer. It uses a hybrid architecture: GatedDeltaNet linear attention layers + standard full attention layers (`full_attention_interval=4`). Standard uniform quantization treats all layers equally, but Unsloth's benchmarks revealed that this architecture has fundamentally different sensitivity profiles across layer types:
- Quantizing `ssm_out` (`linear_attn.out_proj`) at Q2_K "does dramatically worse" — KLD spikes far beyond other components
- Attention tensors (`self_attn.*`) are "especially sensitive for hybrid architectures"
- Attention gates (`linear_attn.in_proj_z`) — MXFP4 "performs poorly" on these
- FFN gate/up projections are "generally ok to quantize to 3-bit"
- FFN `down_proj` is "slightly more sensitive" than gate/up — benefits from an extra bit
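A per-layer recipe like this typically boils down to matching tensor names against patterns and assigning each a bit width. The sketch below is illustrative only: the pattern list and bit assignments reflect the sensitivities listed above, not the exact published Unsloth recipe, which is more fine-grained.

```python
import re

# Illustrative per-tensor bit-width table (NOT the exact Unsloth recipe).
# Most-sensitive tensors get more bits; FFN gate/up tolerate fewer.
RECIPE = [
    (r"linear_attn\.out_proj",   6),  # ssm_out: degrades dramatically at low bits
    (r"self_attn\.",             6),  # full-attention tensors: sensitive in hybrids
    (r"linear_attn\.in_proj_z",  6),  # attention gates: avoid aggressive formats
    (r"\.down_proj",             5),  # slightly more sensitive than gate/up
    (r"\.(gate|up)_proj",        4),  # generally fine at low bits
]

def assign_bits(tensor_name: str, default: int = 4) -> int:
    """Return the bit width for a tensor by first matching pattern."""
    for pattern, bits in RECIPE:
        if re.search(pattern, tensor_name):
            return bits
    return default
```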
## GGUF→MLX Conversion: Multi-Level Head Deinterleave Fix
Converting from GGUF to MLX requires fixing the head deinterleave applied by llama.cpp to the linear attention (GatedDeltaNet) tensors. GGUF stores the value-head dimension in a recursively deinterleaved order. For models where n_value_heads / n_key_heads > 2, the deinterleave has depth log₂(n_value_heads / n_key_heads).
For this model: n_value_heads=64, n_key_heads=16 → depth = log₂(64/16) = 2 levels.
The converter applies multi-level reinterleave operations at the correct depth for all linear attention tensors:
- `A_log`, `dt_bias` — 1D per-head tensors
- `in_proj_a`, `in_proj_b`, `in_proj_z`, `out_proj` — 2D weight matrices
- `in_proj_qkv` and `conv1d` — V-portion rows only
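A recursive deinterleave can be inverted by applying the inverse permutation once per level. The numpy sketch below is conceptual: the even/odd grouping convention is assumed for illustration and may not match llama.cpp's exact head ordering, and both function names are ours.

```python
import numpy as np

def deinterleave_once(w, n_heads):
    """One level: heads [0,1,2,3,...] -> [0,2,4,...,1,3,5,...] (assumed convention)."""
    h = w.reshape(n_heads // 2, 2, -1)
    return h.transpose(1, 0, 2).reshape(n_heads, -1)

def reinterleave(w, n_heads, depth):
    """Invert `depth` stacked deinterleave passes.

    For this model: depth = log2(n_value_heads / n_key_heads) = log2(64 / 16) = 2.
    """
    h = w.reshape(n_heads, -1)
    for _ in range(depth):
        h = h.reshape(2, n_heads // 2, -1).transpose(1, 0, 2).reshape(n_heads, -1)
    return h
```

The key property is the round trip: two deinterleave passes followed by `reinterleave(..., depth=2)` recover the original head order exactly.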
## Architecture
Qwen3.5-397B-A17B is a decoder-only MoE transformer with hybrid attention:
| Parameter | Value |
|---|---|
| Hidden size | 4,096 |
| Layers | 60 (45 linear + 15 full attention) |
| Attention heads | 16 |
| KV heads (full attn) | 4 (GQA 4:1) |
| Head dimension | 256 (full attn), 128 (linear attn) |
| Linear value heads | 64 |
| Linear key heads | 16 |
| MoE experts | 512 per layer, 8 active per token |
| Vocab size | 248,320 |
| Max context | 262,144 tokens |
| Total parameters | ~397B |
| Active parameters | ~17B per token |
Layer pattern (repeating): `[linear, linear, linear, full, ...]`
- Linear attention layers use GatedDeltaNet: depthwise Conv1d + gated delta recurrence (state-space model)
- Full attention layers use standard grouped-query attention with KV caching
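The repeating schedule above follows directly from `full_attention_interval=4`: every fourth layer is full attention, the rest are linear. A small sketch generating it (the full layer's offset within each group of four is taken from the pattern shown above):

```python
# Reconstruct the layer schedule: 60 layers, every 4th one full attention.
FULL_ATTENTION_INTERVAL = 4
NUM_LAYERS = 60

layer_types = [
    "full" if (i + 1) % FULL_ATTENTION_INTERVAL == 0 else "linear"
    for i in range(NUM_LAYERS)
]
# Yields 45 linear + 15 full attention layers, matching the table above.
```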
## Usage
```bash
# With vMLX
vmlx serve jackzampolin/Qwen3.5-397B-A17B-unsloth-mlx-4bit \
  --port 8000 --host 0.0.0.0 --max-tokens 8192 \
  --continuous-batching --enable-prefix-cache \
  --enable-auto-tool-choice --tool-call-parser qwen3
```

```bash
# With mlx-lm
python -m mlx_lm.generate \
  --model jackzampolin/Qwen3.5-397B-A17B-unsloth-mlx-4bit \
  --prompt "Hello, world!"
```
Target hardware: Apple M4 Ultra with 512 GB unified memory.
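One reason the 512 GB target is comfortable: only the 15 full-attention layers keep a KV cache, since the linear attention layers carry fixed-size recurrent state instead. A rough worst-case estimate at full context, assuming a BF16 cache (the cache dtype is an assumption, not stated above):

```python
# Worst-case KV cache for the 15 full-attention layers at max context,
# using the architecture table above (GQA: 4 KV heads, head_dim=256).
full_layers = 15
kv_heads = 4
head_dim = 256
context = 262_144
bytes_per_elem = 2  # bfloat16 (assumed cache dtype)

kv_bytes = full_layers * 2 * kv_heads * head_dim * context * bytes_per_elem  # 2 = K and V
print(f"{kv_bytes / 1e9:.1f} GB")  # 16.1 GB
```

So even at the full 262,144-token context, weights plus cache stay around 244 GB, leaving ample headroom on a 512 GB machine.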
## How It Was Made
Converted from unsloth/Qwen3.5-397B-A17B-GGUF (BF16 GGUF with imatrix) using mlx-node's GGUF→MLX conversion pipeline:
```bash
mlx convert \
  --input Qwen3.5-397B-A17B-BF16-00001-of-00017.gguf \
  --output Qwen3.5-397B-A17B-unsloth-mlx-4bit \
  --dtype bfloat16 --quantize \
  --q-bits 4 --q-group-size 64 --q-recipe unsloth \
  --imatrix-path imatrix_unsloth.gguf_file
```
## Acknowledgments
- Brooooooklyn (Qwen3.5 Unsloth MLX collection) — Pioneered the approach of porting Unsloth dynamic quantizations to MLX format. Their reference implementations were used to validate tensor correctness of this conversion.
- Unsloth (GitHub) — The quantization strategy is based on Unsloth's per-layer KL divergence benchmarks and their Dynamic 2.0 methodology. We use their published imatrix GGUF files for AWQ pre-scaling.
- mlx-node — The GGUF→MLX converter with multi-level head deinterleave support for Qwen3.5 linear attention.
- Qwen Team — For the Qwen3.5 model family and hybrid attention architecture.
- Apple MLX — For the Metal-accelerated ML framework.
## License
This model inherits the Apache 2.0 license from the base Qwen3.5-397B-A17B model.