Qwen3.5-9B — Unsloth Dynamic AWQ (3-bit, mlx-node)

Mixed-precision 3/4/5/6-bit quantization of Qwen/Qwen3.5-9B for Apple Silicon, using the Unsloth Dynamic quantization strategy via mlx-node.

| | Original (BF16) | This Model |
|---|---|---|
| Parameters | 9,653,104,368 | 9,653,104,368 |
| Size | 18 GB | 6.4 GB |
| Format | SafeTensors (4 shards) | SafeTensors (single file, 1100 tensors) |
| Precision | BF16 uniform | Mixed 3/4/5/6-bit + BF16 |
| Reduction | — | 64% |

Performance

Tested on Apple Silicon M3 Max 128GB with mlx-node:

| Model | Size | Decode (tok/s) | Speedup |
|---|---|---|---|
| BF16 (unquantized) | 18 GB | 20.5–21.0 | baseline |
| This model (Unsloth, 3-bit base) | 6.4 GB | 54.1–54.6 | ~2.6x faster |

Decode is memory-bandwidth bound on Apple Silicon — fewer bytes to transfer per token directly translates to higher throughput. Embeddings and lm_head stay quantized in memory (5/6-bit) and use quantized_matmul on forward — no dequantize-at-load overhead. Attention q/k/v and SSM input projections are quantized at 5-bit with imatrix AWQ pre-scaling for near-lossless quality. Attention o_proj and SSM out_proj are kept at bf16 (no preceding norm for AWQ correction).
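A rough sanity check on why the speedup tracks the size reduction: on a bandwidth-bound decode path, throughput is approximately memory bandwidth divided by the bytes read per token. The ~400 GB/s effective bandwidth figure below is an illustrative assumption, not a measured value:

```typescript
// Rough decode-throughput estimate for a memory-bandwidth-bound model.
// Assumes every weight byte is read once per generated token.
function estimateDecodeTokS(modelSizeGB: number, bandwidthGBs: number): number {
  return bandwidthGBs / modelSizeGB;
}

// Assumed ~400 GB/s effective bandwidth (illustrative, not measured).
const bf16TokS = estimateDecodeTokS(18, 400);   // ≈ 22 tok/s
const quantTokS = estimateDecodeTokS(6.4, 400); // ≈ 62 tok/s
const speedup = quantTokS / bf16TokS;           // equals the size ratio, ≈ 2.8x
```

Under this simple model the predicted speedup is exactly the size ratio (18 / 6.4 ≈ 2.8x), close to the ~2.6x measured above; the small gap is expected since decode is not perfectly bandwidth-bound.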

Quantization Strategy

This model's quantization recipe is based on the Unsloth team's extensive per-layer KL divergence benchmarking of the Qwen3.5 architecture. Their work — published as Unsloth Dynamic 2.0 — is the most thorough public analysis of how Qwen3.5's hybrid GatedDeltaNet + full attention design responds to quantization, and the foundation for every decision in this model.

Why Qwen3.5 Needs Special Treatment

Qwen3.5 is not a standard transformer. It uses a hybrid architecture: 24 GatedDeltaNet linear attention layers + 8 standard full attention layers (full_attention_interval=4). Standard uniform quantization treats all layers equally, but Unsloth's benchmarks revealed that this architecture has fundamentally different sensitivity profiles across layer types:

  • Quantizing ssm_out (linear_attn.out_proj) at Q2_K "does dramatically worse" — KLD spikes far beyond other components
  • Attention tensors (self_attn.*) are "especially sensitive for hybrid architectures" — more so than in pure-attention models like LLaMA
  • Attention gates (linear_attn.in_proj_z) — MXFP4 "performs poorly" on these
  • FFN gate/up projections are "generally ok to quantize to 3-bit" — the only layers that tolerate aggressive compression well
  • FFN down_proj is "slightly more sensitive" than gate/up — benefits from an extra bit

The key insight from Unsloth's work: it's better to quantize sensitive layers at higher bits (5-bit with imatrix) and aggressively quantize the rest (3-bit), than to uniformly quantize everything at a middling bit-width. Their Dynamic models consistently sit on the Pareto frontier for 99.9% KL divergence vs model size, outperforming uniform quantization at every size point.
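A quick back-of-envelope check shows what "3-bit base with higher-bit sensitive layers" works out to on average (assuming 1 GB = 1e9 bytes and ignoring quantization scale/bias metadata):

```typescript
// Average bits per weight implied by the file size: the 3-bit base is pulled
// up toward ~5.3 bits by the 5/6-bit and BF16 layers the recipe protects.
const params = 9_653_104_368;
const sizeBytes = 6.4e9; // assumes 1 GB = 1e9 bytes
const bitsPerWeight = (sizeBytes * 8) / params; // ≈ 5.3 bits on average
```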

Unsloth imatrix: The Calibration Foundation

The second pillar of Unsloth's approach is their importance matrix (imatrix) — per-channel calibration data that tells the quantizer which channels within each tensor carry the most information.

Standard imatrix calibrations (used by most GGUF quantizers) run the model on Wikipedia-512 — short encyclopedia passages. Unsloth instead calibrates on long-context chat, coding, and tool-calling data, which better represents how these models are actually used. From Unsloth's findings:

  • "Imatrix definitely helps reduce KLD & PPL" across all bit-widths
  • "Imatrix generally helps on lower bits, and works on all quants and bit widths"
  • SSM output at 2-bits was "really bad" without imatrix, but imatrix "reduces the 99.9% KLD by a lot"
  • Trade-off: I-quants make "inference 5-10% slower", but the quality gain is substantial

When an imatrix is provided to mlx-node's conversion pipeline, it applies AWQ-style channel pre-scaling before quantization: important input channels (high activation magnitude) are amplified to make them more quantization-resistant, while less important channels are shrunk. The inverse scales are fused into preceding layer norms, so there is zero inference overhead — the quality improvement is free at runtime.
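The "zero inference overhead" claim rests on a simple algebraic identity: scaling the columns of a weight matrix by per-channel factors s while dividing the preceding norm gain by s leaves the layer output unchanged. A toy sketch with made-up numbers (not the real imatrix pipeline):

```typescript
// AWQ channel pre-scaling sketch: W columns scaled by s, preceding norm gain
// divided by s. The output is mathematically identical, but "important"
// channels of W become larger in magnitude and thus quantize more accurately.
type Vec = number[];
type Mat = number[][]; // row-major

const matVec = (W: Mat, x: Vec): Vec =>
  W.map(row => row.reduce((acc, w, j) => acc + w * x[j], 0));

const W: Mat = [[0.2, -1.1, 0.5], [0.9, 0.3, -0.7]]; // toy weights
const gain: Vec = [1.0, 0.8, 1.2];                   // toy RMSNorm gain
const s: Vec = [2.0, 0.5, 1.5];                      // toy importance scales
const xHat: Vec = [0.3, -0.6, 0.9];                  // toy normalized input

// Original path: y = W (gain ⊙ xHat)
const y = matVec(W, xHat.map((v, j) => v * gain[j]));

// Fused path: columns of W scaled by s, norm gain divided by s
const Ws: Mat = W.map(row => row.map((w, j) => w * s[j]));
const gainOverS: Vec = gain.map((g, j) => g / s[j]);
const yFused = matVec(Ws, xHat.map((v, j) => v * gainOverS[j]));
// y and yFused agree to floating-point precision
```

This is why the correction requires a preceding norm to fuse into: layers whose inputs come straight from attention or GDN computation have nowhere to absorb the inverse scales.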

Per-Layer Decisions

Based on Unsloth's per-tensor 99.9% KLD analysis (sorted by sensitivity, worst → best):

| Component | Precision | Count | Unsloth Finding (99.9% KLD) |
|---|---|---|---|
| self_attn.{q,k,v}_proj | 5-bit affine (gs=64) + AWQ | 24 tensors | KLD ~1.5–2.9 — "Especially sensitive for hybrid architectures"; AWQ-corrected via input_layernorm |
| self_attn.o_proj | BF16 (skip) | 8 tensors | KLD ~1.5 — no preceding norm for AWQ correction |
| linear_attn.in_proj_qkv | 5-bit affine (gs=64) + AWQ | 24 tensors | KLD ~2.9 — SSM input projection; AWQ-corrected via input_layernorm |
| linear_attn.in_proj_z | 5-bit affine (gs=64) + AWQ | 24 tensors | KLD ~1.5 — "Performs poorly with MXFP4"; AWQ-corrected via input_layernorm |
| linear_attn.out_proj | BF16 (skip) | 24 tensors | KLD ~6.0 at q2_k — worst tensor; no preceding norm for AWQ correction |
| linear_attn.A_log | BF16 (skip) | 24 tensors | State-space dynamics — not quantizable |
| linear_attn.conv1d | BF16 (skip) | 24 tensors | KLD ~0.05 — too small to quantize meaningfully |
| linear_attn.in_proj_{a,b} | BF16 (skip) | 48 tensors | Low-rank projections — too small |
| mlp.down_proj | 4-bit affine (gs=64) | 32 tensors | "Slightly more sensitive" than gate/up |
| mlp.gate_proj | 3-bit affine (gs=64) | 32 tensors | "Generally ok to quantize to 3-bit" |
| mlp.up_proj | 3-bit affine (gs=64) | 32 tensors | "Generally ok to quantize to 3-bit" |
| embed_tokens | 5-bit affine (gs=64) | 1 tensor | KLD ~0.15 at q5_k — among least sensitive |
| lm_head | 6-bit affine (gs=64) | 1 tensor | KLD ~0.05 at q5_k — safest tensor to quantize |
| Norms | BF16 (skip) | ~130 tensors | Never quantized (standard practice) |

AWQ-correctable projections (q/k/v, in_proj_qkv/z) are quantized at 5-bit with imatrix AWQ pre-scaling via input_layernorm. Non-AWQ-correctable projections (o_proj, out_proj) are kept at BF16 — their inputs come from attention/GDN computation rather than a norm layer, so AWQ cannot be applied. An imatrix is required for the unsloth recipe.
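The per-layer decisions above amount to a classification over tensor names. A sketch of such a recipe predicate is below; the tensor-name patterns and return shape are illustrative assumptions, and the real mlx-node predicate may differ:

```typescript
// Illustrative recipe predicate in the spirit of the table above.
// Returns a bit-width (with an AWQ flag) or 'bf16' for skipped tensors.
type Precision = { bits: number; awq: boolean } | 'bf16';

function unslothRecipe(name: string): Precision {
  if (/self_attn\.[qkv]_proj/.test(name)) return { bits: 5, awq: true };
  if (/linear_attn\.in_proj_(qkv|z)\b/.test(name)) return { bits: 5, awq: true };
  if (/self_attn\.o_proj|linear_attn\.out_proj/.test(name)) return 'bf16';
  if (/linear_attn\.(A_log|conv1d|in_proj_[ab]\b)/.test(name)) return 'bf16';
  if (/mlp\.down_proj/.test(name)) return { bits: 4, awq: true };
  if (/mlp\.(gate|up)_proj/.test(name)) return { bits: 3, awq: true };
  if (/embed_tokens/.test(name)) return { bits: 5, awq: false };
  if (/lm_head/.test(name)) return { bits: 6, awq: false };
  return 'bf16'; // norms and everything else
}
```

Note the ordering matters: in_proj_qkv/z must be matched before the in_proj_a/b skip rule, and out_proj before any broader linear_attn pattern.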

Comparison with Unsloth GGUF (UD-Q3_K_XL)

| Tensor | Unsloth UD-Q3_K_XL | Ours | Gap |
|---|---|---|---|
| attn q/k/v | Q5_K + imatrix | 5-bit affine + AWQ | Small (AWQ compensates) |
| in_proj_qkv/z | Q5_K + imatrix | 5-bit affine + AWQ | Small |
| o_proj | Q5_K + imatrix | bf16 | We're larger but lossless |
| out_proj | Q5_K + imatrix | bf16 | We're larger but lossless |
| FFN gate/up | Q3_K + imatrix | 3-bit affine + AWQ | Moderate (K-quant > affine at 3-bit) |
| FFN down | Q4_K + imatrix | 4-bit affine + AWQ | Small |

Architecture

Qwen3.5-9B is a decoder-only transformer with a hybrid attention design:

| Parameter | Value |
|---|---|
| Hidden size | 4,096 |
| Layers | 32 (24 linear + 8 full attention) |
| Attention heads | 16 (4 KV heads, GQA 4:1) |
| Head dimension | 256 |
| Intermediate size | 12,288 |
| Vocab size | 248,320 |
| Max context | 262,144 tokens |
| RoPE | M-RoPE with mrope_section=[11, 11, 10], theta=10M |
| Activation | SiLU |

Layer pattern (repeating): [linear, linear, linear, full, linear, linear, linear, full, ...]

  • Linear attention layers use GatedDeltaNet: depthwise Conv1d + gated delta recurrence (state-space model)
  • Full attention layers use standard grouped-query attention with KV caching
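The repeating pattern follows directly from full_attention_interval=4: every fourth layer is full attention, the rest are GatedDeltaNet. A minimal sketch of the schedule:

```typescript
// Layer-type schedule implied by full_attention_interval = 4:
// layers 4, 8, 12, ... (1-indexed) are full attention, the rest linear.
const FULL_ATTENTION_INTERVAL = 4;
const NUM_LAYERS = 32;

const layerTypes = Array.from({ length: NUM_LAYERS }, (_, i) =>
  (i + 1) % FULL_ATTENTION_INTERVAL === 0 ? 'full' : 'linear',
);

const fullCount = layerTypes.filter(t => t === 'full').length; // 8
const linearCount = NUM_LAYERS - fullCount;                    // 24
```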

Usage

With mlx-node (TypeScript/JavaScript)

```typescript
import { createToolDefinition, loadModel } from '@mlx-node/lm';

const model = await loadModel('./qwen3.5-9B-unsloth');

// Chat (single-shot)
const result = await model.chat(
  [{ role: 'user', content: 'Explain the hybrid attention mechanism in Qwen3.5.' }],
  { maxNewTokens: 2048, temperature: 0.6, enableThinking: false },
);
console.log(result.text);

// Streaming (AsyncGenerator)
for await (const event of model.chatStream(
  [{ role: 'user', content: 'Write a haiku about coding.' }],
  { maxNewTokens: 512, temperature: 0.7 },
)) {
  if (!event.done) {
    process.stdout.write(event.text);
  } else {
    console.log('\nTokens:', event.numTokens);
  }
}

// Tool calling
const tools = [
  createToolDefinition(
    'get_weather',
    'Get weather for a location',
    { location: { type: 'string', description: 'City name' } },
    ['location'],
  ),
];

const toolResult = await model.chat(
  [{ role: 'user', content: 'What is the weather in Tokyo?' }],
  { tools, maxNewTokens: 2048 },
);
for (const call of toolResult.toolCalls) {
  console.log(call.name, call.arguments);
}
```

How It Was Made

Converted from Qwen/Qwen3.5-9B official SafeTensors using mlx-node's conversion pipeline:

```sh
mlx convert \
  -i .cache/models/qwen3.5-9B \
  -o .cache/models/qwen3.5-9B-unsloth \
  -q --q-recipe unsloth \
  --imatrix-path imatrix_unsloth.gguf
```

The --q-recipe unsloth flag applies the differential quantization strategy described above. The recipe defaults to 3-bit base (override with --q-bits). The --imatrix-path is required for the unsloth recipe — it applies AWQ-style channel pre-scaling before quantization using Unsloth's importance matrix. The conversion pipeline:

  1. Loads BF16 SafeTensors/GGUF weights via mmap (near-instant)
  2. Applies Qwen3.5-specific weight sanitization (norm +1.0 shift, dtype handling)
  3. Applies imatrix AWQ pre-scaling: important input channels are amplified (more quantization-resistant) while less important channels are shrunk, with inverse scales fused into preceding layer norms
  4. Runs the Unsloth recipe predicate to classify each tensor
  5. Quantizes attn q/k/v + SSM in_proj to 5-bit (AWQ-corrected), MLP gate/up to 3-bit, down to 4-bit, embed to 5-bit, lm_head to 6-bit
  6. Skips o_proj, out_proj, norms, A_log, conv1d, and low-rank projections (kept BF16)
  7. Writes single-file SafeTensors with per-layer quantization metadata in config.json
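Step 5's "n-bit affine (gs=64)" quantization can be sketched as follows. This is a minimal round-trip illustration of group-wise affine quantization, not the real mlx-node Metal kernels:

```typescript
// Minimal affine quantization with group size 64: each group of 64
// consecutive weights gets its own scale and minimum (zero offset).
function quantizeGroup(w: number[], bits: number): { q: number[]; scale: number; lo: number } {
  const levels = (1 << bits) - 1;        // e.g. 15 codes for 4-bit
  const lo = Math.min(...w);
  const hi = Math.max(...w);
  const scale = (hi - lo) / levels || 1; // avoid divide-by-zero on flat groups
  const q = w.map(v => Math.round((v - lo) / scale));
  return { q, scale, lo };
}

function dequantizeGroup(q: number[], scale: number, lo: number): number[] {
  return q.map(v => v * scale + lo);
}

// Round-trip error is bounded by half a quantization step per weight.
const group = Array.from({ length: 64 }, (_, i) => Math.sin(i * 0.37));
const { q, scale, lo } = quantizeGroup(group, 4);
const back = dequantizeGroup(q, scale, lo);
const maxErr = Math.max(...group.map((v, i) => Math.abs(v - back[i])));
```

Smaller groups (gs=64 rather than, say, 256) mean more scale metadata but a tighter range per group, and therefore a smaller quantization step for the same bit-width.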

Unsloth's imatrix uses long-context chat, coding, and tool-calling calibration data rather than standard Wikipedia-512 contexts. From Unsloth's findings: imatrix "definitely helps reduce KLD & PPL" across all bit-widths, and is especially impactful at lower bits (3-bit and below).

Files

| File | Size | Description |
|---|---|---|
| model.safetensors | 6.4 GB | Mixed-precision model weights |
| config.json | 30 KB | Model config + per-layer quantization overrides |
| tokenizer.json | 12 MB | HuggingFace tokenizer (248K vocab) |
| tokenizer_config.json | 16 KB | Tokenizer settings + Jinja2 chat template |
| vocab.json | 6.4 MB | Vocabulary mapping |
| merges.txt | 3.2 MB | BPE merges |

Chat Template

The official Qwen3.5 chat template is preserved unmodified, supporting:

  • Multi-turn conversation
  • System messages
  • Tool calling (<tool_call> / </tool_call> tags)
  • Chain-of-thought reasoning (<think> / </think> tags)
  • Image/video content placeholders (for VLM variants)

Template compatibility fix: The official Qwen3.5 template uses raise_exception() for input validation (8 call sites), which is not a built-in function in most Jinja2-compatible renderers. Unsloth identified and fixed chat template issues affecting tool-calling across all Qwen3.5 variants. mlx-node takes a complementary approach — rather than patching the template, we register raise_exception as a native function in our minijinja renderer, so the official template works as-is without modification.

Acknowledgments

  • Unsloth (GitHub) — The quantization strategy in this model is directly based on Unsloth's per-layer KL divergence benchmarks and their Dynamic 2.0 quantization methodology. Their work on imatrix calibration with long-context chat and tool-calling data, and their systematic analysis of layer sensitivity in hybrid GatedDeltaNet architectures, made this recipe possible. We also use their published imatrix GGUF files for AWQ pre-scaling when converting from GGUF sources.
  • Qwen Team — For the Qwen3.5 model family and the hybrid attention architecture
  • Apple MLX — For the Metal-accelerated ML framework powering inference

License

This model inherits the Apache 2.0 license from the base Qwen3.5-9B model.
