Qwen3.5-9B — Unsloth Dynamic AWQ (3-bit, mlx-node)

Mixed-precision 3/4/5/6-bit quantization of Qwen/Qwen3.5-9B for Apple Silicon, using the Unsloth Dynamic quantization strategy via mlx-node.

| | Original (BF16) | This Model |
|---|---|---|
| Parameters | 9,653,104,368 | 9,653,104,368 |
| Size | 18 GB | 6.4 GB |
| Format | SafeTensors (4 shards) | SafeTensors (single file, 1100 tensors) |
| Precision | BF16 uniform | Mixed 3/4/5/6-bit + BF16 |
| Reduction | — | 64% |

Performance

Tested on Apple Silicon M3 Max 128GB with mlx-node:

| Model | Size | Decode (tok/s) | Speedup |
|---|---|---|---|
| BF16 (unquantized) | 18 GB | 20.5–21.0 | baseline |
| This model (Unsloth, 3-bit base) | 6.4 GB | 54.1–54.6 | ~2.6x faster |

Decode is memory-bandwidth bound on Apple Silicon — fewer bytes to transfer per token directly translates to higher throughput. Embeddings and lm_head stay quantized in memory (5/6-bit) and use quantized_matmul on forward — no dequantize-at-load overhead. Attention q/k/v and SSM input projections are quantized at 5-bit with imatrix AWQ pre-scaling for near-lossless quality. Attention o_proj and SSM out_proj are kept at bf16 (no preceding norm for AWQ correction).
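A rough sanity check on why the speedup tracks the size reduction: on a bandwidth-bound decode path, throughput is approximately memory bandwidth divided by the bytes read per token. The ~400 GB/s effective bandwidth figure below is an illustrative assumption, not a measured value:

```typescript
// Rough decode-throughput estimate for a memory-bandwidth-bound model.
// Assumes every weight byte is read once per generated token.
function estimateDecodeTokS(modelSizeGB: number, bandwidthGBs: number): number {
  return bandwidthGBs / modelSizeGB;
}

// Assumed ~400 GB/s effective bandwidth (illustrative, not measured).
const bf16TokS = estimateDecodeTokS(18, 400);   // ≈ 22 tok/s
const quantTokS = estimateDecodeTokS(6.4, 400); // ≈ 62 tok/s
const speedup = quantTokS / bf16TokS;           // equals the size ratio, ≈ 2.8x
```

Under this simple model the predicted speedup is exactly the size ratio (18 / 6.4 ≈ 2.8x), close to the ~2.6x measured above; the small gap is expected since decode is not perfectly bandwidth-bound.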

Quantization Strategy

This model's quantization recipe is based on the Unsloth team's extensive per-layer KL divergence benchmarking of the Qwen3.5 architecture. Their work — published as Unsloth Dynamic 2.0 — is the most thorough public analysis of how Qwen3.5's hybrid GatedDeltaNet + full attention design responds to quantization, and the foundation for every decision in this model.

Why Qwen3.5 Needs Special Treatment

Qwen3.5 is not a standard transformer. It uses a hybrid architecture: 24 GatedDeltaNet linear attention layers + 8 standard full attention layers (full_attention_interval=4). Standard uniform quantization treats all layers equally, but Unsloth's benchmarks revealed that this architecture has fundamentally different sensitivity profiles across layer types:

  • Quantizing ssm_out (linear_attn.out_proj) at Q2_K "does dramatically worse" — KLD spikes far beyond other components
  • Attention tensors (self_attn.*) are "especially sensitive for hybrid architectures" — more so than in pure-attention models like LLaMA
  • Attention gates (linear_attn.in_proj_z) — MXFP4 "performs poorly" on these
  • FFN gate/up projections are "generally ok to quantize to 3-bit" — the only layers that tolerate aggressive compression well
  • FFN down_proj is "slightly more sensitive" than gate/up — benefits from an extra bit

The key insight from Unsloth's work: it's better to quantize sensitive layers at higher bits (5-bit with imatrix) and aggressively quantize the rest (3-bit), than to uniformly quantize everything at a middling bit-width. Their Dynamic models consistently sit on the Pareto frontier for 99.9% KL divergence vs model size, outperforming uniform quantization at every size point.
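A quick back-of-envelope check shows what "3-bit base with higher-bit sensitive layers" works out to on average (assuming 1 GB = 1e9 bytes and ignoring quantization scale/bias metadata):

```typescript
// Average bits per weight implied by the file size: the 3-bit base is pulled
// up toward ~5.3 bits by the 5/6-bit and BF16 layers the recipe protects.
const params = 9_653_104_368;
const sizeBytes = 6.4e9; // assumes 1 GB = 1e9 bytes
const bitsPerWeight = (sizeBytes * 8) / params; // ≈ 5.3 bits on average
```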

Unsloth imatrix: The Calibration Foundation

The second pillar of Unsloth's approach is their importance matrix (imatrix) — per-channel calibration data that tells the quantizer which channels within each tensor carry the most information.

Standard imatrix calibrations (used by most GGUF quantizers) run the model on Wikipedia-512 — short encyclopedia passages. Unsloth instead calibrates on long-context chat, coding, and tool-calling data, which better represents how these models are actually used. From Unsloth's findings:

  • "Imatrix definitely helps reduce KLD & PPL" across all bit-widths
  • "Imatrix generally helps on lower bits, and works on all quants and bit widths"
  • SSM output at 2-bits was "really bad" without imatrix, but imatrix "reduces the 99.9% KLD by a lot"
  • Trade-off: I-quants make "inference 5-10% slower", but the quality gain is substantial

When an imatrix is provided to mlx-node's conversion pipeline, it applies AWQ-style channel pre-scaling before quantization: important input channels (high activation magnitude) are amplified to make them more quantization-resistant, while less important channels are shrunk. The inverse scales are fused into preceding layer norms, so there is zero inference overhead — the quality improvement is free at runtime.
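The "zero inference overhead" claim rests on a simple algebraic identity: scaling the columns of a weight matrix by per-channel factors s while dividing the preceding norm gain by s leaves the layer output unchanged. A toy sketch with made-up numbers (not the real imatrix pipeline):

```typescript
// AWQ channel pre-scaling sketch: W columns scaled by s, preceding norm gain
// divided by s. The output is mathematically identical, but "important"
// channels of W become larger in magnitude and thus quantize more accurately.
type Vec = number[];
type Mat = number[][]; // row-major

const matVec = (W: Mat, x: Vec): Vec =>
  W.map(row => row.reduce((acc, w, j) => acc + w * x[j], 0));

const W: Mat = [[0.2, -1.1, 0.5], [0.9, 0.3, -0.7]]; // toy weights
const gain: Vec = [1.0, 0.8, 1.2];                   // toy RMSNorm gain
const s: Vec = [2.0, 0.5, 1.5];                      // toy importance scales
const xHat: Vec = [0.3, -0.6, 0.9];                  // toy normalized input

// Original path: y = W (gain ⊙ xHat)
const y = matVec(W, xHat.map((v, j) => v * gain[j]));

// Fused path: columns of W scaled by s, norm gain divided by s
const Ws: Mat = W.map(row => row.map((w, j) => w * s[j]));
const gainOverS: Vec = gain.map((g, j) => g / s[j]);
const yFused = matVec(Ws, xHat.map((v, j) => v * gainOverS[j]));
// y and yFused agree to floating-point precision
```

This is why the correction requires a preceding norm to fuse into: layers whose inputs come straight from attention or GDN computation have nowhere to absorb the inverse scales.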

Per-Layer Decisions

Based on Unsloth's per-tensor 99.9% KLD analysis (sorted by sensitivity, worst → best):

| Component | Precision | Count | Unsloth Finding (99.9% KLD) |
|---|---|---|---|
| self_attn.{q,k,v}_proj | 5-bit affine (gs=64) + AWQ | 24 tensors | KLD ~1.5–2.9 — "Especially sensitive for hybrid architectures"; AWQ-corrected via input_layernorm |
| self_attn.o_proj | BF16 (skip) | 8 tensors | KLD ~1.5 — no preceding norm for AWQ correction |
| linear_attn.in_proj_qkv | 5-bit affine (gs=64) + AWQ | 24 tensors | KLD ~2.9 — SSM input projection; AWQ-corrected via input_layernorm |
| linear_attn.in_proj_z | 5-bit affine (gs=64) + AWQ | 24 tensors | KLD ~1.5 — "Performs poorly with MXFP4"; AWQ-corrected via input_layernorm |
| linear_attn.out_proj | BF16 (skip) | 24 tensors | KLD ~6.0 at q2_k — worst tensor; no preceding norm for AWQ correction |
| linear_attn.A_log | BF16 (skip) | 24 tensors | State-space dynamics — not quantizable |
| linear_attn.conv1d | BF16 (skip) | 24 tensors | KLD ~0.05 — too small to quantize meaningfully |
| linear_attn.in_proj_{a,b} | BF16 (skip) | 48 tensors | Low-rank projections — too small |
| mlp.down_proj | 4-bit affine (gs=64) | 32 tensors | "Slightly more sensitive" than gate/up |
| mlp.gate_proj | 3-bit affine (gs=64) | 32 tensors | "Generally ok to quantize to 3-bit" |
| mlp.up_proj | 3-bit affine (gs=64) | 32 tensors | "Generally ok to quantize to 3-bit" |
| embed_tokens | 5-bit affine (gs=64) | 1 tensor | KLD ~0.15 at q5_k — among least sensitive |
| lm_head | 6-bit affine (gs=64) | 1 tensor | KLD ~0.05 at q5_k — safest tensor to quantize |
| Norms | BF16 (skip) | ~130 tensors | Never quantized (standard practice) |

AWQ-correctable projections (q/k/v, in_proj_qkv/z) are quantized at 5-bit with imatrix AWQ pre-scaling via input_layernorm. Non-AWQ-correctable projections (o_proj, out_proj) are kept at BF16 — their inputs come from attention/GDN computation rather than a norm layer, so AWQ cannot be applied. An imatrix is required for the unsloth recipe.
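The per-layer decisions above amount to a classification over tensor names. A sketch of such a recipe predicate is below; the tensor-name patterns and return shape are illustrative assumptions, and the real mlx-node predicate may differ:

```typescript
// Illustrative recipe predicate in the spirit of the table above.
// Returns a bit-width (with an AWQ flag) or 'bf16' for skipped tensors.
type Precision = { bits: number; awq: boolean } | 'bf16';

function unslothRecipe(name: string): Precision {
  if (/self_attn\.[qkv]_proj/.test(name)) return { bits: 5, awq: true };
  if (/linear_attn\.in_proj_(qkv|z)\b/.test(name)) return { bits: 5, awq: true };
  if (/self_attn\.o_proj|linear_attn\.out_proj/.test(name)) return 'bf16';
  if (/linear_attn\.(A_log|conv1d|in_proj_[ab]\b)/.test(name)) return 'bf16';
  if (/mlp\.down_proj/.test(name)) return { bits: 4, awq: true };
  if (/mlp\.(gate|up)_proj/.test(name)) return { bits: 3, awq: true };
  if (/embed_tokens/.test(name)) return { bits: 5, awq: false };
  if (/lm_head/.test(name)) return { bits: 6, awq: false };
  return 'bf16'; // norms and everything else
}
```

Note the ordering matters: in_proj_qkv/z must be matched before the in_proj_a/b skip rule, and out_proj before any broader linear_attn pattern.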

Comparison with Unsloth GGUF (UD-Q3_K_XL)

| Tensor | Unsloth UD-Q3_K_XL | Ours | Gap |
|---|---|---|---|
| attn q/k/v | Q5_K + imatrix | 5-bit affine + AWQ | Small (AWQ compensates) |
| in_proj_qkv/z | Q5_K + imatrix | 5-bit affine + AWQ | Small |
| o_proj | Q5_K + imatrix | bf16 | We're larger but lossless |
| out_proj | Q5_K + imatrix | bf16 | We're larger but lossless |
| FFN gate/up | Q3_K + imatrix | 3-bit affine + AWQ | Moderate (K-quant > affine at 3-bit) |
| FFN down | Q4_K + imatrix | 4-bit affine + AWQ | Small |

Architecture

Qwen3.5-9B is a decoder-only transformer with a hybrid attention design:

| Parameter | Value |
|---|---|
| Hidden size | 4,096 |
| Layers | 32 (24 linear + 8 full attention) |
| Attention heads | 16 (4 KV heads, GQA 4:1) |
| Head dimension | 256 |
| Intermediate size | 12,288 |
| Vocab size | 248,320 |
| Max context | 262,144 tokens |
| RoPE | M-RoPE with mrope_section=[11, 11, 10], theta=10M |
| Activation | SiLU |

Layer pattern (repeating): [linear, linear, linear, full, linear, linear, linear, full, ...]

  • Linear attention layers use GatedDeltaNet: depthwise Conv1d + gated delta recurrence (state-space model)
  • Full attention layers use standard grouped-query attention with KV caching
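The repeating pattern follows directly from full_attention_interval=4: every fourth layer is full attention, the rest are GatedDeltaNet. A minimal sketch of the schedule:

```typescript
// Layer-type schedule implied by full_attention_interval = 4:
// layers 4, 8, 12, ... (1-indexed) are full attention, the rest linear.
const FULL_ATTENTION_INTERVAL = 4;
const NUM_LAYERS = 32;

const layerTypes = Array.from({ length: NUM_LAYERS }, (_, i) =>
  (i + 1) % FULL_ATTENTION_INTERVAL === 0 ? 'full' : 'linear',
);

const fullCount = layerTypes.filter(t => t === 'full').length; // 8
const linearCount = NUM_LAYERS - fullCount;                    // 24
```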

Usage

With mlx-node (TypeScript/JavaScript)

```typescript
import { createToolDefinition, loadModel } from '@mlx-node/lm';

const model = await loadModel('./qwen3.5-9B-unsloth');

// Chat (single-shot)
const result = await model.chat(
  [{ role: 'user', content: 'Explain the hybrid attention mechanism in Qwen3.5.' }],
  { maxNewTokens: 2048, temperature: 0.6, enableThinking: false },
);
console.log(result.text);

// Streaming (AsyncGenerator)
for await (const event of model.chatStream(
  [{ role: 'user', content: 'Write a haiku about coding.' }],
  { maxNewTokens: 512, temperature: 0.7 },
)) {
  if (!event.done) {
    process.stdout.write(event.text);
  } else {
    console.log('\nTokens:', event.numTokens);
  }
}

// Tool calling
const tools = [
  createToolDefinition(
    'get_weather',
    'Get weather for a location',
    { location: { type: 'string', description: 'City name' } },
    ['location'],
  ),
];

const toolResult = await model.chat(
  [{ role: 'user', content: 'What is the weather in Tokyo?' }],
  { tools, maxNewTokens: 2048 },
);
for (const call of toolResult.toolCalls) {
  console.log(call.name, call.arguments);
}
```

How It Was Made

Converted from Qwen/Qwen3.5-9B official SafeTensors using mlx-node's conversion pipeline:

```sh
mlx convert \
  -i .cache/models/qwen3.5-9B \
  -o .cache/models/qwen3.5-9B-unsloth \
  -q --q-recipe unsloth \
  --imatrix-path imatrix_unsloth.gguf
```

The --q-recipe unsloth flag applies the differential quantization strategy described above. The recipe defaults to 3-bit base (override with --q-bits). The --imatrix-path is required for the unsloth recipe — it applies AWQ-style channel pre-scaling before quantization using Unsloth's importance matrix. The conversion pipeline:

  1. Loads BF16 SafeTensors/GGUF weights via mmap (near-instant)
  2. Applies Qwen3.5-specific weight sanitization (norm +1.0 shift, dtype handling)
  3. Applies imatrix AWQ pre-scaling: important input channels are amplified (more quantization-resistant) while less important channels are shrunk, with inverse scales fused into preceding layer norms
  4. Runs the Unsloth recipe predicate to classify each tensor
  5. Quantizes attn q/k/v + SSM in_proj to 5-bit (AWQ-corrected), MLP gate/up to 3-bit, down to 4-bit, embed to 5-bit, lm_head to 6-bit
  6. Skips o_proj, out_proj, norms, A_log, conv1d, and low-rank projections (kept BF16)
  7. Writes single-file SafeTensors with per-layer quantization metadata in config.json
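Step 5's "n-bit affine (gs=64)" quantization can be sketched as follows. This is a minimal round-trip illustration of group-wise affine quantization, not the real mlx-node Metal kernels:

```typescript
// Minimal affine quantization with group size 64: each group of 64
// consecutive weights gets its own scale and minimum (zero offset).
function quantizeGroup(w: number[], bits: number): { q: number[]; scale: number; lo: number } {
  const levels = (1 << bits) - 1;        // e.g. 15 codes for 4-bit
  const lo = Math.min(...w);
  const hi = Math.max(...w);
  const scale = (hi - lo) / levels || 1; // avoid divide-by-zero on flat groups
  const q = w.map(v => Math.round((v - lo) / scale));
  return { q, scale, lo };
}

function dequantizeGroup(q: number[], scale: number, lo: number): number[] {
  return q.map(v => v * scale + lo);
}

// Round-trip error is bounded by half a quantization step per weight.
const group = Array.from({ length: 64 }, (_, i) => Math.sin(i * 0.37));
const { q, scale, lo } = quantizeGroup(group, 4);
const back = dequantizeGroup(q, scale, lo);
const maxErr = Math.max(...group.map((v, i) => Math.abs(v - back[i])));
```

Smaller groups (gs=64 rather than, say, 256) mean more scale metadata but a tighter range per group, and therefore a smaller quantization step for the same bit-width.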

Unsloth's imatrix uses long-context chat, coding, and tool-calling calibration data rather than standard Wikipedia-512 contexts. From Unsloth's findings: imatrix "definitely helps reduce KLD & PPL" across all bit-widths, and is especially impactful at lower bits (3-bit and below).

Files

| File | Size | Description |
|---|---|---|
| model.safetensors | 6.4 GB | Mixed-precision model weights |
| config.json | 30 KB | Model config + per-layer quantization overrides |
| tokenizer.json | 12 MB | HuggingFace tokenizer (248K vocab) |
| tokenizer_config.json | 16 KB | Tokenizer settings + Jinja2 chat template |
| vocab.json | 6.4 MB | Vocabulary mapping |
| merges.txt | 3.2 MB | BPE merges |

Chat Template

The official Qwen3.5 chat template is preserved unmodified, supporting:

  • Multi-turn conversation
  • System messages
  • Tool calling (<tool_call> / </tool_call> tags)
  • Chain-of-thought reasoning (<think> / </think> tags)
  • Image/video content placeholders (for VLM variants)

Template compatibility fix: The official Qwen3.5 template uses raise_exception() for input validation (8 call sites), which is not a built-in function in most Jinja2-compatible renderers. Unsloth identified and fixed chat template issues affecting tool-calling across all Qwen3.5 variants. mlx-node takes a complementary approach — rather than patching the template, we register raise_exception as a native function in our minijinja renderer, so the official template works as-is without modification.

Acknowledgments

  • Unsloth (GitHub) — The quantization strategy in this model is directly based on Unsloth's per-layer KL divergence benchmarks and their Dynamic 2.0 quantization methodology. Their work on imatrix calibration with long-context chat and tool-calling data, and their systematic analysis of layer sensitivity in hybrid GatedDeltaNet architectures, made this recipe possible. We also use their published imatrix GGUF files for AWQ pre-scaling when converting from GGUF sources.
  • Qwen Team — For the Qwen3.5 model family and the hybrid attention architecture
  • Apple MLX — For the Metal-accelerated ML framework powering inference

License

This model inherits the Apache 2.0 license from the base Qwen3.5-9B model.
