# Qwen3.5-27B — Unsloth Dynamic AWQ (3-bit, mlx-node)
Mixed-precision 3/4/5/6-bit quantization of Qwen/Qwen3.5-27B for Apple Silicon, using the Unsloth Dynamic quantization strategy via mlx-node.
|  | Original (BF16) | This Model |
|---|---|---|
| Size | 52 GB | 15 GB |
| Format | SafeTensors (sharded) | SafeTensors (single file) |
| Precision | BF16 uniform | Mixed 3/4/5/6-bit |
| Reduction | — | ~71% |
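As a sanity check on the table, the effective average bit-width implied by the file size can be estimated (assumptions: ~27e9 parameters, decimal gigabytes):

```typescript
// Back-of-envelope check (assumed: ~27e9 parameters, 15 GB ≈ 15e9 bytes).
const params = 27e9;
const sizeBytes = 15e9;
const effectiveBits = (sizeBytes * 8) / params;
// ≈ 4.4 bits/weight — consistent with a 3-bit base plus 4/5/6-bit
// sensitive layers and a handful of tensors kept at BF16.
```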
## Performance
Tested on Apple Silicon M3 Max 128GB with mlx-node:
| Model | Size | Decode (tok/s) | Speedup |
|---|---|---|---|
| BF16 (unquantized) | 52 GB | 5.6–6.6 | baseline |
| This model (Unsloth, 3-bit base) | 15 GB | 20.1–20.5 | ~3.3x faster |
Decode is memory-bandwidth bound on Apple Silicon — fewer bytes to transfer per token directly translates to higher throughput. Embeddings and lm_head stay quantized in memory (5/6-bit) and use quantized_matmul on forward — no dequantize-at-load overhead. Attention q/k/v and SSM input projections are quantized at 5-bit with imatrix AWQ pre-scaling for near-lossless quality. Attention o_proj and SSM out_proj are kept at bf16 (no preceding norm for AWQ correction).
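Since decode streams essentially every weight byte once per generated token, the expected speedup is roughly the ratio of model sizes. A back-of-envelope sketch using the numbers from the tables above:

```typescript
// Memory-bandwidth-bound decode: throughput ≈ bandwidth / bytes-per-token,
// and bytes-per-token ≈ model size, so expected speedup ≈ size ratio.
const bf16SizeGB = 52;
const quantSizeGB = 15;
const expectedSpeedup = bf16SizeGB / quantSizeGB;
// ≈ 3.47x — close to the measured ~3.3x; the gap is non-weight traffic
// (KV cache, activations) plus compute overhead.
```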
## Quantization Strategy
This model's quantization recipe is based on the Unsloth team's extensive per-layer KL divergence benchmarking of the Qwen3.5 architecture. Their work — published as Unsloth Dynamic 2.0 — is the most thorough public analysis of how Qwen3.5's hybrid GatedDeltaNet + full attention design responds to quantization, and the foundation for every decision in this model.
### Why Qwen3.5 Needs Special Treatment
Qwen3.5 is not a standard transformer. It uses a hybrid architecture: 48 GatedDeltaNet linear attention layers + 16 standard full attention layers (full_attention_interval=4). Standard uniform quantization treats all layers equally, but Unsloth's benchmarks revealed that this architecture has fundamentally different sensitivity profiles across layer types:
- Quantizing `ssm_out` (`linear_attn.out_proj`) at Q2_K "does dramatically worse" — KLD spikes far beyond other components
- Attention tensors (`self_attn.*`) are "especially sensitive for hybrid architectures" — more so than in pure-attention models like LLaMA
- Attention gates (`linear_attn.in_proj_z`) — MXFP4 "performs poorly" on these
- FFN gate/up projections are "generally ok to quantize to 3-bit" — the only layers that tolerate aggressive compression well
- FFN `down_proj` is "slightly more sensitive" than gate/up — benefits from an extra bit
The key insight from Unsloth's work: it's better to quantize sensitive layers at higher bits (5-bit with imatrix) and aggressively quantize the rest (3-bit), than to uniformly quantize everything at a middling bit-width. Their Dynamic models consistently sit on the Pareto frontier for 99.9% KL divergence vs model size, outperforming uniform quantization at every size point.
### Unsloth imatrix: The Calibration Foundation
The second pillar of Unsloth's approach is their importance matrix (imatrix) — per-channel calibration data that tells the quantizer which channels within each tensor carry the most information.
Standard imatrix calibrations (used by most GGUF quantizers) run the model on Wikipedia-512 — short encyclopedia passages. Unsloth instead calibrates on long-context chat, coding, and tool-calling data, which better represents how these models are actually used. From Unsloth's findings:
- "Imatrix definitely helps reduce KLD & PPL" across all bit-widths
- "Imatrix generally helps on lower bits, and works on all quants and bit widths"
- SSM output at 2-bits was "really bad" without imatrix, but imatrix "reduces the 99.9% KLD by a lot"
- Trade-off: I-quants make "inference 5-10% slower", but the quality gain is substantial
When an imatrix is provided to mlx-node's conversion pipeline, it applies AWQ-style channel pre-scaling before quantization: important input channels (high activation magnitude) are amplified to make them more quantization-resistant, while less important channels are shrunk. The inverse scales are fused into preceding layer norms, so there is zero inference overhead — the quality improvement is free at runtime.
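The "zero inference overhead" claim follows from a simple algebraic identity: scaling weight input channels by `s` and the preceding norm's gain by `1/s` leaves the layer's output unchanged. A minimal sketch of the idea (plain arrays, not mlx-node's actual API):

```typescript
// AWQ-style channel pre-scaling, simplified: important input channels of W
// are amplified by s before quantization; the inverse 1/s is fused into the
// preceding norm's gamma, so (x·gamma/s) @ (s·W)^T === (x·gamma) @ W^T.
const gamma = [1.0, 1.0, 1.0];            // preceding RMSNorm weight
const W = [[2, -1, 0.5], [1, 3, -2]];     // rows = outputs, cols = input channels
const s = [2.0, 0.5, 1.0];                // per-channel importance scales

const scaledW = W.map(row => row.map((w, j) => w * s[j]));
const fusedGamma = gamma.map((g, j) => g / s[j]);

// Verify the identity for an arbitrary input x.
const x = [0.3, -1.2, 0.7];
const matvec = (m: number[][], v: number[]) =>
  m.map(row => row.reduce((acc, w, j) => acc + w * v[j], 0));
const orig = matvec(W, x.map((xi, j) => xi * gamma[j]));
const fused = matvec(scaledW, x.map((xi, j) => xi * fusedGamma[j]));
// orig and fused agree to floating-point precision
```

Quantization error then concentrates on the shrunken (less important) channels, which is where the quality gain comes from.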
### Per-Layer Decisions
Based on Unsloth's per-tensor 99.9% KLD analysis (sorted by sensitivity, worst → best):
| Component | Precision | Count | Unsloth Finding (99.9% KLD) |
|---|---|---|---|
| `self_attn.{q,k,v}_proj` | 5-bit affine (gs=64) + AWQ | 48 tensors | KLD ~1.5–2.9 — "especially sensitive for hybrid architectures"; AWQ-corrected via input_layernorm |
| `self_attn.o_proj` | BF16 (skip) | 16 tensors | KLD ~1.5 — no preceding norm for AWQ correction |
| `linear_attn.in_proj_qkv` | 5-bit affine (gs=64) + AWQ | 48 tensors | KLD ~2.9 — SSM input projection; AWQ-corrected via input_layernorm |
| `linear_attn.in_proj_z` | 5-bit affine (gs=64) + AWQ | 48 tensors | KLD ~1.5 — "performs poorly with MXFP4"; AWQ-corrected via input_layernorm |
| `linear_attn.out_proj` | BF16 (skip) | 48 tensors | KLD ~6.0 at q2_k — worst tensor; no preceding norm for AWQ correction |
| `linear_attn.A_log` | BF16 (skip) | 48 tensors | State-space dynamics — not quantizable |
| `linear_attn.conv1d` | BF16 (skip) | 48 tensors | KLD ~0.05 — too small to quantize meaningfully |
| `linear_attn.in_proj_{a,b}` | BF16 (skip) | 96 tensors | Low-rank projections — too small |
| `mlp.down_proj` | 4-bit affine (gs=64) | 64 tensors | "Slightly more sensitive" than gate/up |
| `mlp.gate_proj` | 3-bit affine (gs=64) | 64 tensors | "Generally ok to quantize to 3-bit" |
| `mlp.up_proj` | 3-bit affine (gs=64) | 64 tensors | "Generally ok to quantize to 3-bit" |
| `embed_tokens` | 5-bit affine (gs=64) | 1 tensor | KLD ~0.15 at q5_k — among least sensitive |
| `lm_head` | 6-bit affine (gs=64) | 1 tensor | KLD ~0.05 at q5_k — safest tensor to quantize |
| Norms | BF16 (skip) | ~260 tensors | Never quantized (standard practice) |
AWQ-correctable projections (q/k/v, in_proj_qkv/z) are quantized at 5-bit with imatrix AWQ pre-scaling via `input_layernorm`. Non-AWQ-correctable projections (`o_proj`, `out_proj`) are kept at BF16 — their inputs come from attention/GDN computation rather than a norm layer, so AWQ cannot be applied. An imatrix is required for the unsloth recipe.
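The per-tensor decisions above amount to a classification predicate over tensor names. A hypothetical sketch of that predicate (illustrative only — mlx-node's real recipe internals may differ); it returns a bit-width, or `null` to keep the tensor at BF16:

```typescript
// Illustrative recipe predicate for the table above (hypothetical helper,
// not mlx-node's actual API). Order matters: skip rules come first.
function unslothRecipeBits(name: string): number | null {
  if (/norm|A_log|conv1d|in_proj_[ab]\b/.test(name)) return null;        // never quantized
  if (/self_attn\.o_proj|linear_attn\.out_proj/.test(name)) return null; // no AWQ norm
  if (/self_attn\.[qkv]_proj|linear_attn\.in_proj_(qkv|z)/.test(name)) return 5;
  if (/mlp\.down_proj/.test(name)) return 4;
  if (/mlp\.(gate|up)_proj/.test(name)) return 3;
  if (/embed_tokens/.test(name)) return 5;
  if (/lm_head/.test(name)) return 6;
  return null;                                                           // unknown → keep BF16
}
```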
## Comparison with Unsloth GGUF (UD-Q3_K_XL)
| Tensor | Unsloth UD-Q3_K_XL | Ours | Gap |
|---|---|---|---|
| attn q/k/v | Q5_K + imatrix | 5-bit affine + AWQ | Small (AWQ compensates) |
| in_proj_qkv/z | Q5_K + imatrix | 5-bit affine + AWQ | Small |
| o_proj | Q5_K + imatrix | bf16 | We're larger but lossless |
| out_proj | Q5_K + imatrix | bf16 | We're larger but lossless |
| FFN gate/up | Q3_K + imatrix | 3-bit affine + AWQ | Moderate (K-quant > affine at 3-bit) |
| FFN down | Q4_K + imatrix | 4-bit affine + AWQ | Small |
## Architecture
Qwen3.5-27B is a decoder-only transformer with a hybrid attention design:
| Parameter | Value |
|---|---|
| Hidden size | 5,120 |
| Layers | 64 (48 linear + 16 full attention) |
| Attention heads | 24 (4 KV heads, GQA 6:1) |
| Head dimension | 256 |
| Intermediate size | 17,408 |
| Vocab size | 248,320 |
| Max context | 262,144 tokens |
| Activation | SiLU |
Layer pattern (repeating): `[linear, linear, linear, full, linear, linear, linear, full, ...]`
- Linear attention layers use GatedDeltaNet: depthwise Conv1d + gated delta recurrence (state-space model)
- Full attention layers use standard grouped-query attention with KV caching
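The repeating pattern follows directly from `full_attention_interval = 4`; a quick sketch reconstructing it (assuming the interval counts layers 1-based, which matches the pattern shown above):

```typescript
// Every 4th layer is full attention; the rest are GatedDeltaNet linear layers.
const fullAttentionInterval = 4;
const numLayers = 64;
const pattern = Array.from({ length: numLayers }, (_, i) =>
  (i + 1) % fullAttentionInterval === 0 ? 'full' : 'linear',
);
// Yields 48 linear + 16 full attention layers, matching the table above.
```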
## Usage

### With mlx-node (TypeScript/JavaScript)
```typescript
import { loadModel, createToolDefinition } from '@mlx-node/lm';

const model = await loadModel('./qwen3.5-27b-unsloth');

// Chat (single-shot)
const result = await model.chat(
  [{ role: 'user', content: 'Explain the hybrid attention mechanism in Qwen3.5.' }],
  { maxNewTokens: 2048, temperature: 0.6, enableThinking: false },
);
console.log(result.text);

// Streaming (AsyncGenerator)
for await (const event of model.chatStream(
  [{ role: 'user', content: 'Write a haiku about coding.' }],
  { maxNewTokens: 512, temperature: 0.7 },
)) {
  if (!event.done) {
    process.stdout.write(event.text);
  } else {
    console.log('\nTokens:', event.numTokens);
  }
}

// Tool calling
const tools = [
  createToolDefinition(
    'get_weather',
    'Get weather for a location',
    { location: { type: 'string', description: 'City name' } },
    ['location'],
  ),
];
const toolResult = await model.chat(
  [{ role: 'user', content: 'What is the weather in Tokyo?' }],
  { tools, maxNewTokens: 2048 },
);
for (const call of toolResult.toolCalls) {
  console.log(call.name, call.arguments);
}
```
## How It Was Made
Converted from Qwen/Qwen3.5-27B official SafeTensors using mlx-node's conversion pipeline:
```sh
mlx convert \
  -i .cache/models/qwen3.5-27B \
  -o .cache/models/qwen3.5-27b-unsloth \
  -q --q-recipe unsloth \
  --imatrix-path imatrix_unsloth.gguf
```
The `--q-recipe unsloth` flag applies the differential quantization strategy described above. The recipe defaults to a 3-bit base (override with `--q-bits`). The `--imatrix-path` flag is required for the unsloth recipe — it applies AWQ-style channel pre-scaling before quantization using Unsloth's importance matrix. The conversion pipeline:
- Loads BF16 SafeTensors/GGUF weights via mmap (near-instant)
- Applies Qwen3.5-specific weight sanitization (norm +1.0 shift, dtype handling)
- Applies imatrix AWQ pre-scaling: important input channels are amplified (more quantization-resistant) while less important channels are shrunk, with inverse scales fused into preceding layer norms
- Runs the Unsloth recipe predicate to classify each tensor
- Quantizes attn q/k/v + SSM in_proj to 5-bit (AWQ-corrected), MLP gate/up to 3-bit, down to 4-bit, embed to 5-bit, lm_head to 6-bit
- Skips o_proj, out_proj, norms, A_log, conv1d, and low-rank projections (kept BF16)
- Writes single-file SafeTensors with per-layer quantization metadata in `config.json`
Unsloth's imatrix uses long-context chat, coding, and tool-calling calibration data rather than standard Wikipedia-512 contexts. From Unsloth's findings: imatrix "definitely helps reduce KLD & PPL" across all bit-widths, and is especially impactful at lower bits (3-bit and below).
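The "N-bit affine (gs=64)" entries throughout this card refer to affine (scale + zero-point) quantization applied per group of 64 weights. A minimal sketch of the scheme (hypothetical helper names, not mlx-node's actual implementation):

```typescript
// Affine group quantization: each group of 64 weights gets its own scale
// and zero-point; values map to integers in [0, 2^bits - 1].
function quantizeGroup(w: number[], bits: number) {
  const max = Math.max(...w);
  const min = Math.min(...w);
  const levels = (1 << bits) - 1;
  const scale = (max - min) / levels || 1; // guard against flat groups
  const q = w.map(v => Math.round((v - min) / scale));
  return { q, scale, zero: min };
}

function dequantizeGroup(q: number[], scale: number, zero: number): number[] {
  return q.map(v => v * scale + zero);
}

// Round-trip error is bounded by scale / 2 per element.
const group = Array.from({ length: 64 }, (_, i) => Math.sin(i) * 0.1);
const { q, scale, zero } = quantizeGroup(group, 3);
const restored = dequantizeGroup(q, scale, zero);
```

Smaller groups (gs=64 rather than a whole row) keep outliers from inflating the scale for distant weights, at the cost of storing one scale/zero pair per group.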
## Files
| File | Size | Description |
|---|---|---|
| `model.safetensors` | 15 GB | Mixed-precision model weights |
| `config.json` | 73 KB | Model config + per-layer quantization overrides |
| `tokenizer.json` | 12 MB | HuggingFace tokenizer (248K vocab) |
| `tokenizer_config.json` | 16 KB | Tokenizer settings + Jinja2 chat template |
| `vocab.json` | 6.4 MB | Vocabulary mapping |
| `merges.txt` | 3.2 MB | BPE merges |
## Chat Template
The official Qwen3.5 chat template is preserved unmodified, supporting:
- Multi-turn conversation
- System messages
- Tool calling (`<tool_call>` / `</tool_call>` tags)
- Chain-of-thought reasoning (`<think>` / `</think>` tags)
- Image/video content placeholders (for VLM variants)
**Template compatibility fix:** The official Qwen3.5 template uses `raise_exception()` for input validation (8 call sites), which is not a built-in function in most Jinja2-compatible renderers. Unsloth identified and fixed chat-template issues affecting tool calling across all Qwen3.5 variants. mlx-node takes a complementary approach — rather than patching the template, we register `raise_exception` as a native function in our minijinja renderer, so the official template works as-is without modification.
## Acknowledgments
- Unsloth (GitHub) — The quantization strategy in this model is directly based on Unsloth's per-layer KL divergence benchmarks and their Dynamic 2.0 quantization methodology. Their work on imatrix calibration with long-context chat and tool-calling data, and their systematic analysis of layer sensitivity in hybrid GatedDeltaNet architectures, made this recipe possible. We also use their published imatrix GGUF files for AWQ pre-scaling when converting from GGUF sources.
- Qwen Team — For the Qwen3.5 model family and the hybrid attention architecture
- Apple MLX — For the Metal-accelerated ML framework powering inference
## License
This model inherits the Apache 2.0 license from the base Qwen3.5-27B model.