Qwen-3.5-unsloth-mlx
Part of a collection: AWQ-style pre-scaling using Unsloth's imatrix calibration data, then 3–6-bit affine quantization with the Unsloth mixed-precision recipe via MLX.
3-bit base mixed-precision quantization of Qwen/Qwen3.5-9B for Apple Silicon, using the Unsloth Dynamic quantization strategy via mlx-node.
| | Original (BF16) | This Model |
|---|---|---|
| Size | ~18 GB | 6 GB |
| Format | SafeTensors (sharded) | SafeTensors (single file) |
| Precision | BF16 uniform | Mixed 3/4/5/6/8-bit + BF16 |
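The mixed 3/4/5/6/8-bit entries above refer to affine (asymmetric) quantization applied per weight group. A minimal sketch of the idea, assuming a simple min/max group scheme; the group size and rounding details of the real MLX kernels may differ:

```typescript
// b-bit affine quantization over one weight group: map the group's
// [min, max] range onto the integer grid [0, 2^b - 1].
function quantizeGroup(weights: number[], bits: number) {
  const qmax = (1 << bits) - 1;            // e.g. 7 for 3-bit
  const wmin = Math.min(...weights);
  const wmax = Math.max(...weights);
  const scale = (wmax - wmin) / qmax || 1; // guard against a constant group
  const q = weights.map((w) => Math.round((w - wmin) / scale));
  return { q, scale, zero: wmin };
}

// Reconstruction: w_hat = q * scale + zero.
function dequantizeGroup(q: number[], scale: number, zero: number): number[] {
  return q.map((v) => v * scale + zero);
}

// Fewer bits mean a coarser grid and larger reconstruction error,
// which is why sensitive tensors get more bits in the recipe below.
const group = [-0.9, -0.3, 0.0, 0.4, 1.1];
const { q, scale, zero } = quantizeGroup(group, 3);
const reconstructed = dequantizeGroup(q, scale, zero);
```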
| Repo | GGUF Equivalent | Size | Decode (tok/s) | Speedup vs BF16 |
|---|---|---|---|---|
| Brooooooklyn/Qwen3.5-9B-UD-Q2_K_XL-mlx | UD-Q2_K_XL | 5 GB | TBD | TBD |
| Brooooooklyn/Qwen3.5-9B-UD-Q3_K_XL-mlx | UD-Q3_K_XL | 6 GB | TBD | TBD |
| Brooooooklyn/Qwen3.5-9B-UD-Q4_K_XL-mlx | UD-Q4_K_XL | 8 GB | TBD | TBD |
| Brooooooklyn/Qwen3.5-9B-UD-Q5_K_XL-mlx | UD-Q5_K_XL | 9 GB | TBD | TBD |
| Brooooooklyn/Qwen3.5-9B-UD-Q6_K_XL-mlx | UD-Q6_K_XL | 9 GB | TBD | TBD |
| Brooooooklyn/Qwen3.5-9B-UD-Q8_K_XL-mlx | UD-Q8_K_XL | 10 GB | TBD | TBD |
Benchmarked on Apple M3 Max 128GB, multi-turn chat (Turn 4 decode, steady-state).
| Weight | Bits | Rationale |
|---|---|---|
| embed_tokens | 5-bit | KLD ~0.15 — very low sensitivity |
| lm_head | 6-bit | KLD ~0.05 — safest tensor |
| self_attn.q/k/v_proj | 5-bit + AWQ | KLD ~1.5–2.9, AWQ via layernorm |
| linear_attn.in_proj_qkv/z | 5-bit + AWQ | KLD ~2.9, AWQ via layernorm |
| self_attn.o_proj | bf16 | Not AWQ-correctable |
| linear_attn.out_proj | bf16 | KLD ~6.0 — worst tensor |
| down_proj | 4-bit | "Slightly more sensitive" |
| gate_proj, up_proj | 3-bit | "Generally ok" at low bits |
Based on the Unsloth Dynamic 2.0 per-tensor KLD analysis: sensitive layers get higher bit-widths with AWQ correction, while FFN weights are quantized aggressively. The imatrix-driven AWQ pre-scaling amplifies important weight channels and fuses the inverse scales into the preceding layer norms, so it adds zero inference overhead.

AWQ-correctable projections (q/k/v, in_proj_qkv/z) are quantized at 5-bit, with their inverse scales fused into input_layernorm. Non-AWQ-correctable projections (o_proj, out_proj) are kept at bf16.
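The scale-fusion trick above can be sketched in a few lines. The scale formula here (mean activation magnitude raised to an `alpha` exponent) is an illustrative assumption, not the exact Unsloth/imatrix recipe; the fusion identity itself is exact:

```typescript
// Per-input-channel AWQ scales from calibration activation statistics.
// alpha = 0.5 is a common choice in AWQ-style methods (assumption here).
function awqScales(actAbsMean: number[], alpha = 0.5): number[] {
  return actAbsMean.map((a) => Math.max(a, 1e-6) ** alpha);
}

// Scale column i of W ([out][in]) by s[i] before quantization,
// enlarging salient channels so they survive low-bit rounding...
function scaleWeightColumns(W: number[][], s: number[]): number[][] {
  return W.map((row) => row.map((w, i) => w * s[i]));
}

// ...and fold the inverse scales into the preceding norm's gamma:
// (gamma_i / s_i) * x_i feeding (W_ji * s_i) leaves the output unchanged,
// which is why the pre-scaling costs nothing at inference time.
function fuseIntoNormGamma(gamma: number[], s: number[]): number[] {
  return gamma.map((g, i) => g / s[i]);
}
```

This also shows why o_proj and out_proj are "not AWQ-correctable" in this setup: they are not fed directly by a layer norm, so there is nowhere free to fuse the inverse scales.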
```ts
import { loadModel } from '@mlx-node/lm';

const model = await loadModel('./Qwen3.5-9B-UD-Q3_K_XL-mlx');
const result = await model.chat(
  [{ role: 'user', content: 'Explain the hybrid attention mechanism in Qwen3.5.' }],
  { maxNewTokens: 2048, temperature: 0.6, enableThinking: false },
);
console.log(result.text);
```
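The per-tensor KLD figures in the recipe table measure how much a quantized tensor shifts the model's next-token distribution relative to BF16. A minimal sketch of the metric itself, using plain logit arrays (extracting logits from the two models is out of scope here):

```typescript
// Numerically stable softmax over a logit vector.
function softmax(logits: number[]): number[] {
  const m = Math.max(...logits);
  const e = logits.map((x) => Math.exp(x - m));
  const z = e.reduce((a, b) => a + b, 0);
  return e.map((v) => v / z);
}

// KL(p_bf16 || p_quant) over one next-token distribution.
// Lower is better; identical logits give exactly 0.
function klDivergence(bf16Logits: number[], quantLogits: number[]): number {
  const p = softmax(bf16Logits);
  const q = softmax(quantLogits);
  return p.reduce((acc, pi, i) => acc + pi * Math.log(pi / q[i]), 0);
}
```

In practice the reported figure is an average over many calibration tokens; the single-distribution form above is the per-token building block.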
```sh
mlx convert \
  -i Qwen3.5-9B \
  -o Qwen3.5-9B-UD-Q3_K_XL-mlx \
  -q --q-bits 3 --q-recipe unsloth \
  --imatrix-path imatrix_unsloth.gguf
```
License: Apache 2.0 (inherited from the base model).