# Qwen3.5-9B-q4f16_1-MLC

This is the Qwen3.5-9B model in MLC format `q4f16_1`.
Qwen3.5 is a hybrid architecture: 75% of its layers are GatedDeltaNet recurrent linear-attention layers, and 25% are standard GQA softmax-attention layers. This requires the `kHybrid` KVStateKind in MLC-LLM, which manages a PagedKVCache and an RNNState simultaneously.

Compiled with `mlc-llm` using the hybrid KVStateKind branch.
## Usage

### Python API

```python
from mlc_llm import MLCEngine

model = "HF://Mitiskuma/Qwen3.5-9B-q4f16_1-MLC"
engine = MLCEngine(model, device="metal")

for response in engine.chat.completions.create(
    messages=[{"role": "user", "content": "What is the meaning of life?"}],
    model=model,
    stream=True,
):
    for choice in response.choices:
        print(choice.delta.content, end="", flush=True)
print()

engine.terminate()
```
### Chat CLI

```shell
mlc_llm chat HF://Mitiskuma/Qwen3.5-9B-q4f16_1-MLC
```
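The model can also be served over MLC-LLM's OpenAI-compatible REST API; the host, port, and endpoint path below are the `mlc_llm serve` defaults at the time of writing:

```shell
# Launch an OpenAI-compatible server (defaults to 127.0.0.1:8000)
mlc_llm serve HF://Mitiskuma/Qwen3.5-9B-q4f16_1-MLC

# Query it from another terminal
curl -X POST http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "HF://Mitiskuma/Qwen3.5-9B-q4f16_1-MLC",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```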
## Model Details
| Parameter | Value |
|---|---|
| Base model | Qwen3.5-9B |
| Architecture | Qwen3.5 GatedDeltaNet (hybrid recurrent + attention) |
| Quantization | q4f16_1 |
| KV state kind | hybrid (PagedKVCache + RNNState) |
| Context window | 1024 tokens (compile-time setting) |
| Conversation template | chatml |