# Qwen3.5-9B-q4f16_1-MLC

This is the Qwen3.5-9B model in MLC format `q4f16_1`.
Qwen3.5 is a hybrid architecture: 75% of its layers are GatedDeltaNet recurrent linear-attention layers, and 25% are standard GQA softmax-attention layers. This requires the `kHybrid` KVStateKind in MLC-LLM, which manages a PagedKVCache and an RNNState simultaneously.

Compiled with `mlc-llm` using the hybrid KVStateKind branch.
## Usage

### Python API

```python
from mlc_llm import MLCEngine

model = "HF://Mitiskuma/Qwen3.5-9B-q4f16_1-MLC"
engine = MLCEngine(model, device="metal")

for response in engine.chat.completions.create(
    messages=[{"role": "user", "content": "What is the meaning of life?"}],
    model=model,
    stream=True,
):
    for choice in response.choices:
        print(choice.delta.content, end="", flush=True)
print()

engine.terminate()
```
### Chat CLI

```shell
mlc_llm chat HF://Mitiskuma/Qwen3.5-9B-q4f16_1-MLC
```
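The model can also be served over MLC-LLM's OpenAI-compatible REST API; the host, port, and endpoint path below are the `mlc_llm serve` defaults at the time of writing:

```shell
# Launch an OpenAI-compatible server (defaults to 127.0.0.1:8000)
mlc_llm serve HF://Mitiskuma/Qwen3.5-9B-q4f16_1-MLC

# Query it from another terminal
curl -X POST http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "HF://Mitiskuma/Qwen3.5-9B-q4f16_1-MLC",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```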
## Model Details
| Parameter | Value |
|---|---|
| Base model | Qwen3.5-9B |
| Architecture | Qwen3.5 GatedDeltaNet (hybrid recurrent + attention) |
| Quantization | q4f16_1 |
| KV state kind | hybrid (PagedKVCache + RNNState) |
| Context window | 1024 tokens (compile-time setting) |
| Conversation template | chatml |