Raon-Speech-9B-AWQ-INT4

Technical Report | Blog (Coming soon)

RAON-Speech is a 9B-parameter speech language model that delivers state-of-the-art speech understanding, answering, and generation in English and Korean. The model transforms a pre-trained LLM into a SpeechLM that both understands and generates speech without compromising the LLM's original language capabilities. It is trained on millions of hours of English-Korean speech-text data in three stages: (1) speech encoder-decoder alignment, (2) end-to-end SpeechLM pre-training, and (3) multi-reward DPO-based post-training.

Key Features

  • End-to-End Speech Language Model: 9B-parameter multimodal model built on Qwen3 (36 layers, 4096 hidden dim), Qwen3OmniMoeAudioEncoder (24 layers), Mimi codec (32 quantizers), and ECAPA-TDNN speaker encoder.
  • Bilingual Support: State-of-the-art speech understanding, answering, and generation in both English and Korean.
  • Multi-Task Capabilities: Supports STT (audio → text), TTS (text → audio), TextQA (text + audio → text), and SpeechChat (audio → text) in a single unified model.
  • Speaker Voice Conditioning: TTS with optional speaker reference audio for voice cloning via ECAPA-TDNN embeddings.
  • TTS Continuation: Generate speech that naturally continues from a reference audio, with prefill-based continuation for seamless prosody.
  • Multi-Reward DPO Post-Training: Three-stage training pipeline — (1) speech encoder-decoder alignment, (2) end-to-end SpeechLM pre-training, and (3) multi-reward DPO-based post-training — for high-quality speech generation.
  • HuggingFace Transformers Integration: Load and run directly via AutoModel.from_pretrained with trust_remote_code=True — no custom package installation required.

Benchmark Results

Measured on LibriSpeech test-clean samples in a single-GPU setup via streaming TTS. All values are averages.

| Metric          | BF16    | FP8     | INT4 (AWQ) |
|-----------------|---------|---------|------------|
| RTF             | 0.45    | 0.46    | 0.41       |
| TTFT            | 887 ms  | 904 ms  | 827 ms     |
| TBT             | 233 ms  | 263 ms  | 242 ms     |
| Checkpoint size | 18.1 GB | 11.2 GB | 7.8 GB     |
  • RTF (Real-Time Factor): Lower is faster. Values below 1.0 mean faster-than-real-time synthesis.
  • TTFT (Time to First Token): Latency until the first audio chunk is returned.
  • TBT (Time Between Tokens): Average interval between consecutive audio chunks.
  • INT4 (AWQ) offers the fastest inference and the smallest checkpoint, at the cost of a slightly higher TBT than BF16.
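To make the headline metric concrete: RTF is wall-clock synthesis time divided by the duration of the audio produced. The helper below is a generic sketch of that calculation, not part of the RaonPipeline API.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent synthesizing / duration of audio produced.
    Values below 1.0 mean faster-than-real-time synthesis."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return synthesis_seconds / audio_seconds

# Example: 4.1 s of compute for a 10 s clip matches the INT4 row above.
print(round(real_time_factor(4.1, 10.0), 2))  # 0.41
```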

Requirements

```shell
pip install "transformers>=4.57.1" torch torchaudio soundfile accelerate

# Optional
pip install speechbrain  # for TTS with speaker voice conditioning
pip install gradio       # for Gradio demo
```

Quick Start

Option 1: Load from Hub (recommended)

No `pip install raon` needed.

```python
from transformers import AutoConfig
from transformers.dynamic_module_utils import get_class_from_dynamic_module

MODEL_ID = "KRAFTON/Raon-Speech-9B-AWQ-INT4"

# Resolve the pipeline class shipped with the model repo (remote code).
_cfg = AutoConfig.from_pretrained(MODEL_ID, trust_remote_code=True)
RaonPipeline = get_class_from_dynamic_module(
    "modeling_raon.RaonPipeline",
    MODEL_ID,
    revision=getattr(_cfg, "_commit_hash", None),
)
del _cfg

pipe = RaonPipeline(MODEL_ID, device="cuda", dtype="bfloat16")
```

Option 2: With raon package installed

```shell
git clone https://github.com/krafton-ai/Raon-Speech.git
cd Raon-Speech/raon
pip install -e .  # or: uv sync
```

```python
from raon import RaonPipeline

# From the Hub (local code + Hub weights)
pipe = RaonPipeline("KRAFTON/Raon-Speech-9B-AWQ-INT4")

# From a local path
pipe = RaonPipeline("/path/to/raon-model")
```

Tasks

STT (Audio → Text)

```python
text = pipe.stt("audio.wav")
```

TTS (Text → Audio)

```python
# Without speaker conditioning
audio, sr = pipe.tts("Hello, how are you?")
pipe.save_audio((audio, sr), "output.wav")

# With speaker conditioning (requires speechbrain)
audio, sr = pipe.tts("Hello, how are you?", speaker_audio="speaker_ref.wav")
```

TextQA (Text + Audio → Text)

```python
answer = pipe.textqa("What is the speaker saying?", audio="audio.wav")
```

SpeechChat (Audio → Text)

```python
answer = pipe.speech_chat("question.wav")
```

Chat (Multimodal)

```python
messages = [
    {
        "role": "user",
        "content": [
            {"type": "audio", "audio": "audio.wav"},
            {"type": "text", "text": "Transcribe and summarise this audio."},
        ],
    },
]
response = pipe.chat(messages)
```
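For multi-turn use, the same OpenAI-style message list extends naturally. The sketch below assumes, as the example above suggests, that `pipe.chat` returns the assistant's reply as a string, which you append before adding the next user turn; the helper itself is illustrative, not part of the RaonPipeline API.

```python
def append_turn(messages, assistant_reply, next_user_text):
    """Extend an OpenAI-style message list with the assistant's reply
    and a follow-up user turn, without mutating the caller's list."""
    messages = list(messages)
    messages.append({"role": "assistant", "content": assistant_reply})
    messages.append(
        {"role": "user", "content": [{"type": "text", "text": next_user_text}]}
    )
    return messages

# Hypothetical second turn after the chat example above:
# history = append_turn(messages, response, "Now translate the summary into Korean.")
# response2 = pipe.chat(history)
```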

Deployment (vLLM-Omni)

1. Clone & Build

```shell
git clone https://github.com/krafton-ai/vllm-omni.git
cd vllm-omni
docker build -f docker/Dockerfile.ci -t vllm-omni .
```

2. Serve

```shell
docker run --rm --gpus all \
  --shm-size=16g \
  -p 8000:8000 \
  vllm-omni \
  bash -c "vllm serve KRAFTON/Raon-Speech-9B-AWQ-INT4 --omni --port 8000 --trust-remote-code --quantization awq --dtype float16"
```

3. Test — TTS

```shell
curl -X POST http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "input": "Hello, how are you?",
    "model": "KRAFTON/Raon-Speech-9B-AWQ-INT4",
    "response_format": "wav"
  }' --output output.wav
```

4. Test — TTS with voice cloning

```shell
curl -X POST http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "input": "Hello, how are you?",
    "model": "KRAFTON/Raon-Speech-9B-AWQ-INT4",
    "ref_audio": "data:audio/wav;base64,'"$(base64 -w0 speaker_ref.wav)"'",
    "task_type": "Base",
    "response_format": "wav"
  }' --output cloned.wav
```

5. Test — STT

```shell
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "KRAFTON/Raon-Speech-9B-AWQ-INT4",
    "messages": [
      {
        "role": "user",
        "content": [
          {"type": "audio_url", "audio_url": {"url": "data:audio/wav;base64,'"$(base64 -w0 audio.wav)"'"}},
          {"type": "text", "text": "Transcribe the audio into text."}
        ]
      }
    ]
  }'
```
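The `data:audio/wav;base64,...` URLs in the curl commands above can also be built in Python. This is a plain-stdlib sketch of the same encoding, not a documented client for the server; the request-body shape in the comment mirrors the voice-cloning example.

```python
import base64


def audio_bytes_to_data_url(data: bytes, mime: str = "audio/wav") -> str:
    """Encode raw audio bytes as the data:<mime>;base64,... URL
    used by the /v1/audio/speech and /v1/chat/completions examples."""
    return f"data:{mime};base64," + base64.b64encode(data).decode("ascii")


def audio_file_to_data_url(path: str) -> str:
    """Read an audio file from disk and encode it as a data URL."""
    with open(path, "rb") as f:
        return audio_bytes_to_data_url(f.read())


# Usage in a request body, mirroring the voice-cloning curl example:
# body = {
#     "input": "Hello, how are you?",
#     "model": "KRAFTON/Raon-Speech-9B-AWQ-INT4",
#     "ref_audio": audio_file_to_data_url("speaker_ref.wav"),
#     "task_type": "Base",
#     "response_format": "wav",
# }
```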

License

This repository is licensed under the Creative Commons Attribution-NonCommercial 4.0 International License.

© 2026 KRAFTON
