Raon-Speech-9B-AWQ-INT4
Technical Report | Blog (Coming soon)
RAON-Speech is a 9B-parameter speech language model that delivers state-of-the-art speech understanding, answering, and generation in English and Korean. It transforms a pre-trained LLM into a SpeechLM that both understands and generates speech without compromising the original language capabilities. The model is trained on millions of hours of English-Korean speech-text data in three stages: (1) speech encoder-decoder alignment, (2) end-to-end SpeechLM pre-training, and (3) multi-reward DPO-based post-training.
Key Features
- End-to-End Speech Language Model: 9B-parameter multimodal model built on Qwen3 (36 layers, 4096 hidden dim), Qwen3OmniMoeAudioEncoder (24 layers), Mimi codec (32 quantizers), and ECAPA-TDNN speaker encoder.
- Bilingual Support: State-of-the-art speech understanding, answering, and generation in both English and Korean.
- Multi-Task Capabilities: Supports STT (audio → text), TTS (text → audio), TextQA (text + audio → text), and SpeechChat (audio → text) in a single unified model.
- Speaker Voice Conditioning: TTS with optional speaker reference audio for voice cloning via ECAPA-TDNN embeddings.
- TTS Continuation: Generate speech that naturally continues from a reference audio, with prefill-based continuation for seamless prosody.
- Multi-Reward DPO Post-Training: Three-stage training pipeline — (1) speech encoder-decoder alignment, (2) end-to-end SpeechLM pre-training, and (3) multi-reward DPO-based post-training — for high-quality speech generation.
- HuggingFace Transformers Integration: Load and run directly via `AutoModel.from_pretrained` with `trust_remote_code=True`; no custom package installation required.
Benchmark Results
Measured on LibriSpeech test-clean samples in a single-GPU setup via streaming TTS. All values are averages.
| Metric | BF16 | FP8 | INT4 (AWQ) |
|---|---|---|---|
| RTF | 0.45 | 0.46 | 0.41 |
| TTFT | 887 ms | 904 ms | 827 ms |
| TBT | 233 ms | 263 ms | 242 ms |
| Checkpoint Size | 18.1 GB | 11.2 GB | 7.8 GB |
- RTF (Real-Time Factor): Lower is faster. Values below 1.0 mean faster-than-real-time synthesis.
- TTFT (Time to First Token): Latency until the first audio chunk is returned.
- TBT (Time Between Tokens): Average interval between consecutive audio chunks.
- INT4 (AWQ) offers the fastest inference and the smallest checkpoint, at the cost of a slightly higher TBT than BF16.
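To make the relationship between these metrics concrete, here is a minimal sketch (the `streaming_metrics` helper is illustrative, not part of the package) that derives RTF, TTFT, and TBT from the wall-clock timestamps of received audio chunks:

```python
def streaming_metrics(chunk_times, start_time, audio_duration_s):
    """Derive RTF, TTFT, and TBT from streaming timestamps.

    chunk_times: monotonic timestamps (seconds), one per received audio chunk
    start_time: monotonic timestamp when the request was issued
    audio_duration_s: total duration of the synthesized audio in seconds
    """
    ttft = chunk_times[0] - start_time        # time to first chunk
    total = chunk_times[-1] - start_time      # total generation time
    rtf = total / audio_duration_s            # < 1.0 means faster than real time
    gaps = [b - a for a, b in zip(chunk_times, chunk_times[1:])]
    tbt = sum(gaps) / len(gaps) if gaps else 0.0  # mean inter-chunk interval
    return {"rtf": rtf, "ttft_ms": ttft * 1000, "tbt_ms": tbt * 1000}
```

For example, chunks arriving at 0.887 s, 1.120 s, and 1.353 s for 3 s of audio give TTFT = 887 ms, TBT = 233 ms, and RTF = 0.451.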
Requirements
```bash
pip install "transformers>=4.57.1" torch torchaudio soundfile accelerate

# Optional
pip install speechbrain  # for TTS with speaker voice conditioning
pip install gradio       # for the Gradio demo
```
Quick Start
Option 1: Load from Hub (recommended)
No `pip install raon` needed.
```python
from transformers import AutoConfig
from transformers.dynamic_module_utils import get_class_from_dynamic_module

MODEL_ID = "KRAFTON/Raon-Speech-9B-AWQ-INT4"

_cfg = AutoConfig.from_pretrained(MODEL_ID, trust_remote_code=True)
RaonPipeline = get_class_from_dynamic_module(
    "modeling_raon.RaonPipeline",
    MODEL_ID,
    revision=getattr(_cfg, "_commit_hash", None),
)
del _cfg

pipe = RaonPipeline(MODEL_ID, device="cuda", dtype="bfloat16")
```
Option 2: With raon package installed
```bash
git clone https://github.com/krafton-ai/Raon-Speech.git
cd Raon-Speech/raon
pip install -e .  # or: uv sync
```
```python
from raon import RaonPipeline

# From Hub (local code + Hub weights)
pipe = RaonPipeline("KRAFTON/Raon-Speech-9B-AWQ-INT4")

# From local path
pipe = RaonPipeline("/path/to/raon-model")
```
Tasks
STT (Audio → Text)
```python
text = pipe.stt("audio.wav")
```
TTS (Text → Audio)
```python
# Without speaker conditioning
audio, sr = pipe.tts("Hello, how are you?")
pipe.save_audio((audio, sr), "output.wav")

# With speaker conditioning (requires speechbrain)
audio, sr = pipe.tts("Hello, how are you?", speaker_audio="speaker_ref.wav")
```
TextQA (Text + Audio → Text)
```python
answer = pipe.textqa("What is the speaker saying?", audio="audio.wav")
```
SpeechChat (Audio → Text)
```python
answer = pipe.speech_chat("question.wav")
```
Chat (Multimodal)
```python
messages = [
    {
        "role": "user",
        "content": [
            {"type": "audio", "audio": "audio.wav"},
            {"type": "text", "text": "Transcribe and summarise this audio."},
        ],
    },
]
response = pipe.chat(messages)
```
Deployment (vLLM-Omni)
1. Clone & Build
```bash
git clone https://github.com/krafton-ai/vllm-omni.git
cd vllm-omni
docker build -f docker/Dockerfile.ci -t vllm-omni .
```
2. Serve
```bash
docker run --rm --gpus all \
  --shm-size=16g \
  -p 8000:8000 \
  vllm-omni \
  bash -c "vllm serve KRAFTON/Raon-Speech-9B-AWQ-INT4 --omni --port 8000 --trust-remote-code --quantization awq --dtype float16"
```
3. Test — TTS
```bash
curl -X POST http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "input": "Hello, how are you?",
    "model": "KRAFTON/Raon-Speech-9B-AWQ-INT4",
    "response_format": "wav"
  }' --output output.wav
```
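The same request can be issued from Python. A minimal stdlib-only sketch against the `/v1/audio/speech` endpoint shown above (the `build_tts_payload` and `tts_request` helpers are illustrative, not part of vllm-omni):

```python
import json
import urllib.request

def build_tts_payload(text, model="KRAFTON/Raon-Speech-9B-AWQ-INT4",
                      response_format="wav"):
    """Assemble the JSON body for the /v1/audio/speech endpoint."""
    return {"input": text, "model": model, "response_format": response_format}

def tts_request(text, url="http://localhost:8000/v1/audio/speech", **kwargs):
    """POST the payload and return raw audio bytes (requires a running server)."""
    body = json.dumps(build_tts_payload(text, **kwargs)).encode("utf-8")
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        return resp.read()

# wav_bytes = tts_request("Hello, how are you?")
# open("output.wav", "wb").write(wav_bytes)
```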
4. Test — TTS with voice cloning
```bash
curl -X POST http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "input": "Hello, how are you?",
    "model": "KRAFTON/Raon-Speech-9B-AWQ-INT4",
    "ref_audio": "data:audio/wav;base64,'$(base64 -w0 speaker_ref.wav)'",
    "task_type": "Base",
    "response_format": "wav"
  }' --output cloned.wav
```
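Inlining the reference audio as a data URI, which the shell command does via `base64 -w0`, can also be done in Python; a small sketch (the `audio_data_uri` helper is hypothetical):

```python
import base64

def audio_data_uri(path):
    """Encode a local WAV file as the data:audio/wav;base64,... URI
    used by the ref_audio field."""
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("ascii")
    return f"data:audio/wav;base64,{b64}"
```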
5. Test — STT
```bash
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "KRAFTON/Raon-Speech-9B-AWQ-INT4",
    "messages": [
      {
        "role": "user",
        "content": [
          {"type": "audio_url", "audio_url": {"url": "data:audio/wav;base64,'"$(base64 -w0 audio.wav)"'"}},
          {"type": "text", "text": "Transcribe the audio into text."}
        ]
      }
    ]
  }'
```
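The chat-completions payload above can likewise be assembled in Python; a sketch with a hypothetical `stt_messages` helper that inlines the audio as a base64 data URI:

```python
import base64

def stt_messages(audio_path, prompt="Transcribe the audio into text."):
    """Build the messages list for /v1/chat/completions with the audio
    inlined as a data:audio/wav;base64,... URI."""
    with open(audio_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("ascii")
    return [{
        "role": "user",
        "content": [
            {"type": "audio_url",
             "audio_url": {"url": f"data:audio/wav;base64,{b64}"}},
            {"type": "text", "text": prompt},
        ],
    }]
```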
License
This repository is licensed under the Creative Commons Attribution-NonCommercial 4.0 International License.
© 2026 KRAFTON