Raon-Speech-9B-AWQ-INT4
Technical Report | Blog (Coming soon)
RAON-Speech is a 9B-parameter speech language model that delivers state-of-the-art speech understanding, answering, and generation in English and Korean. It transforms a pre-trained LLM into a SpeechLM that both understands and generates speech without compromising the original language capabilities. The model is trained on millions of hours of English-Korean speech-text data in three stages: (1) speech encoder-decoder alignment, (2) end-to-end SpeechLM pre-training, and (3) multi-reward DPO-based post-training.
Key Features
- End-to-End Speech Language Model: 9B-parameter multimodal model built on Qwen3 (36 layers, 4096 hidden dim), Qwen3OmniMoeAudioEncoder (24 layers), Mimi codec (32 quantizers), and ECAPA-TDNN speaker encoder.
- Bilingual Support: State-of-the-art speech understanding, answering, and generation in both English and Korean.
- Multi-Task Capabilities: Supports STT (audio → text), TTS (text → audio), TextQA (text + audio → text), and SpeechChat (audio → text) in a single unified model.
- Speaker Voice Conditioning: TTS with optional speaker reference audio for voice cloning via ECAPA-TDNN embeddings.
- TTS Continuation: Generate speech that naturally continues from a reference audio, with prefill-based continuation for seamless prosody.
- Multi-Reward DPO Post-Training: Three-stage training pipeline — (1) speech encoder-decoder alignment, (2) end-to-end SpeechLM pre-training, and (3) multi-reward DPO-based post-training — for high-quality speech generation.
- HuggingFace Transformers Integration: Load and run directly via `AutoModel.from_pretrained` with `trust_remote_code=True`; no custom package installation required.
Benchmark Results
Measured on LibriSpeech test-clean samples in a single-GPU setup via streaming TTS. All values are averages.
| Metric | BF16 | FP8 | INT4 (AWQ) |
|---|---|---|---|
| RTF | 0.45 | 0.46 | 0.41 |
| TTFT | 887 ms | 904 ms | 827 ms |
| TBT | 233 ms | 263 ms | 242 ms |
| Checkpoint Size | 18.1 GB | 11.2 GB | 7.8 GB |
- RTF (Real-Time Factor): Lower is faster. Values below 1.0 mean faster-than-real-time synthesis.
- TTFT (Time to First Token): Latency until the first audio chunk is returned.
- TBT (Time Between Tokens): Average interval between consecutive audio chunks.
- INT4 (AWQ) offers the fastest inference and the smallest checkpoint, at the cost of a slightly higher TBT than BF16.
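To make the relationship between these metrics concrete, here is a minimal sketch (the `streaming_metrics` helper is illustrative, not part of the package) that derives RTF, TTFT, and TBT from the wall-clock timestamps of received audio chunks:

```python
def streaming_metrics(chunk_times, start_time, audio_duration_s):
    """Derive RTF, TTFT, and TBT from streaming timestamps.

    chunk_times: monotonic timestamps (seconds), one per received audio chunk
    start_time: monotonic timestamp when the request was issued
    audio_duration_s: total duration of the synthesized audio in seconds
    """
    ttft = chunk_times[0] - start_time        # time to first chunk
    total = chunk_times[-1] - start_time      # total generation time
    rtf = total / audio_duration_s            # < 1.0 means faster than real time
    gaps = [b - a for a, b in zip(chunk_times, chunk_times[1:])]
    tbt = sum(gaps) / len(gaps) if gaps else 0.0  # mean inter-chunk interval
    return {"rtf": rtf, "ttft_ms": ttft * 1000, "tbt_ms": tbt * 1000}
```

For example, chunks arriving at 0.887 s, 1.120 s, and 1.353 s for 3 s of audio give TTFT = 887 ms, TBT = 233 ms, and RTF = 0.451.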
Requirements
```bash
pip install "transformers>=4.57.1" torch torchaudio soundfile accelerate

# Optional
pip install speechbrain  # for TTS with speaker voice conditioning
pip install gradio       # for the Gradio demo
```
Quick Start
Option 1: Load from Hub (recommended)
No `pip install raon` needed.
```python
from transformers import AutoConfig
from transformers.dynamic_module_utils import get_class_from_dynamic_module

MODEL_ID = "KRAFTON/Raon-Speech-9B-AWQ-INT4"

_cfg = AutoConfig.from_pretrained(MODEL_ID, trust_remote_code=True)
RaonPipeline = get_class_from_dynamic_module(
    "modeling_raon.RaonPipeline",
    MODEL_ID,
    revision=getattr(_cfg, "_commit_hash", None),
)
del _cfg

pipe = RaonPipeline(MODEL_ID, device="cuda", dtype="bfloat16")
```
Option 2: With raon package installed
```bash
git clone https://github.com/krafton-ai/Raon-Speech.git
cd Raon-Speech/raon
pip install -e .  # or: uv sync
```
```python
from raon import RaonPipeline

# From Hub (local code + Hub weights)
pipe = RaonPipeline("KRAFTON/Raon-Speech-9B-AWQ-INT4")

# From local path
pipe = RaonPipeline("/path/to/raon-model")
```
Tasks
STT (Audio → Text)
```python
text = pipe.stt("audio.wav")
```
TTS (Text → Audio)
```python
# Without speaker conditioning
audio, sr = pipe.tts("Hello, how are you?")
pipe.save_audio((audio, sr), "output.wav")

# With speaker conditioning (requires speechbrain)
audio, sr = pipe.tts("Hello, how are you?", speaker_audio="speaker_ref.wav")
```
TextQA (Text + Audio → Text)
```python
answer = pipe.textqa("What is the speaker saying?", audio="audio.wav")
```
SpeechChat (Audio → Text)
```python
answer = pipe.speech_chat("question.wav")
```
Chat (Multimodal)
```python
messages = [
    {
        "role": "user",
        "content": [
            {"type": "audio", "audio": "audio.wav"},
            {"type": "text", "text": "Transcribe and summarise this audio."},
        ],
    },
]
response = pipe.chat(messages)
```
Deployment (vLLM-Omni)
1. Clone & Build
```bash
git clone https://github.com/krafton-ai/vllm-omni.git
cd vllm-omni
docker build -f docker/Dockerfile.ci -t vllm-omni .
```
2. Serve
```bash
docker run --rm --gpus all \
  --shm-size=16g \
  -p 8000:8000 \
  vllm-omni \
  bash -c "vllm serve KRAFTON/Raon-Speech-9B-AWQ-INT4 --omni --port 8000 --trust-remote-code --quantization awq --dtype float16"
```
3. Test — TTS
```bash
curl -X POST http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "input": "Hello, how are you?",
    "model": "KRAFTON/Raon-Speech-9B-AWQ-INT4",
    "response_format": "wav"
  }' --output output.wav
```
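The same request can be issued from Python. A minimal stdlib-only sketch against the `/v1/audio/speech` endpoint shown above (the `build_tts_payload` and `tts_request` helpers are illustrative, not part of vllm-omni):

```python
import json
import urllib.request

def build_tts_payload(text, model="KRAFTON/Raon-Speech-9B-AWQ-INT4",
                      response_format="wav"):
    """Assemble the JSON body for the /v1/audio/speech endpoint."""
    return {"input": text, "model": model, "response_format": response_format}

def tts_request(text, url="http://localhost:8000/v1/audio/speech", **kwargs):
    """POST the payload and return raw audio bytes (requires a running server)."""
    body = json.dumps(build_tts_payload(text, **kwargs)).encode("utf-8")
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        return resp.read()

# wav_bytes = tts_request("Hello, how are you?")
# open("output.wav", "wb").write(wav_bytes)
```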
4. Test — TTS with voice cloning
```bash
curl -X POST http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "input": "Hello, how are you?",
    "model": "KRAFTON/Raon-Speech-9B-AWQ-INT4",
    "ref_audio": "data:audio/wav;base64,'$(base64 -w0 speaker_ref.wav)'",
    "task_type": "Base",
    "response_format": "wav"
  }' --output cloned.wav
```
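Inlining the reference audio as a data URI, which the shell command does via `base64 -w0`, can also be done in Python; a small sketch (the `audio_data_uri` helper is hypothetical):

```python
import base64

def audio_data_uri(path):
    """Encode a local WAV file as the data:audio/wav;base64,... URI
    used by the ref_audio field."""
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("ascii")
    return f"data:audio/wav;base64,{b64}"
```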
5. Test — STT
```bash
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "KRAFTON/Raon-Speech-9B-AWQ-INT4",
    "messages": [
      {
        "role": "user",
        "content": [
          {"type": "audio_url", "audio_url": {"url": "data:audio/wav;base64,'"$(base64 -w0 audio.wav)"'"}},
          {"type": "text", "text": "Transcribe the audio into text."}
        ]
      }
    ]
  }'
```
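The chat-completions payload above can likewise be assembled in Python; a sketch with a hypothetical `stt_messages` helper that inlines the audio as a base64 data URI:

```python
import base64

def stt_messages(audio_path, prompt="Transcribe the audio into text."):
    """Build the messages list for /v1/chat/completions with the audio
    inlined as a data:audio/wav;base64,... URI."""
    with open(audio_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("ascii")
    return [{
        "role": "user",
        "content": [
            {"type": "audio_url",
             "audio_url": {"url": f"data:audio/wav;base64,{b64}"}},
            {"type": "text", "text": prompt},
        ],
    }]
```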
License
This repository is licensed under the Creative Commons Attribution-NonCommercial 4.0 International License.
© 2026 KRAFTON