---
language:
- en
- zh
- ja
- ko
- es
- fr
- de
- it
- ru
- hi
- gu
library_name: transformers
pipeline_tag: text-to-speech
license: apache-2.0
base_model: k2-fsa/OmniVoice
tags:
- text-to-speech
- tts
- singing
- emotion
- expressive-tts
- multilingual
- voice-cloning
- omnivoice
---
# OmniVoice: Singing + Emotion Finetune

A finetune of [`k2-fsa/OmniVoice`](https://huggingface.co/k2-fsa/OmniVoice) that adds:

- **`[singing]` tag**: sung speech / nursery-style melodic vocals
- **Emotion tags**: `[happy]`, `[sad]`, `[angry]`, `[excited]`, `[calm]`, `[nervous]`, `[whisper]`
- **Combined tags**: e.g. `[singing] [happy] ...` or `[singing] [sad] ...`

Original OmniVoice capabilities (multilingual zero-shot TTS, voice cloning, voice design, 600+ languages) are **preserved**: the base speech head was protected during finetuning with a continuity mix of plain speech and singing.
## Drop-in replacement

This checkpoint is fully compatible with the upstream [k2-fsa/OmniVoice](https://github.com/k2-fsa/OmniVoice) code: same architecture (Qwen3-0.6B LM + HiggsAudioV2 audio tokenizer at 24 kHz), same inference API. Just replace the model id:
```python
import soundfile as sf

from omnivoice.models.omnivoice import OmniVoice

model = OmniVoice.from_pretrained("ModelsLab/omnivoice-singing").to("cuda").eval()

# Normal speech (unchanged behavior)
audios = model.generate(
    text="The quick brown fox jumps over the lazy dog.",
    language="English",
)

# Singing
audios = model.generate(
    text="[singing] Twinkle twinkle little star, how I wonder what you are.",
    language="English",
)

# Emotional speech
audios = model.generate(
    text="[happy] I just got the best news of my entire year!",
    language="English",
)

# Combined
audios = model.generate(
    text="[singing] [sad] Quiet rain falls on the stone, memories of days now gone.",
    language="English",
)

sf.write("out.wav", audios[0], model.sampling_rate)
```
The CLI works the same way:

```bash
omnivoice-infer --model ModelsLab/omnivoice-singing \
    --text "[happy] Hello there, how wonderful to see you today!" \
    --language English \
    --output out.wav
```
## Supported tags

| Tag | Source data | Strength |
|---|---|---|
| `[singing]` | GTSinger English (6,755 clips, ~8 h) | strong |
| `[happy]` | CREMA-D + RAVDESS + Expresso (~2,900 clips) | strong |
| `[sad]` | CREMA-D + RAVDESS + Expresso (~2,900 clips) | strong |
| `[angry]` | CREMA-D + RAVDESS (~1,500 clips) | strong |
| `[nervous]` | CREMA-D fear + RAVDESS fearful (~1,400 clips) | strong |
| `[whisper]` | Expresso whisper (~1,500 clips) | strong |
| `[calm]` | RAVDESS calm (~190 clips) | weak (limited data) |
| `[excited]` | RAVDESS surprised (~190 clips) | weak (limited data) |
A guidance scale of **3.0** (up from the default 2.0) is recommended to make tag behavior more pronounced:
```python
audios = model.generate(
    text="[happy] Welcome!",
    language="English",
    guidance_scale=3.0,
)
```
## What's preserved from the base

- Multilingual TTS (English, Chinese, Japanese, Korean, Spanish, French, German, Italian, Russian, Hindi, Gujarati, etc.)
- Voice cloning from reference audio (`ref_audio` / `ref_text` args; see the sketch after this list)
- Voice design via the `instruct` parameter (pitch / gender / age / accent attributes)
- Fine-grained pronunciation control (pinyin / CMU phoneme overrides)
- Speed and duration control (`speed` / `duration` args)
- Built-in non-verbal symbols (`[laughter]`, `[sigh]`, etc.)
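Cloning and voice design reuse the same `generate()` call. The snippet below is a minimal sketch that assumes only the argument names listed above (`ref_audio`, `ref_text`, `instruct`, `speed`); the reference clip path and its transcript are placeholders, and exact argument semantics follow the upstream k2-fsa/OmniVoice API:

```python
# Sketch only: argument names come from the list above; the reference
# path and transcript are placeholders, and exact semantics follow the
# upstream k2-fsa/OmniVoice generate() API.
from omnivoice.models.omnivoice import OmniVoice

model = OmniVoice.from_pretrained("ModelsLab/omnivoice-singing").to("cuda").eval()

# Voice cloning: condition on a short reference clip, then sing in that voice.
audios = model.generate(
    text="[singing] Row, row, row your boat, gently down the stream.",
    language="English",
    ref_audio="reference_speaker.wav",  # placeholder path
    ref_text="Transcript of the reference clip.",  # placeholder transcript
)

# Voice design: describe the target voice instead of cloning one,
# slow the delivery, and end with a built-in non-verbal symbol.
audios = model.generate(
    text="[calm] Take a deep breath and relax. [sigh]",
    language="English",
    instruct="Low-pitched, calm, middle-aged female voice.",
    speed=0.9,
)
```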
## Training

Two-stage finetune from `k2-fsa/OmniVoice`:

**Stage 1: Singing** (2,500 steps):
- GTSinger English (6.8k clips, tagged `[singing] {lyrics}`)
- LibriTTS-R dev+test clean (10k clips, plain text, for speech preservation)
- LR 3e-5 cosine, bf16, 2 GPUs, batch_tokens=8192
- Final eval loss: **4.74**

**Stage 2: Emotion** (2,500 steps, forked from singing/checkpoint-2500):
- CREMA-D + RAVDESS + Expresso read config (10.8k emotion clips)
- 1.5k singing + 1.5k speech continuity samples
- LR 3e-5 cosine, bf16, 2 GPUs, batch_tokens=8192
- Best eval loss: **4.72** (step 750); final: **4.88** (step 2500)

The published checkpoint is the **final emotion step 2500**, which, despite the higher eval loss, subjectively produces the cleanest emotional tag behavior while preserving speech/singing quality.
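The `[singing] {lyrics}` tagging above amounts to a plain control-tag prefix on the transcript. A hypothetical illustration follows (the actual manifest format belongs to the upstream OmniVoice training framework, not this snippet):

```python
# Hypothetical illustration of the control-tag prefix scheme described
# above; the real training manifest format lives in the upstream
# k2-fsa/OmniVoice training framework.
def tag_sample(transcript: str, tags: tuple[str, ...] = ()) -> str:
    """Prefix a transcript with control tags, e.g. '[singing] [sad] ...'."""
    prefix = " ".join(tags)
    return f"{prefix} {transcript}" if prefix else transcript

# Stage 1 singing sample vs. plain-speech continuity sample:
print(tag_sample("Twinkle twinkle little star.", ("[singing]",)))
# [singing] Twinkle twinkle little star.
print(tag_sample("The quick brown fox jumps over the lazy dog."))
# The quick brown fox jumps over the lazy dog.
```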
## Known limitations

- `[calm]` and `[excited]` had only ~190 training samples each (only one dataset contributed), so their behavior is weaker than that of the other emotion tags.
- Cross-language singing (sung Hindi, Gujarati, etc.) is extrapolation: it works, but quality varies.
- Like the base model, output quality is bounded by the **HiggsAudioV2 tokenizer** (24 kHz, ~2 kbps, speech-domain tuned). Music / drum content is not supported by design.
## License

Apache 2.0. Downstream users must also comply with the individual licenses of the training datasets:

- GTSinger: CC BY-NC-SA 4.0 (research use)
- CREMA-D: ODbL
- RAVDESS: CC BY-NC-SA 4.0
- Expresso: CC BY-NC 4.0
- LibriTTS-R: CC BY 4.0
## Acknowledgements

- [k2-fsa/OmniVoice](https://github.com/k2-fsa/OmniVoice): base model & training framework
- [HiggsAudioV2](https://huggingface.co/bosonai/higgs-audio-v2-tokenizer): discrete audio tokenizer
- Qwen team: Qwen3-0.6B backbone
- Dataset authors: the GTSinger, CREMA-D, RAVDESS, Expresso, and LibriTTS-R teams