TL;DR ARK-ASR-0.6B is a 0.6B-parameter automatic speech recognition model trained with teacher-data adaptation and on-policy distillation. The accompanying training, inference, and evaluation code is available at AutoArk/open-audio-opd.
Abstract
ARK-ASR is an ASR student model optimized with the teacher-data adaptation + online policy distillation (TD + OPD) recipe from open-audio-opd.
Instead of relying only on static supervised transcripts, OPD lets the student generate transcripts online and trains it against token-level teacher scores on the student's own generated behavior. This checkpoint corresponds to the Ark-Base+TD+OPD (0.6B) model reported in the open-audio-opd results.
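To make the idea concrete, here is a minimal, illustrative sketch of a token-level distillation signal computed on a student-generated transcript. It assumes a reverse-KL objective, KL(student || teacher), averaged over the generated positions. The source does not state the exact loss used in open-audio-opd, so the objective, the function names, and the toy logits are all assumptions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reverse_kl(student_logits, teacher_logits):
    """KL(student || teacher) at a single token position."""
    p = softmax(student_logits)
    q = softmax(teacher_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def opd_loss(student_logits_per_step, teacher_logits_per_step):
    """Mean per-token reverse KL over one student-generated transcript.

    Both arguments hold one logit vector per generated token. The teacher
    logits are assumed to come from scoring the student's own rollout.
    """
    per_token = [
        reverse_kl(s, t)
        for s, t in zip(student_logits_per_step, teacher_logits_per_step)
    ]
    return sum(per_token) / len(per_token)


# Toy rollout of two tokens over a three-word vocabulary (hypothetical values).
student = [[2.0, 0.5, 0.1], [0.3, 1.5, 0.2]]
teacher = [[2.5, 0.2, 0.1], [0.1, 2.0, 0.1]]
loss = opd_loss(student, teacher)
```

The loss is zero when the student already matches the teacher on its own rollout and grows as the two distributions diverge, which is what lets the teacher correct the student on the tokens the student actually produces.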
ARK-ASR currently supports Chinese, English, German, Japanese, French, and Korean ASR.
Model Overview
Figure 1: ARK-ASR architecture. Audio is encoded by a Whisper-style encoder with RoPE, merged through an MLP adapter, and injected into a Qwen2 decoder by replacing audio placeholder token embeddings before transcript generation.
- Model size: 0.6B parameters
- Task: automatic speech recognition
- Architecture: audio-capable autoregressive Transformers model with custom arkasr remote code
- Checkpoint format: safetensors
- Sampling rate: 16 kHz
- Recommended inference code: scripts/infer/ark_asr_transformers.py
The model should be loaded with trust_remote_code=True. The official inference script handles the processor, tokenizer, audio prompt format, generation cleanup, and ASR token filtering.
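The embedding replacement described in the Figure 1 caption can be illustrated with a small, framework-free sketch. It is not the model's actual remote code: the placeholder id, the function name, and the list-of-lists representation are hypothetical and chosen only to show the mechanism of swapping placeholder token embeddings for adapter outputs.

```python
# Hypothetical placeholder id; the real id is defined by the model's tokenizer.
AUDIO_PLACEHOLDER_ID = -1


def inject_audio_embeddings(token_ids, text_embeddings, audio_embeddings,
                            placeholder_id=AUDIO_PLACEHOLDER_ID):
    """Return a copy of text_embeddings with each placeholder row replaced,
    in order, by the corresponding audio embedding row."""
    n_slots = sum(1 for t in token_ids if t == placeholder_id)
    if n_slots != len(audio_embeddings):
        raise ValueError(
            f"{n_slots} placeholder tokens but {len(audio_embeddings)} audio embeddings"
        )
    audio_iter = iter(audio_embeddings)
    merged = []
    for tok, emb in zip(token_ids, text_embeddings):
        # Placeholder positions take the next audio row; others keep the text row.
        merged.append(list(next(audio_iter)) if tok == placeholder_id else list(emb))
    return merged


# Toy sequence: one text token, two audio slots, one text token.
ids = [5, AUDIO_PLACEHOLDER_ID, AUDIO_PLACEHOLDER_ID, 7]
text = [[0.0, 0.0], [9.0, 9.0], [9.0, 9.0], [1.0, 1.0]]
audio = [[0.5, 0.5], [0.6, 0.6]]
merged = inject_audio_embeddings(ids, text, audio)
```

The decoder then consumes the merged sequence like any other embedding sequence, which is why the number of placeholder tokens in the prompt has to match the number of audio frames produced by the adapter.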
Performance
The following results are from the open-audio-opd evaluation. Lower CER/WER is better. Bold numbers mark the best result within the 0.6B group.
| Model | aishell-1 (CER) | Wenet-meeting (CER) | Wenet-net (CER) | Libri-clean (WER) | Libri-other (WER) |
|---|---|---|---|---|---|
| 0.6B models | | | | | |
| Ark-Base (0.6B) | 3.48% | 10.22% | 7.74% | 3.75% | 7.17% |
| Ark-Base+OPD (0.6B) | 3.00% | 7.18% | 6.13% | 2.88% | 5.50% |
| Ark-Base+TD+OPD (0.6B) | **1.95%** | 5.92% | **5.39%** | **2.45%** | **4.56%** |
| Qwen3-ASR-0.6B | 2.07% | **5.57%** | 5.45% | 2.81% | 5.05% |
| Larger reference model | | | | | |
| Qwen3-ASR-1.7B | 1.50% | 4.69% | 4.55% | 2.20% | 4.05% |
Ark-Base is the 0.6B supervised ASR checkpoint trained on 100k hours of ASR audio. TD denotes teacher-data adaptation using 2,000 hours of teacher-generated ASR data. OPD denotes on-policy distillation with a Qwen-ASR teacher.
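For readers unfamiliar with the metrics in the table, the sketch below computes WER and CER from the Levenshtein edit distance between a reference and a hypothesis. It is a plain reference implementation of the standard definitions and performs no text normalization (casing, punctuation, numerals); the normalization applied by the repository's evaluation script is not described here, so scores from this sketch are not guaranteed to match the table.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance divided by reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate, ignoring spaces."""
    ref_chars = list(reference.replace(" ", ""))
    if not ref_chars:
        raise ValueError("reference must contain at least one character")
    hyp_chars = list(hypothesis.replace(" ", ""))
    return edit_distance(ref_chars, hyp_chars) / len(ref_chars)


english_wer = wer("the cat sat", "the cat sit")  # one substitution out of three words
chinese_cer = cer("你好世界", "你好世")            # one deletion out of four characters
```

CER is the usual choice for Chinese benchmarks such as aishell-1 because the text has no whitespace word boundaries, while WER is used for the LibriSpeech test sets.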
Inference
Run ASR inference with Hugging Face Transformers:
import torch
from transformers import AutoModelForCausalLM, AutoProcessor, AutoTokenizer

model_path = "AutoArk-AI/ARK-ASR-0.6B"
audio_path = "assets/libai.wav"
device = "cuda" if torch.cuda.is_available() else "cpu"
# Use half precision on GPU and full precision on CPU.
torch_dtype = torch.float16 if device == "cuda" else torch.float32

# The processor, tokenizer, and model all rely on the custom remote code.
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    trust_remote_code=True,
    torch_dtype=torch_dtype,
    attn_implementation="sdpa",
).to(device)

conversation = [
    {
        "role": "user",
        "content": [
            {"type": "audio", "path": audio_path},
            {"type": "text", "text": "Please transcribe this audio."},
        ],
    }
]

inputs = processor.apply_chat_template(
    conversation,
    add_generation_prompt=True,
    return_tensors="pt",
)
inputs = inputs.to(device)
# Match the audio tensor dtype to the model weights.
if "audios" in inputs:
    inputs["audios"] = inputs["audios"].to(dtype=torch_dtype)

# Forbid every special token except EOS during generation.
bad_words_ids = [
    [token_id]
    for token_id in tokenizer.all_special_ids
    if token_id != tokenizer.eos_token_id
]

outputs = model.generate(
    **inputs,
    do_sample=False,
    max_new_tokens=256,
    pad_token_id=tokenizer.pad_token_id,
    eos_token_id=tokenizer.eos_token_id,
    bad_words_ids=bad_words_ids,
)

# Decode only the newly generated tokens, skipping the prompt.
decoded_outputs = tokenizer.batch_decode(
    outputs[:, inputs.input_ids.shape[1]:],
    skip_special_tokens=True,
)
print(decoded_outputs)
For batch JSONL inference, use the open-source inference code:
git clone https://github.com/AutoArk/open-audio-opd
cd open-audio-opd
pip install -e .
The input JSONL should contain one ASR sample per line:
{"audio":"/path/to/audio.wav","text":"","task":"asr","begin_time":-1,"end_time":-1}
python scripts/infer/ark_asr_transformers.py \
    --input /path/to/input.jsonl \
    --output runs/infer/predictions.jsonl \
    --model_path AutoArk-AI/ARK-ASR-0.6B \
    --processor_path AutoArk-AI/ARK-ASR-0.6B \
    --batch_size 40 \
    --dtype float16 \
    --attn_impl sdpa
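The input manifest described above can be produced with a short script. The sketch below is an assumption-laden helper, not part of open-audio-opd: the function name `write_asr_manifest` is hypothetical, and the `text`, `task`, `begin_time`, and `end_time` values are simply copied from the example record, since their exact semantics are not documented here.

```python
import json
from pathlib import Path


def write_asr_manifest(audio_paths, manifest_path):
    """Write one JSON record per audio file, mirroring the example record's fields."""
    audio_paths = list(audio_paths)
    manifest_path = Path(manifest_path)
    manifest_path.parent.mkdir(parents=True, exist_ok=True)
    with manifest_path.open("w", encoding="utf-8") as f:
        for audio in audio_paths:
            record = {
                "audio": str(audio),
                "text": "",        # left empty, as in the example record
                "task": "asr",
                "begin_time": -1,  # copied from the example record
                "end_time": -1,    # copied from the example record
            }
            # ensure_ascii=False keeps non-ASCII paths readable in the file.
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return len(audio_paths)
```

For example, `write_asr_manifest(sorted(Path("/path/to/wavs").glob("*.wav")), "/path/to/input.jsonl")` would write one line per WAV file in a directory, which can then be passed to `--input`.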
The output JSONL preserves input metadata and adds:
- pred_text: cleaned prediction text for downstream evaluation
- pred_text_raw: raw decoded generation before cleanup
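A minimal sketch for reading the predictions back in Python follows. It assumes that each output line still carries the input `audio` field (the text above says input metadata is preserved) alongside `pred_text` and `pred_text_raw`; the function name `load_predictions` is hypothetical.

```python
import json


def load_predictions(jsonl_path):
    """Return (audio, pred_text, pred_text_raw) tuples from an output JSONL file."""
    records = []
    with open(jsonl_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            rec = json.loads(line)
            # pred_text is required; the other fields fall back to None if absent.
            records.append((rec.get("audio"), rec["pred_text"], rec.get("pred_text_raw")))
    return records
```

Comparing `pred_text` with `pred_text_raw` for a handful of samples is a quick way to see what the cleanup step removes before scoring.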
Evaluation
The repository also includes a J/WER evaluation entrypoint:
python scripts/eval/eval_jwer_ark_asr_transformers.py \
    --input /path/to/test.jsonl \
    --output runs/eval/result.jsonl \
    --model_path AutoArk-AI/ARK-ASR-0.6B \
    --processor_path AutoArk-AI/ARK-ASR-0.6B \
    --batch_size 40 \
    --dtype float16 \
    --attn_impl sdpa
No evaluation audio or dataset files are bundled with this model repository.
Acknowledgements
The training code is based on THUNLP/OPD and verl. The OPD recipe uses a stronger ASR teacher to score online student rollouts.
Citation
If you find ARK-ASR or open-audio-opd useful, please cite or link the project repository:
@misc{open_audio_opd_ark_asr,
  title = {open-audio-opd: Industrial ASR Online Policy Distillation Training Code},
  author = {AutoArk AI},
  year = {2026},
  howpublished = {\url{https://github.com/AutoArk/open-audio-opd}}
}