TrorYongASR

This repository contains model weights and configuration files for the pre-trained model.

Model Details

Model Description

TrorYongASR is an encoder-decoder model for the Automatic Speech Recognition (ASR) task. It is inspired by PARSeq and Whisper; notably, the auditory-lingual decoder has only one transformer block.


TrorYongASR has 2 configurations:

Model Size     Tiny               Small
Parameters     29M                135M
Audio Encoder  4 layers, 6 heads  12 layers, 12 heads
Text Decoder   1 layer, 12 heads  1 layer, 24 heads
Embedding Dim  384                768
Audio Context  1500               1500
Text Context   1024               1024

Note: The audio arrays are processed into log-mel spectrograms with 80 mel bins (the same as Whisper models of the same size).

  • Developed by: KHUN Kimang (Ph.D.)
  • Shared by: KrorngAI
  • Model type: ASR (Automatic Speech Recognition)
  • Language(s) (NLP): Khmer and English

Model Sources

Evaluation

The evaluation assesses two capabilities — language detection and transcription — on two datasets (google/fleurs for Khmer and openslr/librispeech_asr for English). All results are from the test split of each dataset, representing the model's generalization ability to unseen data.

Testing Data

Dataset            Language  Testing examples  Description
google/fleurs      Khmer     765               Multilingual dataset with Khmer language samples
librispeech.clean  English   2620              Clean speech dataset for English transcription

Note: Audio clips longer than 30 seconds are excluded from the evaluation (which is why google/fleurs has 765 examples instead of 771).

Metrics and Results

Language Detection

Language detection measures the model's capability to recognize the spoken language from audio input. Since TrorYongASR currently supports two languages, this is a binary classification task. Standard metrics are used:

  • Precision: Proportion of predicted languages that are correct
  • Recall: Proportion of actual language samples correctly identified
  • F1-score: Harmonic mean of precision and recall
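For a binary task these metrics reduce to simple counts over the predictions. A minimal sketch in plain Python (labels and the helper name are illustrative, not part of the evaluation code):

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Compute precision, recall, and F1 for one class (one-vs-rest)."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy example: 4 utterances, one English clip mislabelled as Khmer
y_true = ["km", "km", "en", "en"]
y_pred = ["km", "km", "km", "en"]
print(precision_recall_f1(y_true, y_pred, positive="km"))
```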

Results:

Model  Metric     Khmer (fleurs)  English (librispeech.clean)
Tiny   Precision  100%            100%
       Recall     100%            100%
       F1-score   100%            100%
Small  Precision  100%            99%
       Recall     96%             100%
       F1-score   98%             99%

The Tiny model achieved perfect language detection on both datasets, indicating excellent binary classification capability for distinguishing between Khmer and English audio. The Small model performs slightly worse, tending to over-predict English.

The 100% language detection scores may appear unusually high. This is expected because during pre-training, the model performs permutations on word tokens starting from position 3, while the first three positions (start token, language token, and task token) remain fixed. Since language detection relies on the language token at position 1, and this token is never permuted during pre-training, the model can achieve perfect accuracy on language detection tasks.
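The constraint described above can be sketched as follows: sample a random decoding order but pin the first three positions (start, language, and task tokens), so the language token is always predicted from audio alone. This is an illustrative sketch, not the actual training code:

```python
import random

def sample_permutation(seq_len, num_fixed=3, rng=random):
    """Sample a decoding order where the first `num_fixed` positions
    (start, language, task tokens) stay in place and only the remaining
    word-token positions are permuted."""
    tail = list(range(num_fixed, seq_len))
    rng.shuffle(tail)
    return list(range(num_fixed)) + tail

order = sample_permutation(10)
assert order[:3] == [0, 1, 2]                   # language token position never moves
assert sorted(order[3:]) == list(range(3, 10))  # only word tokens are shuffled
```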

Transcription

For the transcription task, the three metrics below are used:

  • Token Error Rate (TER): Proportion of incorrectly transcribed tokens
  • Character Error Rate (CER): Proportion of characters that are incorrect
  • Word Error Rate (WER): Proportion of words that are incorrect

Token Error Rate (TER) measures the model's capability to predict the next token given the audio input and the current sequence of tokens. This metric is weaker than Word Error Rate (WER) and Character Error Rate (CER) because it does not account for insertions, deletions, substitutions, and autoregression as comprehensively. TER is used here because Khmer text lacks word boundaries, making WER and CER calculations challenging without additional preprocessing.
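Under this definition, TER is just the fraction of positions where the next-token prediction (with the reference prefix fed in, i.e. teacher forcing) disagrees with the reference token. A minimal sketch, assuming aligned predicted and reference token-id sequences:

```python
def token_error_rate(ref_tokens, pred_tokens):
    """Fraction of positions where the teacher-forced prediction
    disagrees with the reference token (sequences must align 1:1)."""
    assert len(ref_tokens) == len(pred_tokens)
    errors = sum(r != p for r, p in zip(ref_tokens, pred_tokens))
    return errors / len(ref_tokens)

print(token_error_rate([5, 17, 42, 9], [5, 17, 40, 9]))  # 1 error over 4 tokens -> 0.25
```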

Transcription Results:

Model  Metric  Khmer (fleurs)  English (librispeech.clean)  Mixed (Khmer + English)
Tiny   WER     75.81%          54.33%                       60.36%
       CER     54.99%          42.41%                       46.18%
       TER     54%             17%                          27%
Small  WER     50.46%          21.75%                       29.78%
       CER     35.89%          16.58%                       22.37%
       TER     43%             8%                           18%

Key Observations:

  • The Tiny model performs noticeably better on English (54.33% WER, 42.41% CER, 17% TER) than on Khmer (75.88% WER, 54.99% CER, 54% TER), where performance drops significantly
  • The Small model shows strong performance on English (21.75% WER, 16.58% CER, 8% TER) and moderate performance on Khmer (50.46% WER, 35.89% CER, 43% TER)
  • The larger model benefits from increased embedding dimension (768 vs 384) and more layers for audio encoder (12 vs 4)

Note: To compute CER and WER, whitespace is added between words in Khmer text (Khmer text does not mark word boundaries the way English text does). The khmercut PyPI package is used to tokenize Khmer text into words, which are then joined back together with whitespace.

WER Comparison with Whisper:

Tiny         Parameters  Khmer (fleurs)  English (librispeech.clean)
TrorYongASR  29M         75.88%          54.33%
Whisper      39M         100.6%          7.6%

Small        Parameters  Khmer (fleurs)  English (librispeech.clean)
TrorYongASR  135M        50.46%          21.75%
Whisper      244M        104.4%          3.4%

Key Observations:

  • Whisper models have more parameters for comparable sizes (39M vs 29M for Tiny, 244M vs 135M for Small)
  • Whisper shows significantly lower word error rates on English (7.6% vs 54.33% for Tiny, 3.4% vs 21.75% for Small)
  • Whisper performs worse on Khmer (100.6% vs 75.88% for Tiny, 104.4% vs 50.46% for Small)
  • Error rates above 100% for Whisper on Khmer mean its output contains more errors than the reference has words (insertions push WER past 100%), i.e., Whisper effectively fails to transcribe Khmer

Note: Whisper's WER figures are taken from the Whisper paper.

Result Summary

Language Detection: Both model sizes achieved near-perfect performance across all metrics (Precision, Recall, F1-score) on both datasets, indicating excellent binary classification capability for distinguishing between Khmer and English audio. This high score is expected: during pre-training, the model permutes word tokens starting from position 3, while the first three positions (start token, language token, and task token) remain fixed. Since language detection relies on the language token at position 1, and this token is never permuted during pre-training, the model can achieve perfect accuracy on language detection.

Transcription: The Small model shows strong performance on English (21.75% WER, 16.58% CER, 8% TER) and moderate performance on Khmer (50.46% WER, 35.89% CER, 43% TER). The Tiny model performs noticeably better on English (54.33% WER, 42.41% CER, 17% TER) than on Khmer (75.88% WER, 54.99% CER, 54% TER). This indicates that TrorYongASR can be scaled to achieve higher performance.

Note on Translation Task: The models are also trained for translation task, but evaluation is deferred to future work due to scarce data (there are only 2000 examples from Khmer audio to English text, and 1000 examples from English audio to Khmer text in the pre-training).

How to Get Started with the Model

First, install the tror-yong-asr PyPI package:

pip install tror-yong-asr

Then, use the code below to get started with the model.

from transformers import AutoProcessor
from tror_yong_asr import TrorYongASRModel, transcribe, translate, detect_language

model_id = "KrorngAI/TrorYongASR-small"
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = TrorYongASRModel.from_pretrained(model_id, trust_remote_code=True)

# Identify whether the audio is Khmer or English
result1 = detect_language('/path/to/audio_file.mp3', model, processor)
print(result1)

# Transcribe the audio in its spoken language
result2 = transcribe('/path/to/audio_file.mp3', model, processor, max_tokens=64)
print(result2)

# Translate Khmer audio to English text (or English audio to Khmer text)
result3 = translate('/path/to/audio_file.mp3', model, processor, max_tokens=64)
print(result3)

Fine-tuning

Notebook (TBA)

Uses

Direct Use

The Tiny model can be used directly for:

  • Speech-to-text transcription: transcribe Khmer and English audio
  • Speech-to-text translation: translate Khmer audio to English text and English audio to Khmer text
  • Language detection: Identify whether audio is in Khmer or English (100% accuracy)
  • Edge computing: Deploy on mobile devices, IoT devices, and embedded systems due to its small size (29M parameters)
  • Real-time applications: Low latency inference suitable for real-time speech interfaces

Downstream Use

The model can be integrated into:

  • Mobile applications: Android/iOS apps with speech recognition
  • Web applications: Browser-based speech-to-text using WebAssembly
  • IoT devices: Smart speakers, voice assistants
  • Larger ASR systems: As a component in multi-language ASR pipelines

Bias, Risks, and Limitations

Technical Limitations:

  • No speech detection: The model was not trained for this task. Users need to fine-tune the model for it (TrorYongASRTokenizer has a <|nospeech|> token)
  • Translation: Training data for the translation task is scarce. Users need to fine-tune the model for better translation performance
  • Noise robustness: Performance may degrade in noisy environments
  • No timestamp output: The model does not support timestamp output

Sociotechnical Limitations:

  • Accent variability: May not perform well on diverse Khmer accents
  • Background noise: Limited robustness to background noise and reverberation
  • Speaker variability: May struggle with different speaking styles and rates

Recommendations

Users (both direct and downstream) should be made aware of the risks, biases, and limitations of the model.

Training Details

To assess the model's scalability, both the tiny and small variants were trained using the same configuration, detailed below.

Training Data

Transcription Task

For the transcription task, the model was trained on around 140 hours of Khmer audio and around 100 hours of English audio. Khmer datasets include DDD-Cambodia/khm-asr-cultural (134.6 hours), openslr/openslr, and google/fleurs. The clean.100 split of openslr/librispeech_asr was used for the English dataset.

Dataset                        Language  Training examples  Validation examples  Description
DDD-Cambodia/khm-asr-cultural  Khmer     56716              0                    Khmer ASR Cultural Dataset (split train)
openslr/openslr                Khmer     2906               0                    Multi-speaker TTS data for Khmer (split SLR42)
google/fleurs                  Khmer     1675               324                  TTS data for Khmer (split km_kh)
librispeech_asr.clean          English   28539              2703                 Clean speech dataset for English transcription

Translation Task

For the translation task, data was scarce: only 2000 examples of Khmer audio to English text and 1000 examples of English audio to Khmer text.

Training Procedure

Preprocessing

Following OpenAI's Whisper, audio clips longer than 30 seconds are filtered out. All audio has a 16 kHz sample rate. For the English dataset, all texts are lowercased.
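A minimal sketch of that filtering and text normalization, assuming each example carries a raw waveform array and its sampling rate (the field names are illustrative, not the actual dataset schema):

```python
MAX_SECONDS = 30.0

def keep_example(example):
    """Drop clips longer than 30 s; duration = samples / sample rate."""
    duration = len(example["array"]) / example["sampling_rate"]
    return duration <= MAX_SECONDS

def normalize_text(text, language):
    """English references are lowercased; Khmer text is left unchanged."""
    return text.lower() if language == "en" else text

too_long = {"array": [0.0] * (31 * 16_000), "sampling_rate": 16_000}
print(keep_example(too_long))               # 31 s > 30 s, so dropped
print(normalize_text("Hello World", "en"))
```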

Training Hyperparameters

  • Training regime: 16-bit mixed-precision training using the Lightning package
  • Optimizer: MuonAdamW (custom implementation)
  • Learning rate schedule: linear warmup (38 optimizer steps) + cosine annealing (3774 optimizer steps)
  • Weight decay: 0.1
  • Effective Batch size: 64
  • Number of optimizer steps: 3812
  • Number of epochs: roughly 2 epochs
  • Gradient Clip Value: 0.5 (only for parameters trained by AdamW)
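The learning-rate schedule above (38 linear-warmup steps followed by 3774 cosine-annealing steps, 3812 in total) can be sketched as a function of the optimizer step. The peak and floor learning rates are placeholders, since the card does not state them:

```python
import math

WARMUP_STEPS = 38
COSINE_STEPS = 3774  # 38 + 3774 = 3812 total optimizer steps

def learning_rate(step, peak_lr=1e-3, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine annealing down to min_lr.
    peak_lr / min_lr are illustrative placeholders, not values from the card."""
    if step < WARMUP_STEPS:
        return peak_lr * (step + 1) / WARMUP_STEPS
    t = (step - WARMUP_STEPS) / COSINE_STEPS  # progress through the decay phase
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * min(t, 1.0)))
```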

Speeds, Sizes, Times

The training was conducted over 3812 optimizer steps.

  • The tiny variant trained in around 6 hours on one Tesla T4 GPU.
  • The small variant trained in around 7 hours on two Tesla T4 GPUs (using the DDP strategy).

Citation

BibTeX:

@online{khun2026,
  author = {Khun, Kimang},
  title = {TrorYongASR: {Permuted} {AutoRegressive} {Sequence}
    {Modeling} for {Automatic} {Speech} {Recognition}},
  date = {2026-05-07},
  url = {https://kimang18.github.io/krorngai-blog/TrorYongASR/},
  langid = {en}
}

Model Card Author

  • ឈ្មោះ: បណ្ឌិត ឃុន គីមអាង
  • Name: KHUN Kimang (Ph.D.)

Acknowledgement

LightningAI, Kaggle, and Google Colab did not sponsor this project, but both models were trained thanks to their free credits. Huge thanks to LightningAI, Kaggle, and Google Colab.

Thanks to the authors of PARSeq and Whisper for their publicly available source code.

Thanks to OpenSLR, Mozilla Data Collective, and Google for their publicly available datasets.

Model Card Contact

If you have any questions, please reach out via our Facebook Page.
