BERT base for Dhivehi

Pretrained model on Dhivehi language using masked language modeling (MLM).

Tokenizer

The WordPiece tokenizer uses several components:

Normalization: lowercase and then NFKD unicode normalization.
Pretokenization: splits by whitespace and punctuation.
Postprocessing: single sentences are output in format [CLS] sentence A [SEP] and pair sentences in format [CLS] sentence A [SEP] sentence B [SEP].

Training

Training was performed over 16M+ Dhivehi sentences/paragraphs put together by @ashraq. An Adam optimizer with weighted decay was used with following parameters:

Learning rate: 1e-5
Weight decay: 0.1
Warmup steps: 10% of data

Downloads last month: 9