PleIAs/CommonLingua · Hugging Face

Brief

CommonLingua is a compact (2.35M-parameter) byte-level language identification model released on 2026-04-28. It was trained on 2,482,568 paragraphs drawn from Structured Wikipedia and the PleIAs Common Corpus, with an explicit focus on realistic, OCR-noisy documents and on long-tail coverage, including 61 African languages. The model operates on raw UTF-8 bytes (inputs padded to 512 bytes) and combines a 4096-bucket polynomial trigram-hash embedding, three causal Conv1D layers, and a single bidirectional attention layer with RoPE; in ablations, the trigram signal improved macro F1 by 1.2 points. On the CommonLID benchmark (376k held-out paragraphs, 200+ languages), CommonLingua achieves a strict accuracy of 77.63%, an equivalent-class accuracy of 82.92%, and a macro F1 of 0.7879, an 11.5-point macro F1 gain over the next-best baseline. The authors publish the training split under open licenses and provide a predict.py example along with throughput numbers (H100 bf16 ≈ 26.2k texts/sec at bs=4096).
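
The byte-level input and trigram hashing described above can be illustrated with a minimal Python sketch. Only the 512-byte length and the 4096 buckets come from the summary; the padding value, the polynomial base, the look-back (causal) trigram window, and the function names encode_bytes and trigram_buckets are assumptions made for illustration and are not taken from the CommonLingua code.

```python
from typing import List

MAX_LEN = 512       # from the summary: inputs are padded to 512 bytes
NUM_BUCKETS = 4096  # from the summary: 4096-bucket trigram hash
PAD = 0             # assumption: pad with zero bytes
BASE = 257          # assumption: hypothetical polynomial base


def encode_bytes(text: str, max_len: int = MAX_LEN) -> List[int]:
    """Turn text into a fixed-length list of UTF-8 byte values (no tokenizer)."""
    raw = text.encode("utf-8")[:max_len]  # truncate long inputs at the byte level
    return list(raw) + [PAD] * (max_len - len(raw))


def trigram_buckets(byte_ids: List[int],
                    num_buckets: int = NUM_BUCKETS,
                    base: int = BASE) -> List[int]:
    """Map each position to a bucket via a polynomial hash of the current byte
    and the two bytes before it (missing predecessors are treated as 0)."""
    buckets = []
    for i, b2 in enumerate(byte_ids):
        b0 = byte_ids[i - 2] if i >= 2 else 0
        b1 = byte_ids[i - 1] if i >= 1 else 0
        buckets.append((b0 * base * base + b1 * base + b2) % num_buckets)
    return buckets


ids = encode_bytes("héllo")   # "é" occupies two bytes in UTF-8
buckets = trigram_buckets(ids)
```

Each bucket index would then select a row of a learned embedding table with 4096 rows. Because 4096 buckets is far fewer than the 256³ possible byte trigrams, different trigrams share rows, which is the usual trade-off of a hashed embedding.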

Why it matters

CommonLingua is a 2.35M-parameter byte-level language identification model trained on 2,482,568 paragraphs (Structured Wikipedia + PleIAs Common Corpus) and released 2026-04-28 in partnership with the GSMA "AI Language Models in Africa, by Africa, for Africa" initiative.

Key details

  • Architecture: no tokenizer (raw UTF-8 bytes padded to 512), a 4096-bucket trigram hash embedding, three causal Conv1D layers, and one bidirectional attention layer with RoPE. The trigram embedding added 1.2 macro F1 points in ablations.
  • Evaluation on CommonLID (376k held-out paragraphs, 200+ languages): 334 labels, strict accuracy 77.63%, equivalent-class accuracy 82.92%, macro F1 0.7879, which is 11.5 macro F1 points above the next-best baseline.
  • Throughput: H100 (bs=4096) fp32 ≈ 10,962 texts/sec, bf16 ≈ 26,236 texts/sec (2.4× speedup); Sapphire Rapids (8 threads, bs=32) fp32 = 183, bf16 = 553 texts/sec (3.0×).
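
The speedup figures in the throughput bullet follow directly from the reported rates, as the short Python check below shows. The numbers are the ones quoted above; the helper names speedup and seconds_for are hypothetical, and the time estimate assumes throughput stays constant across a whole corpus, which real pipelines may not achieve.

```python
# Reported throughput in texts per second, as quoted in the summary.
THROUGHPUT = {
    "H100 (bs=4096)": {"fp32": 10_962, "bf16": 26_236},
    "Sapphire Rapids (8 threads, bs=32)": {"fp32": 183, "bf16": 553},
}


def speedup(fp32_rate: float, bf16_rate: float) -> float:
    """Ratio of bf16 throughput to fp32 throughput."""
    return bf16_rate / fp32_rate


def seconds_for(n_texts: int, rate: float) -> float:
    """Rough wall-clock estimate, assuming a constant processing rate."""
    return n_texts / rate


for device, rates in THROUGHPUT.items():
    ratio = speedup(rates["fp32"], rates["bf16"])
    print(f"{device}: bf16 is {ratio:.1f}x faster than fp32")

# Example: rough time to label one million texts on each device in bf16.
for device, rates in THROUGHPUT.items():
    print(f"{device}: {seconds_for(1_000_000, rates['bf16']):.0f} s per million texts")
```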