Sri Lankan researchers achieve landmark Sinhala AI breakthrough with publication in prestigious IEEE Access journal
May 14, 2026 12:35 pm
While global tech giants spend billions training artificial intelligence that still struggles in Sinhala, a Sri Lankan research team has quietly built a model that does it properly — using just two GPUs.
In a blind evaluation, the new Sinhala language model scored 4.5 out of 5, compared with just 1 out of 5 for the base Meta Llama 3.1 model on the same Sinhala prompts. The team also cut the model's perplexity, a standard measure of how well a language model predicts text (lower is better), by close to 90 percent, according to a statement.
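For readers curious about the metric itself, perplexity is simply the exponential of the average negative log-likelihood per token, so lower values mean the model finds real text less surprising. A minimal sketch of how it is typically computed with the Hugging Face transformers library follows; this is not the team's evaluation code, and the model name and sample text are placeholders.

```python
# Minimal sketch of computing perplexity for a causal language model.
# Not the team's evaluation code; model name and text are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.1-8B"  # placeholder base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

text = "..."  # a held-out Sinhala sentence would go here
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    # Passing labels makes the model return the mean cross-entropy loss
    outputs = model(**inputs, labels=inputs["input_ids"])

# Perplexity = exp(average negative log-likelihood per token)
perplexity = torch.exp(outputs.loss).item()
print(f"Perplexity: {perplexity:.2f}")
```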
In practical terms, the model can hold a natural conversation in Sinhala, answer questions, follow instructions, and stay coherent across long responses.
The research has been accepted for publication in IEEE Access, a peer-reviewed open-access journal of the Institute of Electrical and Electronics Engineers (IEEE), with a Journal Impact Factor of 3.6 in the 2024 Clarivate Journal Citation Reports and an h5-index above 200 on Google Scholar Metrics.
IEEE Access operates a binary review policy — reviewers either accept or reject a manuscript in the form submitted, with no revision cycle — a quality bar few low-resource-language AI papers have cleared.
Why this matters for Sri Lanka
Sinhala is spoken by more than 20 million people, but it is barely represented in the training data of the AI systems everyone is now talking about. Ask ChatGPT, Claude, or Gemini something in Sinhala and the answers are often broken, repetitive, or nonsensical.
The deeper issue is sovereignty. Even when foreign AI tools do work in Sinhala, Sri Lanka has no control over them — the model weights, the training data, the safety rules, and ultimately the off-switch all sit with companies in the United States or China, the statement said.
For a country where most government, healthcare, and education conversations happen in Sinhala, depending entirely on AI built and operated abroad poses a structural risk to data privacy, national security, cultural framing, and basic continuity of service whenever foreign policy, pricing, or licensing shifts.
A sovereign Sinhala LLM changes that equation. It can be hosted locally, audited locally, fine-tuned for Sri Lankan contexts, and continue to operate regardless of what any foreign tech company decides next — opening the door to Sinhala-speaking AI assistants for government services, educational tools for Sinhala-medium students, healthcare information for elderly and rural users, accessibility tools for citizens who do not speak English, and natural-sounding customer service for local businesses.
Built on a tight budget
Major AI labs in the United States use thousands of GPUs and spend hundreds of millions of dollars to train comparable systems. This team did it with two GPUs over a few weeks of training, and had to build its datasets from scratch because no large, clean Sinhala corpus existed.
The team scraped Sinhala news sites, books, and online sources, and used Hindi datasets as a starting point — Hindi and Sinhala share Indo-Aryan roots — to build a final dataset of around 3.6 million question-answer pairs and 4 billion tokens, one of the largest public Sinhala AI datasets, now freely available on Hugging Face, the statement said.
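For developers who want to work with such a corpus, a minimal sketch of loading an instruction dataset with the Hugging Face datasets library is shown below; the repository name and field names are placeholders, since the statement does not give the exact identifiers.

```python
# Sketch of loading a Sinhala question-answer dataset from Hugging Face.
# The repository ID and column names are placeholders, not the team's actual release.
from datasets import load_dataset

dataset = load_dataset("example-org/sinhala-instruct", split="train")  # hypothetical ID

# Each record is assumed to hold one question-answer pair
for row in dataset.select(range(3)):
    print(row["question"], "->", row["answer"])
```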
The team also redesigned how the model reads Sinhala. The original Llama tokenizer needed an average of 91 tokens per Sinhala sentence and fell back to raw byte-level encoding for 97.5 percent of Sinhala characters. After around 35,000 Sinhala-specific tokens were added, that dropped to 23 tokens per sentence with no byte-level fallbacks.
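Extending a pretrained tokenizer in this way generally involves adding the new vocabulary entries and resizing the model's embedding matrix so they can be trained. The sketch below illustrates that general technique, not the team's exact pipeline; the model name is a placeholder and the tiny token list stands in for the roughly 35,000 tokens described above.

```python
# Sketch of the general technique: add language-specific tokens to a pretrained
# tokenizer and resize the model's embeddings. Not the team's exact pipeline;
# the base model and token list are placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.1-8B"  # placeholder base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# In practice the new tokens would come from a subword model trained on a
# Sinhala corpus; this short hand-picked list is only illustrative.
new_sinhala_tokens = ["ශ්‍රී", "ලංකාව", "ආයුබෝවන්"]
num_added = tokenizer.add_tokens(new_sinhala_tokens)

# Resize the embedding matrix so the added tokens get trainable vectors
model.resize_token_embeddings(len(tokenizer))
print(f"Added {num_added} tokens; vocabulary size is now {len(tokenizer)}")
```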
Who built it
The project was conducted at the Department of Electrical Engineering, University of Moratuwa, led by Sanjeewa Alwis, CEO of Decryptogen; Dr. Chathura Wanigasekara (Senior Member, IEEE) of the Institute of Maritime Technologies and Propulsion Systems at the German Aerospace Centre (DLR), Geesthacht; and Dr. Logeeshan Velmanickam (Member, IEEE), Senior Lecturer at the Department. Dr. Wanigasekara and Dr. Logeeshan are the corresponding authors.
The core engineering work — model training, dataset construction, tokenizer redesign, and evaluation — was carried out by P. K. Udith I. Sandaruwan, Nimesh M. A. Fonseka, and Pamith C. Salwathura (Student Member, IEEE), all University of Moratuwa graduates, working in collaboration with the Decryptogen R&D team. They earlier presented a preliminary version of the work at the IEEE AIIoT Congress in Seattle in 2025; the IEEE Access paper is the full, finalized version.
Sanjeewa Alwis has led Decryptogen into an international operation across Europe, the United States, and Australia, with a focus on decentralized large language model training and blockchain-integrated AI. He has long argued that emerging regions need to build their own sovereign AI capacity rather than wait for foreign tech companies to include them, the statement added.
What comes next
Next steps include longer training runs, larger and more diverse Sinhala datasets, and deployments in assistive technologies and conversational systems for Sinhala speakers. The full paper, “End-to-End Adaptation of LLMs for Low-Resource Languages,” will appear in IEEE Access under DOI 10.1109/ACCESS.2026.3693119. The datasets are publicly available on Hugging Face.