Bengaluru-based Sarvam AI has launched a new large language model (LLM), Sarvam-1. According to the official release, the 2-billion-parameter model is optimised to support ten major Indian languages alongside English: Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil, and Telugu. The model addresses the technological gap faced by more than a billion speakers of Indic languages, which have largely been underserved by existing LLMs.
Key Features and Performance Enhancements
Sarvam-1 was built from the ground up to improve two critical areas: token efficiency and data quality. According to the company, traditional multilingual models exhibit high token fertility (the number of tokens needed per word) for Indic scripts, often requiring 4-8 tokens per word compared to 1.4 for English. In contrast, Sarvam-1's tokeniser achieves improved efficiency, with token fertility rates of just 1.4-2.1 across all supported languages.
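The token-fertility metric described above can be sketched in a few lines of Python. The `tokenize()` function here is a hypothetical stand-in (a whole-word vocabulary with a byte fallback), not Sarvam-1's actual tokeniser; it only illustrates why words outside an English-centric vocabulary explode into many tokens, since each Devanagari character occupies three bytes in UTF-8.

```python
def tokenize(text: str, vocab: set[str]) -> list[str]:
    """Toy tokenizer: words in the vocabulary become one token each;
    out-of-vocabulary words fall back to one token per UTF-8 byte."""
    tokens = []
    for word in text.split():
        if word in vocab:
            tokens.append(word)
        else:
            tokens.extend(f"<0x{b:02X}>" for b in word.encode("utf-8"))
    return tokens

def fertility(text: str, vocab: set[str]) -> float:
    """Token fertility: tokens produced per whitespace-separated word."""
    return len(tokenize(text, vocab)) / len(text.split())

# An English-only vocabulary keeps English fertility at 1.0, while every
# Hindi word misses the vocabulary and decomposes into UTF-8 bytes.
vocab = {"India", "builds", "language", "models"}
print(fertility("India builds language models", vocab))  # 1.0
print(fertility("भारत भाषा मॉडल", vocab))                 # 12.0
```

A production tokeniser would use subword units rather than whole words, but the mechanism is the same: the better the vocabulary covers a script, the lower the fertility, which is what Sarvam-1's 1.4-2.1 figure reflects.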
Sarvam-2T Corpus
A significant challenge in developing effective language models for Indian languages has been the lack of high-quality training data. "While web-crawled Indic language data exists, it often lacks depth and quality," Sarvam AI noted.
To address this, the team created Sarvam-2T, a training corpus of approximately 2 trillion tokens distributed across the ten languages, with Hindi accounting for the largest share at about 20 percent of the data. Using advanced synthetic-data-generation techniques, the company has developed a high-quality corpus specifically for these Indic languages.
"The Sarvam 1 model is the first example of an LLM trained from scratch with data, research, and compute being fully in India," said Pratyush Kumar, Co-Founder, Sarvam. He added: "We expect it to power a range of use cases including voice and messaging agents. This is the beginning of our mission to build full stack sovereign AI. We are deeply excited to be working together with Nvidia towards this mission."
"Enterprises are seeking to leverage generative AI to accelerate innovation and tackle complex challenges at scale," said Kari Briski, vice president of AI software, models and services at Nvidia. "Sarvam AI's multilingual model, developed using Nvidia's full-stack AI platform including NeMo and Hopper GPUs, showcases how tailored AI solutions can address linguistic diversity and drive inclusive technological growth in regions like India."
Edge Device Deployment
According to the company, Sarvam-1 has demonstrated exceptional performance on standard benchmarks, outperforming comparable models like Gemma-2-2B and Llama-3.2-3B, while achieving similar results to the larger Llama-3.1-8B. Its compact size allows for 4-6x faster inference, making it particularly suitable for practical applications, including edge device deployment.
Key Improvements
Key improvements in Sarvam-2T include twice the average document length compared to existing datasets, a threefold increase in high-quality samples, and a balanced representation of scientific and technical content.
Sarvam claims Sarvam-1 is the first Indian-language LLM. The model was trained on Yotta's Shakti cluster, utilising 1,024 GPUs over a five-day period, with Nvidia's NeMo framework facilitating the training process.