Sarvam AI Launches Sarvam-1, New Language Model Optimised for Indian Languages

The model was trained on Yotta’s Shakti cluster, utilising 1,024 GPUs over a five-day period, with Nvidia's NeMo framework facilitating the training process.

Highlights

  • Supports ten Indian languages and English, including Bengali, Gujarati, Hindi, and more.
  • Built on a 2-trillion-token dataset spread evenly across the ten languages, except for Hindi, which accounts for about 20 percent.
  • Trained on Yotta's Shakti cluster using 1,024 GPUs over five days.

Bengaluru-based Sarvam AI has launched a new large language model (LLM), Sarvam-1. The 2-billion-parameter model is optimised for ten major Indian languages alongside English: Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil, and Telugu, the official release said. According to the company, the model addresses a technological gap faced by more than a billion speakers of Indic languages, which have largely been underserved by existing LLMs.

Key Features and Performance Enhancements

Sarvam-1 was built from the ground up to improve two critical areas: token efficiency and data quality. According to the company, existing multilingual models exhibit high token fertility (the average number of tokens needed per word) for Indic scripts, often requiring 4-8 tokens per word compared with 1.4 for English. Sarvam-1's tokeniser, by contrast, achieves token fertility rates of 1.4-2.1 across all supported languages.
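
To make the metric concrete, the following is a minimal Python sketch of how token fertility could be measured. It assumes words are separated by whitespace, and both tokenisers are hypothetical stand-ins written for illustration; neither is Sarvam-1's tokeniser nor the method the company used for its reported figures.

```python
from typing import Callable, List


def token_fertility(text: str, tokenize: Callable[[str], List[str]]) -> float:
    """Average number of tokens produced per whitespace-separated word."""
    words = text.split()
    if not words:
        raise ValueError("text must contain at least one word")
    return len(tokenize(text)) / len(words)


def whole_word_tokenizer(text: str) -> List[str]:
    # Hypothetical best case: every word maps to exactly one token.
    return text.split()


def fixed_chunk_tokenizer(text: str, chunk: int = 2) -> List[str]:
    # Hypothetical stand-in for a tokeniser with poor coverage of a script:
    # each word is broken into pieces of at most `chunk` characters.
    return [
        word[i:i + chunk]
        for word in text.split()
        for i in range(0, len(word), chunk)
    ]


sample = "large language models"
print(token_fertility(sample, whole_word_tokenizer))   # 1.0
print(token_fertility(sample, fixed_chunk_tokenizer))  # 10 tokens over 3 words, about 3.33
```

A lower fertility means fewer tokens for the same text, which in general lets more content fit into a fixed context window and reduces the number of tokens processed per sentence.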

Sarvam-2T Corpus

A significant challenge in developing effective language models for Indian languages has been the lack of high-quality training data. "While web-crawled Indic language data exists, it often lacks depth and quality," Sarvam AI noted.

To address this, the team created Sarvam-2T, a training corpus of approximately 2 trillion tokens, distributed evenly across the ten languages except for Hindi, which makes up about 20 percent of the data. The company says it used synthetic-data-generation techniques to build a high-quality corpus specifically for these Indic languages.

"The Sarvam 1 model is the first example of an LLM trained from scratch with data, research, and compute being fully in India," said Pratyush Kumar, Co-Founder, Sarvam. He added, "We expect it to power a range of use cases including voice and messaging agents. This is the beginning of our mission to build full stack sovereign AI. We are deeply excited to be working together with Nvidia towards this mission."

"Enterprises are seeking to leverage generative AI to accelerate innovation and tackle complex challenges at scale," said Kari Briski, vice president of AI software, models and services at Nvidia. "Sarvam AI's multilingual model, developed using Nvidia's full-stack AI platform including NeMo and Hopper GPUs, showcases how tailored AI solutions can address linguistic diversity and drive inclusive technological growth in regions like India."

Edge Device Deployment

According to the company, Sarvam-1 outperforms comparable models such as Gemma-2-2B and Llama-3.2-3B on standard benchmarks, while achieving results similar to Llama-3.1-8B. Its compact size allows for 4-6x faster inference, making it particularly suitable for practical applications, including edge device deployment.

Key Improvements

Key improvements in Sarvam-2T include twice the average document length compared to existing datasets, a threefold increase in high-quality samples, and a balanced representation of scientific and technical content.

Sarvam claims Sarvam-1 is the first Indian language LLM. The model was trained on Yotta’s Shakti cluster, utilising 1,024 GPUs over a five-day period, with Nvidia's NeMo framework facilitating the training process.

Reported By

Kirpa B is passionate about the latest advancements in Artificial Intelligence technologies and has a keen interest in telecom. In her free time, she enjoys gardening or diving into insightful articles on AI.
