Sarvam AI Launches Sarvam-1, New Language Model Optimised for Indian Languages

The model was trained on Yotta’s Shakti cluster, utilising 1,024 GPUs over a five-day period, with Nvidia's NeMo framework facilitating the training process.


Highlights

  • Supports ten Indian languages alongside English, including Bengali, Gujarati, and Hindi.
  • Built on a 2-trillion-token dataset, with Hindi making up about 20 percent and the rest spread across the other languages.
  • Trained on Yotta's Shakti cluster using 1,024 GPUs over five days.


Bengaluru-based Sarvam AI has launched a new large language model (LLM), Sarvam-1. This 2-billion-parameter model is optimised to support ten major Indian languages alongside English: Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil, and Telugu, the official release said. The model addresses the technological gap faced by more than a billion speakers of Indic languages, who have largely been underserved by existing LLMs.

Key Features and Performance Enhancements

Sarvam-1 was built from the ground up to improve two critical areas: token efficiency and data quality. According to the company, traditional multilingual models exhibit high token fertility (the number of tokens needed per word) for Indic scripts, often requiring 4-8 tokens per word compared to 1.4 for English. In contrast, Sarvam-1's tokeniser achieves improved efficiency, with token fertility rates of just 1.4-2.1 across all supported languages.
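The token fertility metric mentioned above is straightforward to compute: divide the number of tokens a tokeniser emits by the number of whitespace-separated words in the text. Here is a minimal Python sketch of that calculation; the 3-character chunker is a toy stand-in for a real tokeniser (not Sarvam's), used only to show how the metric works.

```python
def token_fertility(tokenize, text):
    """Return tokens-per-word (fertility) for a tokeniser over the text."""
    words = text.split()
    tokens = tokenize(text)
    return len(tokens) / len(words)

def toy_tokenize(text):
    # Toy stand-in tokeniser: break each whitespace-separated word
    # into 3-character chunks. A real subword tokeniser would be
    # learned from data, but the fertility arithmetic is identical.
    return [w[i:i + 3] for w in text.split() for i in range(0, len(w), 3)]

fertility = token_fertility(toy_tokenize, "language models tokenize text")
print(fertility)  # 10 tokens over 4 words -> 2.5
```

By this measure, a script that a tokeniser handles poorly (many tokens per word) sits at fertility 4-8, while Sarvam-1's reported 1.4-2.1 means most words map to one or two tokens, directly reducing sequence length and inference cost.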

Sarvam-2T Corpus

A significant challenge in developing effective language models for Indian languages has been the lack of high-quality training data. "While web-crawled Indic language data exists, it often lacks depth and quality," Sarvam AI noted.

To address this, the team created Sarvam-2T, a training corpus of approximately 2 trillion tokens, with Hindi making up about 20 percent of the data and the remainder distributed across the other nine languages. Using advanced synthetic-data-generation techniques, the company has developed a high-quality corpus specifically for these Indic languages.

"The Sarvam-1 model is the first example of an LLM trained from scratch with data, research, and compute being fully in India," said Pratyush Kumar, Co-Founder, Sarvam. He added: "We expect it to power a range of use cases including voice and messaging agents. This is the beginning of our mission to build full-stack sovereign AI. We are deeply excited to be working together with Nvidia towards this mission."

"Enterprises are seeking to leverage generative AI to accelerate innovation and tackle complex challenges at scale," said Kari Briski, vice president of AI software, models and services at Nvidia. "Sarvam AI's multilingual model, developed using Nvidia's full-stack AI platform including NeMo and Hopper GPUs, showcases how tailored AI solutions can address linguistic diversity and drive inclusive technological growth in regions like India."

Edge Device Deployment

According to the company, Sarvam-1 has demonstrated exceptional performance on standard benchmarks, outperforming comparable models like Gemma-2-2B and Llama-3.2-3B, while achieving similar results to Llama 3.1 8B. Its compact size allows for 4-6x faster inference, making it particularly suitable for practical applications, including edge device deployment.


Key Improvements

Key improvements in Sarvam-2T include twice the average document length compared to existing datasets, a threefold increase in high-quality samples, and a balanced representation of scientific and technical content.

Sarvam claims Sarvam-1 is the first Indian language LLM.


Reported By

Kirpa B is passionate about the latest advancements in Artificial Intelligence technologies and has a keen interest in telecom. In her free time, she enjoys gardening or diving into insightful articles on AI.
