Now supporting 22 Indian languages and structured long-form text
Download the model from Hugging Face, try it on our playground, and build with our APIs.
Making content available across languages is a key consideration for digital accessibility. While translation systems have continued to evolve, work remains in three broad directions: one, support for more languages; two, more natural translation of stylised (such as idiomatic) long-form text; and three, support for structured text in different formats. These formats vary with the source: a math textbook with equations, a web page with HTML markup around the content, or the output of digitising an image, with potential OCR-related errors.
Although multilingual large language models have demonstrated the ability to do long-form translation, their performance on Indian languages still trails behind. At Sarvam, we’re working to address this by focusing on the 22 Scheduled Indian languages, with an emphasis on more natural translation while supporting long-form structured content.
We are happy to share a significant step forward on this front with our latest model, Sarvam-Translate. Trained by fine-tuning Gemma3-4B-IT, Sarvam-Translate supports 22 Indian languages: Hindi, Bengali, Marathi, Telugu, Tamil, Gujarati, Urdu, Kannada, Odia, Malayalam, Punjabi, Assamese, Maithili, Santali, Kashmiri, Nepali, Sindhi, Dogri, Konkani, Manipuri (Meitei), Bodo, and Sanskrit. It supports paragraph-level translation for all 22 languages, and translation of diverse structured content for 15 of them.
In human evaluation by language experts, Sarvam-Translate was judged significantly better than much larger models such as Gemma3-27B-IT, Llama4 Scout, and Llama-3.1-405B-FP8. Further, in automated evaluation of its ability to preserve structure in long-form content, Sarvam-Translate shows high accuracy (> 4.9 out of 5) for 15 languages.
Sarvam-Translate is available to try out and use in your applications via our API store. Further, to enable others to use and build upon the model, we are releasing an open-weights model on Hugging Face. This continues our commitment to building in the open to enable a sovereign AI ecosystem for India.
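To make this concrete, here is a minimal sketch of what loading the open weights with the Hugging Face transformers library could look like. The repository id, model class, and prompt wording are illustrative assumptions; the model card on Hugging Face documents the exact usage.

```python
# A minimal sketch, assuming the model follows the standard Gemma3 chat
# format. The repository id and instruction wording are assumptions;
# consult the Hugging Face model card for the exact usage. For the
# multimodal Gemma3 checkpoints, the Auto class may differ.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "sarvamai/sarvam-translate"  # assumed repository id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

messages = [
    {"role": "system", "content": "Translate the text below to Hindi."},
    {"role": "user", "content": "Be the change you wish to see in the world."},
]
inputs = tokenizer.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=256, do_sample=False)
# Decode only the newly generated tokens.
print(tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True))
```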
Below is a feature matrix for each language supported by Sarvam-Translate.
Example Use-Cases of Sarvam-Translate
Below, we share examples to highlight the different ways in which Sarvam-Translate could be used.
Translating web pages without breaking the HTML structure
Translating web content is one of the central use-cases of translation models. However, extracting, translating, and reinserting text is often error-prone and tedious. Sarvam-Translate streamlines this by translating only the visible textual content, preserving all HTML tags and structure. As seen in the image below, elements such as emphasis are maintained in the text.
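As a downstream sanity check on this behaviour, one could compare the tag sequences of the source and translated pages. The sketch below is an illustrative assumption using only the Python standard library, not part of the product; note it is strict and would also flag legitimately translated attribute values such as alt text.

```python
# A minimal sketch: verify a translated page kept its markup by
# comparing tag sequences of source and output HTML.
from html.parser import HTMLParser

class TagCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.tags = []
    def handle_starttag(self, tag, attrs):
        self.tags.append(("start", tag, tuple(attrs)))
    def handle_endtag(self, tag):
        self.tags.append(("end", tag))

def tag_sequence(html: str):
    collector = TagCollector()
    collector.feed(html)
    return collector.tags

def html_structure_preserved(source: str, translated: str) -> bool:
    # True when every tag (and its attributes) survives in order.
    return tag_sequence(source) == tag_sequence(translated)
```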
Translating LaTeX documents while preserving syntax
Academic and technical documents in LaTeX combine human-readable text with formatting and command syntax. Maintaining the integrity of LaTeX code during translation is challenging. Sarvam-Translate identifies and translates only the human-readable content while preserving LaTeX syntax and structure. In the example below, tables and formatting are retained. Also notice the model's choice to retain the author names of cited papers in English.
Translating chemistry documents
Chemistry documents often mix prose with specialised chemical notation and equations. Ensuring these remain unaltered during translation requires fine control. Sarvam-Translate accurately translates the surrounding text while preserving chemical equations and formatting. Note also that variables such as x and y are retained in Roman characters.
Translating idioms, slang and cultural references
Idiomatic expressions, figures of speech, and culturally specific phrases often lose meaning when translated literally. We find that Sarvam-Translate often produces translations that preserve the original tone, intent, and nuance. Notice in the following how several idiomatic phrases are correctly translated. For example, 'being behind the eight ball', an analogy from the game of pool meaning being in trouble, is translated appropriately in Telugu as 'జయేష్ అప్పటికే చాలా కష్టాల్లో ఉన్నాడు'.
Translating folk tales while preserving cultural nuance
Narrative text such as folk tales and fiction carries cultural depth and stylistic nuance. Capturing this while maintaining readability across languages is complex. Our model handles narrative flow, cultural expressions, and stylistic elements to produce faithful translations. For example, in Maithili the phrase "One who interferes in others' work" has been translated to "जे दोसरक काजमे टाँग अड़ाबैत अछि", which is an appropriate use of a common idiom.
Translating social media posts with slang & emojis
Social media posts frequently contain informal language, slang, emojis, and unconventional structure, posing unique translation challenges. We find that Sarvam-Translate is often able to handle such content, accurately conveying both meaning and tone. Notice in the example below both the use of emojis and the stylised content. For example, "Wut u doing rn?" is translated appropriately in Kannada as "ಏನ್ ಮಾಡ್ತಾ ಇದೀಯಾ?"
Translating subtitle files while maintaining timing and formatting
Subtitle (SRT) files require precise alignment of translated text with timing and formatting cues. Our model translates only the spoken dialogue while maintaining subtitle structure and synchronisation. This enables producing accessible SRT files with a single API call.
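A simple way to verify this downstream is to check that every timing cue survives translation verbatim and in order. The sketch below is an illustrative assumption, not part of the API.

```python
# A minimal sketch: confirm that SRT timing cues are unchanged in the
# translated output. The regex matches standard SRT timestamps.
import re

TIMING = re.compile(r"\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}")

def timing_cues(srt_text: str) -> list[str]:
    return TIMING.findall(srt_text)

def srt_timing_preserved(source: str, translated: str) -> bool:
    # Timing lines must survive translation verbatim and in order.
    return timing_cues(source) == timing_cues(translated)
```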
Translating documents with embedded foreign-language text
Multilingual documents often contain embedded foreign-language text that, in most situations, is expected to remain untranslated. Sarvam-Translate demonstrates the ability to identify such segments and preserve them while translating the surrounding content. In the example below, the traditional Chinese characters are preserved.
Translating legal documents with precision
Legal texts demand high precision in terminology and structure. Translating such documents requires preserving legal references, clauses, and formatting. We find that Sarvam-Translate is able to ensure accuracy and consistency suitable for professional legal use. For example, consider the complex sentence "Upon a meticulous perusal of the statutory scheme, the legislative history of the Parent Act, and the precedents cited at the Bar, this Court is disinclined to accede to the appellant's submissions". The model is able to translate this in Assamese as "বিধিবদ্ধ আঁচনি, মূল আইনৰ বিধিবদ্ধ ইতিহাস, আৰু বাৰত উল্লেখ কৰা পূৰ্ব উদাহৰণসমূহৰ এক নিখুঁত পৰ্যালোচনাৰ পিছত, এই আদালতে আপীলকাৰীৰ যুক্তিসমূহ মানি ল'বলৈ অনিচ্ছুক" which is a more natural sentence structure and correctly conveys the technical content.
Translating code files while protecting syntax
Translating code files requires distinguishing between executable code and human-readable text, such as comments or documentation. We find that Sarvam-Translate selectively translates natural-language content while leaving code unchanged. For example, the comment "Function to perform Bubble Sort" is translated in Bengali as "বাবল সর্ট করার ফাংশন", retaining the technical term 'bubble sort' rather than forcing an unnecessarily complicated translation.
Given the generic abilities of the model, we believe new use-cases can be unlocked. We encourage you to try Sarvam-Translate and share your findings with us on social media or our discord.
In the remainder of this blog, we walk through technical details of the model's evaluation and training.
Automatic Evaluation on Structured Content
To test the abilities of Sarvam-Translate, we conducted a large-scale evaluation that spans multiple languages, document styles, and content formats. For this evaluation, we used a curated dataset of articles covering a range of real-world formats and content types. The dataset included GitHub Markdown files, scanned PDFs that were digitized into Markdown using Vision-Language Models (VLMs), documents containing mathematical equations written in LaTeX, and chemistry documents featuring complex chemical notations. It also incorporated code files with embedded comments and documentation, as well as web page content extracted from HTML. We created 1,000 documents in each of these categories.
Given the scale of the evaluation, conducting human reviews for every document type and language pair would be impractical. To address this, we used Gemini 2.5 Flash to perform automatic evaluations. For each document type, we designed prompts that direct Gemini to focus on the specific aspects of translation quality that matter most for that format. For example, the evaluation criteria for math equations differ from those for code or HTML. We describe these criteria below.
Markdown Content (GitHub)
Goal: Ensure that the translated document preserves the Markdown structure including headings, bullet points, links, code blocks, and that the translated content fits naturally within the structure. The formatting should remain exactly as in the source.
Digitized Markdown (VLM / OCR output)
Goal: Evaluate how robust the translation is when the source document comes from a digitized, OCR-extracted source which may contain slight errors or inconsistencies. The structure (tables, headings) should be preserved, and the translation should handle noise gracefully without introducing additional errors.
Math Content (LaTeX equations)
Goal: Ensure that LaTeX equations are preserved exactly in the translated document, while surrounding text is translated naturally. No part of the equation syntax should be altered or corrupted. The translation must maintain the integrity of mathematical notation.
Chemistry Content (Equations, Symbols)
Goal: Validate that chemical equations, including subscripts, superscripts, arrows, and special symbols, are retained correctly. The text around the equations should be translated naturally, while chemical notations should remain untouched and precisely formatted.
Code Content (Code with Comments)
Goal: Ensure that code remains exactly the same in the translated document, and that only the comments and documentation are translated. The programming syntax must not be altered. The evaluation should also check that no unintended changes (indentation, special characters) were introduced.
HTML Content (Web pages)
Goal: Verify that HTML tags and structure are preserved exactly. The visible text content within the tags should be translated fluently, but the HTML itself should remain unchanged. If certain elements (e.g. italics, bold) are present in the original, the corresponding translation should maintain the same styling.
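Putting the criteria above together, the judging setup can be pictured as a format-aware prompt builder. The sketch below is illustrative only: the rubric wording and the 1-5 scale are paraphrased assumptions, and the resulting prompt would be sent to the evaluator model (Gemini 2.5 Flash).

```python
# An illustrative sketch of the format-aware, LLM-as-judge setup.
# Rubric texts are paraphrased assumptions, not the actual prompts.
RUBRICS = {
    "markdown": "Headings, bullets, links, and code blocks must be preserved.",
    "latex": "Equations must be unchanged; score only the surrounding prose.",
    "chemistry": "Formulae, subscripts, and arrows must remain untouched.",
    "code": "Code must be identical; score only translated comments/docs.",
    "html": "Tags and attributes must be preserved; score the visible text.",
}

def build_judge_prompt(content_type: str, source: str, translation: str) -> str:
    # Compose a single grading prompt for the evaluator model.
    return (
        f"You are grading a translation of {content_type} content.\n"
        f"Criterion: {RUBRICS[content_type]}\n"
        f"Rate the translation from 1 to 5 and justify briefly.\n\n"
        f"SOURCE:\n{source}\n\nTRANSLATION:\n{translation}"
    )
```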
Evaluation Results
The table below summarises evaluation scores (out of 5) across languages, averaged over the content types described above:
Human Evaluations
While automatic evaluations are useful at scale, human evaluations are necessary given the subjectivity of translation quality and limited abilities of frontier models in specific languages. To do this, we curated 100 English documents covering a diverse mix of content types, including technical material such as scientific, mathematical, and chemistry-based content; informal and spoken text drawn from speech transcripts and conversational blogs; structured content such as Markdown documents, HTML pages, and code snippets; and formal content like news articles and textbook excerpts.
These documents were translated using Sarvam-Translate, as well as with leading open-source LLMs, such as Gemma3-27B-IT, Llama-3.1-405B-FP8 and Llama4 Scout. The translated outputs were then evaluated by professional human annotators. The evaluators were professional language experts, each with multiple years of professional experience in translation creation and validation, and with deep proficiency in both English and their target Indian language.
The human evaluators assessed the translations on several key dimensions, including fluency, adequacy, faithfulness to the source structure, and inclusivity. They were shown two translations in random order and asked whether one was preferred or both were equally good. The results of the human evaluation are summarised in the following tables.
Gemma3-27B-IT vs Sarvam-Translate
Llama4 Scout vs Sarvam-Translate
Llama-3.1-405B-FP8 vs Sarvam-Translate
Across all Indian languages, Sarvam-Translate consistently outperformed other models, particularly in its ability to handle structured content, maintain coherence over longer contexts, and deliver inclusive and culturally sensitive translations.
How Sarvam-Translate was trained
This journey has been years in the making, built on sustained effort and deep expertise in developing Indian language technologies in the open. We invested heavily in building robust data-cleaning pipelines and sophisticated annotation workflows to ensure the highest-quality datasets. We also leveraged the Gemma 3 open models, which, in our comparisons, provided a better starting point for building Sarvam-Translate than any other model.
Data
Sarvam-Translate was trained on a rich and diverse dataset of translation pairs between English and 22 Indian languages. This dataset combines multiple sources. First, we incorporated cleaned data from past open-data efforts, including BPCC, which itself contains both mined and manually validated data. We carefully cleaned this data using robust internal pipelines. Second, we generated new translation pairs from carefully curated English source content. This spanned a wide range of domains: scientific and historical content, conversational and modern text, and structurally complex formats such as code, LaTeX, HTML, and chemistry equations. In this process, we recognised the need for very high-quality filters on the data; even large models such as Llama-3.1-405B-FP8 make many errors when generating output in Indian languages.
Training Process
We trained Sarvam-Translate on top of Gemma3-4B-IT, fine-tuning it in a two-stage process. In the first stage, we fine-tuned the full model on a larger dataset with broad coverage, including some noisier but domain-diverse data, to establish wide-ranging translation capability. This stage is also required to build basic language ability in languages the base model is not already fluent in. In the second stage, we used LoRA to fine-tune the model further on a smaller, highly curated, format-diverse dataset, paying careful attention to format preservation and style consistency. Through various ablations we found this two-stage process to be effective.
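For illustration, the second stage might look like the following sketch using the peft library. All hyperparameters (rank, scaling, target modules) and the base-model id are assumptions for illustration, not Sarvam's actual training configuration.

```python
# A minimal sketch of a LoRA second-stage fine-tune with peft.
# Hyperparameters below are illustrative assumptions; for the
# multimodal Gemma3 checkpoints, the Auto class may differ.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("google/gemma-3-4b-it")

lora_config = LoraConfig(
    r=16,                      # assumed rank
    lora_alpha=32,             # assumed scaling factor
    lora_dropout=0.05,         # assumed dropout
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()
# Train on the curated, format-diverse dataset with a standard
# supervised fine-tuning loop (e.g. trl's SFTTrainer).
```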
Inference Efficiency
We quantised Sarvam-Translate using Post-Training Quantization (PTQ), leveraging a large and diverse calibration dataset to ensure robust 8-bit inference performance. The inference system is finely tuned to run on NVIDIA NIM with the TensorRT engine, utilising full FP8 kernels for improved throughput and efficiency. This optimised setup is available for use in our API store and can be tried in the dashboard.
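For intuition, the calibration step of PTQ can be pictured as running a calibration set through the model, recording per-tensor activation ranges, and deriving FP8 scales from them. The code below is a conceptual sketch only; the production pipeline uses TensorRT/NIM tooling rather than hand-rolled hooks.

```python
# A conceptual sketch of max-abs calibration for FP8 PTQ scales.
# Not the production quantization path; illustration only.
import torch

FP8_E4M3_MAX = 448.0  # largest representable value in float8 e4m3

@torch.no_grad()
def calibrate_scales(model, calibration_batches):
    amax = {}
    hooks = []

    def record(name):
        def hook(_module, _inputs, output):
            if isinstance(output, torch.Tensor):
                amax[name] = max(amax.get(name, 0.0), output.abs().max().item())
        return hook

    # Observe activations at every linear layer.
    for name, module in model.named_modules():
        if isinstance(module, torch.nn.Linear):
            hooks.append(module.register_forward_hook(record(name)))
    for batch in calibration_batches:
        model(batch)
    for h in hooks:
        h.remove()
    # Scale maps the observed dynamic range onto the FP8 range.
    return {name: a / FP8_E4M3_MAX for name, a in amax.items()}
```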
Known Limitations
While this model supports 22 languages across a variety of tasks, performance can vary depending on the language. These differences stem from the balance of pre-training data, post-training resources, and each language’s representation in the tokeniser. Document translation is a key capability, but performance is more limited for certain languages such as Bodo, Dogri, Kashmiri, Manipuri, Santali, Sanskrit, and Sindhi, where we have observed lower translation quality and occasional incomplete outputs.
For better-supported languages, the model performs well on most document formats. However, it has not been extensively trained on long-form LaTeX or HTML documents. As a result, it may sometimes miss tags or other structural elements in very large .tex or .html files. To maintain accuracy, we recommend splitting large code or LaTeX files into smaller sections and translating them individually, when possible.
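One minimal way to implement this workaround is sketched below. The split heuristic (\section boundaries) is an illustrative assumption; any splitter that keeps LaTeX environments intact would work, and `translate` stands in for a call to the model.

```python
# A minimal sketch of the recommended workaround: split a large LaTeX
# file at \section boundaries and translate each chunk independently.
import re

def split_latex(tex: str) -> list[str]:
    # Lookahead split keeps each \section heading with the text after it.
    parts = re.split(r"(?=\\section\{)", tex)
    return [p for p in parts if p.strip()]

def translate_large_tex(tex: str, translate) -> str:
    # `translate` is a hypothetical callable wrapping the model/API.
    return "".join(translate(chunk) for chunk in split_latex(tex))
```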
In addition, we have infrequently observed that some outputs may include transliterations or code-mixed segments, particularly in low-resource or highly inflected languages.
Conclusion
Translation technology has progressed substantially over the past few years. With Sarvam-Translate, we extend these advancements to 22 Indian languages, ensuring their representation across a wide range of content types.
What makes this achievement especially meaningful is the model’s ability to handle real-world, mixed-format documents such as Markdown, HTML, scientific notation, code, and more. The model not only preserves structure but also respects context, style, and gender nuances, ensuring that every translation feels authentic and natural.
We believe that this is a key enabler for:
• Making the web more accessible in Indian languages
• Supporting education and research in native languages
• Empowering government and public services to reach citizens in their preferred language
• Catalysing the creation of Indian language digital content at scale
This is just the beginning. As content formats continue to evolve, our mission remains the same: to ensure that Indian languages are well-represented in the digital landscape.
We remain committed to advancing this work through collaboration with the open-source community, researchers, industry, and government partners, making high-quality translation accessible for all 22 Indian languages.
We invite you to explore Sarvam-Translate, try it out, give us feedback, and join us in shaping the next frontier of Indian language technology.