1.4 Supervised Finetuning
Using the dataset created above, we finetuned the Mistral Small 3.1 24B model. As a preliminary step, we removed the vision encoder and carefully verified that this modification did not affect the model's performance in text-only mode.
We trained a hybrid model that supports both 'non-think' and 'think' modes. In the 'think' mode, the model generates reasoning tokens within <think>...</think> tags in English before producing its final response in the target language. Through multiple experiments, we identified the optimal sequence in which to present the training data.
Interestingly, we discovered that training simultaneously for both think and non-think modes was ineffective, which contradicts recent literature. This indicated that, to enhance the base model's relatively lower performance on Indian languages, we needed to prioritize training in the non-think mode first, as that data contained a substantially higher proportion of Indian language tokens.
Based on these findings, we implemented a two-phase training approach: 2 epochs in non-think mode, followed by 2 epochs in think mode. We also employed model merging techniques between and after these training phases.
Our experiments included testing both the DARE-TIES and SLERP algorithms with various checkpoint combinations. The most effective method proved to be merging the epoch 1 and epoch 2 checkpoints after each training phase using the SLERP algorithm. The resulting merged model demonstrated performance equal to or better than the constituent models across nearly all benchmarks we evaluated. These results are shown below for a few benchmarks.
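To make the merging step concrete, below is a minimal sketch of SLERP applied tensor-wise to two checkpoint state dicts; the interpolation factor of 0.5 is illustrative, and in practice we relied on standard merging tooling rather than this exact code. DARE-TIES follows a similar pattern but first sparsifies and rescales the per-checkpoint deltas before combining them.

```python
import torch

def slerp(a: torch.Tensor, b: torch.Tensor, t: float = 0.5, eps: float = 1e-8) -> torch.Tensor:
    """Spherical linear interpolation between two weight tensors, applied on flattened vectors."""
    a_flat, b_flat = a.flatten().float(), b.flatten().float()
    a_norm = a_flat / (a_flat.norm() + eps)
    b_norm = b_flat / (b_flat.norm() + eps)
    # Angle between the two (normalized) weight vectors
    omega = torch.acos(torch.clamp(torch.dot(a_norm, b_norm), -1.0, 1.0))
    if omega.abs() < 1e-4:
        # Nearly parallel vectors: fall back to linear interpolation
        return ((1 - t) * a_flat + t * b_flat).view_as(a).to(a.dtype)
    so = torch.sin(omega)
    mixed = (torch.sin((1 - t) * omega) / so) * a_flat + (torch.sin(t * omega) / so) * b_flat
    return mixed.view_as(a).to(a.dtype)

def merge_checkpoints(ckpt_epoch1: dict, ckpt_epoch2: dict, t: float = 0.5) -> dict:
    """Merge two state dicts parameter by parameter (e.g., epoch-1 and epoch-2 checkpoints)."""
    return {name: slerp(ckpt_epoch1[name], ckpt_epoch2[name], t) for name in ckpt_epoch1}
```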
Section 2: Reinforcement Learning
The value of Reinforcement Learning in improving model scores, especially in math and programming, is well recognised. In our experiments, we found that a carefully curated reinforcement learning with verifiable rewards (RLVR) curriculum helped improve scores across most benchmarks. We discuss our pipeline below.
2.1 Curriculum of Tasks
In our initial experiments, we combined batches of data from multiple tasks into a single RLVR run. However, joint training led to several challenges:
• Imbalanced learning: The model prioritized easier instances across tasks, while harder, more critical examples saw limited improvement. For example, when GSM8K was trained alongside a function calling dataset, GSM8K showed strong gains, but function calling improved by only ~1%.
• Verification inefficiency: Verification time varied widely across datasets—some required significantly more time, bottlenecking the process. Additionally, coding tasks benefit from batched verification, which isn't feasible when mixing samples from multiple datasets.
• Sequence length mismatch: Different datasets required different maximum sequence lengths. A high sequence length setting (needed for some tasks) negatively impacted training efficiency and performance on tasks with shorter inputs.
Based on several ablation studies, we identified an effective task-wise curriculum. While the precise order of tasks had minimal impact on model performance, we designed a sequence that alternates between reasoning and language tasks to foster balanced skill development:
• Math skills (GSM8K and MATH): For GSM8K, we incorporated Indian language prompts in both native and romanized scripts alongside English prompts. This multilingual approach improved scores across all languages. Our final dataset comprised 40% English data, 40% data in Indian languages using native scripts, and 20% data in Indian languages using romanized script. Of the Indian language content, 28% was in Hindi, with 8% in each of the other nine languages. We prompted the model to generate responses in a fixed format to facilitate extraction, which proved more effective than few-shot prompting, especially for Indian language prompts.
• Advanced mathematics (Big Math): This dataset contains more challenging math problems. We instructed the model to generate responses within a LaTeX box command for straightforward verification against the ground truth.
• Instruction following (Extended IFEval): We used an expanded version of the base IFEval dataset that included additional datasets for Indian language tasks and multi-turn interactions. We selected a subset of constraints from the 25 outlined in Instruction-Following Evaluation for Large Language Models, focusing on those that provide a sufficient reward signal to drive model improvement. Examples include "Numbered Bullets," "Title," and "Minimum Number of Highlighted Sections." Sequencing instruction-following tasks early in the curriculum proved beneficial, as it improved scores across various benchmarks, including MMLU Pro.
• Code Understanding: For this task, the model predicts the output of a given code snippet and input string. Verification required an exact match between the predicted output and ground truth. We use the code understanding subset of the Synthetic-1 dataset and translate selected prompts into Indian languages to promote diversity and cross-lingual generalization.
• Code Generation: We use a high-quality subset of the PrimeIntellect dataset. This required building reliable infrastructure for sandboxed code execution, which processed a queue of (code, test_case) pairs and aggregated results to compute rewards. We focused on 'stdin-stdout' tasks with standard input and output specifications. We implemented flexible matching criteria, allowing for whitespace variations in strings and approximations up to 1e-6 for numerical values (a matching sketch is shown after this list).
• Translation: The final task focused on improving translations between English and Indian languages in both directions, where we rewarded the model if its output had a higher chrF++ score compared to a previous baseline.
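Referring back to the Code Generation item above, here is a minimal sketch of the flexible output matching we describe, with helper names of our own choosing: it tolerates trailing-whitespace variation and compares numeric tokens with a 1e-6 tolerance.

```python
def tokens_match(pred: str, ref: str, tol: float = 1e-6) -> bool:
    """Compare two whitespace-separated tokens, numerically if both parse as floats."""
    try:
        return abs(float(pred) - float(ref)) <= tol
    except ValueError:
        return pred == ref

def outputs_match(predicted: str, expected: str) -> bool:
    """Line-by-line comparison that tolerates whitespace variation and tiny numeric differences."""
    pred_lines = [line.rstrip() for line in predicted.strip().splitlines()]
    exp_lines = [line.rstrip() for line in expected.strip().splitlines()]
    if len(pred_lines) != len(exp_lines):
        return False
    for p, e in zip(pred_lines, exp_lines):
        p_toks, e_toks = p.split(), e.split()
        if len(p_toks) != len(e_toks):
            return False
        if not all(tokens_match(pt, et) for pt, et in zip(p_toks, e_toks)):
            return False
    return True
```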
We noticed that certain orderings of RLVR tasks had little effect on scores. For instance, swapping the order of GSM8K and IFEval did not significantly change the results, as shown below.
| Benchmark | IFEval ➝ GSM8K | GSM8K ➝ IFEval |
|---|---|---|
| GSM8K | 0.91 | 0.91 |
| GSM8K (Indic with Romanized Script) | 0.82 | 0.81 |
| IFEval | 0.88 | 0.89 |
2.2 Group Relative Policy Optimization
We adopt Group Relative Policy Optimization (GRPO), an efficient alternative to PPO that eliminates the need for training a separate value function. GRPO employs group-based sampling, where multiple outputs (rollouts) are generated for each prompt using the old policy. The relative rewards within each group are then used to compute the advantage for policy updates, which aligns naturally with the comparative nature of reward models.
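As a minimal illustration of the group-relative advantage computation (the function name and tensor shapes here are ours), each rollout's reward is normalized against the mean and standard deviation of its own group:

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """
    rewards: tensor of shape (num_prompts, group_size), one scalar reward per rollout.
    Returns advantages of the same shape: each reward is normalized by the mean and
    standard deviation of its own group, so no learned value function is needed.
    """
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Example: 8 rollouts for one prompt with binary rewards
rewards = torch.tensor([[1., 0., 0., 1., 0., 0., 0., 0.]])
print(group_relative_advantages(rewards))
```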
This approach significantly reduced memory overhead and proved considerably more cost-effective than traditional PPO. We conducted an extensive learning rate sweep and identified 3e-7 as the optimal value for most tasks. However, we observed that more challenging datasets requiring stronger reasoning capabilities, such as math and code generation, performed better with a reduced learning rate of 2e-7, highlighting the model's sensitivity to dataset complexity.
To maximize efficiency, we dynamically adjusted batch sizes based on the dataset-specific maximum generation length. Since different datasets impose varying token length limits, we calibrated the batch size accordingly to optimize GPU utilization within given memory constraints.
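A simple sketch of this batch-size calibration; the token budget and group size below are illustrative placeholders, not our production values.

```python
def rollout_batch_size(max_gen_tokens: int, token_budget: int = 262_144, group_size: int = 8) -> int:
    """
    Pick the number of prompts per step so that (prompts x rollouts x max generation length)
    stays within a fixed per-device token budget.
    """
    per_prompt_tokens = group_size * max_gen_tokens
    return max(1, token_budget // per_prompt_tokens)

# Short-answer dataset vs. long chain-of-thought dataset
print(rollout_batch_size(max_gen_tokens=1024))   # larger batches
print(rollout_batch_size(max_gen_tokens=8192))   # smaller batches
```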
2.3 Prompt Sampling Strategies
For each RLVR task, we implemented a prompt sampling approach targeting a pass rate of approximately 20% on the model being trained; that is, we selected prompts for which the model generated correct responses about 20% of the time before the RLVR run began. Our ablation studies revealed that higher pass rates reduced accuracy gains by up to 6%. Lower rates, though not detrimental to accuracy, increased the training time needed to reach equivalent performance, since the model had to explore the solution space more extensively to discover the optimal policy.
We further refined our sampling by filtering out all prompts with a perfect pass@8 score—that is, if all 8 rollouts on a particular prompt yielded correct results, we excluded that prompt from training because the resulting advantage is zero and hence it does not contribute to the policy update.
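A minimal sketch of this prompt-selection step, assuming hypothetical `generate_fn` and `verify_fn` helpers; the per-prompt rate thresholds are illustrative and would be tuned so the retained pool lands near the ~20% target.

```python
def select_prompts(prompts, generate_fn, verify_fn, rollouts: int = 8,
                   min_rate: float = 0.05, max_rate: float = 0.5):
    """
    Keep prompts whose pre-RLVR pass rate is neither trivially low nor perfect.
    Prompts with a perfect pass@8 contribute zero advantage and are dropped.
    """
    kept = []
    for prompt in prompts:
        passes = sum(verify_fn(prompt, generate_fn(prompt)) for _ in range(rollouts))
        rate = passes / rollouts
        if rate == 1.0:
            continue  # perfect pass@8: zero advantage, no gradient signal
        if min_rate <= rate <= max_rate:
            kept.append(prompt)
    return kept
```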
For Code Generation tasks, we limited each input to approximately 15 test cases, prioritizing test cases by string length as a proxy for difficulty. This approach served a dual purpose: it improved execution efficiency and made it harder for the model to exploit the reward function by solving only the simpler test cases, leading to more robust learning outcomes.
2.4 Reward Engineering
For most RLVR tasks, we employed a straightforward binary reward system that classified responses as either correct or incorrect. However, for Code Generation tasks, we implemented a more nuanced partial reward scheme to address the sparsity challenge inherent in binary rewards for complex coding problems.
Specifically, our Code Generation reward consisted of two components: (i) the fraction of test cases that successfully passed code execution, and (ii) bonus reward applied when all test cases passed. This graduated reward structure proved highly effective in improving the model's code generation capabilities.
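A minimal sketch of this graduated reward; the bonus magnitude is an illustrative placeholder rather than the exact value we used.

```python
def code_generation_reward(passed: int, total: int, bonus: float = 0.2) -> float:
    """Partial reward: fraction of test cases passed, plus a bonus when all of them pass."""
    if total == 0:
        return 0.0
    reward = passed / total
    if passed == total:
        reward += bonus  # extra credit for a fully correct solution
    return reward
```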
For Translation tasks, we observed that implementing a binary reward based on a fixed threshold of translation quality metrics like chrF++ was ineffective due to the high variability of chrF++ values across different sentences. To address this, we developed a 'relative reward score' with the following structure: (i) a score of 0.5 if the chrF++ metric exceeded the pre-RLVR baseline by a specified lower threshold, and (ii) a score of 1.0 if the chrF++ metric either exceeded the baseline by a higher threshold or surpassed a predefined global chrF++ threshold. This approach led to significant improvements in translation accuracy. To the best of our knowledge, these reward engineering techniques represent novel contributions to the field and were effective in enhancing model performance.
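A sketch of the relative reward for translation using sacrebleu's chrF++ implementation; the margins and global threshold shown here are illustrative placeholders, not the exact values we used.

```python
from sacrebleu.metrics import CHRF

chrf_pp = CHRF(word_order=2)  # word_order=2 corresponds to chrF++

def translation_reward(hypothesis: str, reference: str, baseline: float,
                       low_margin: float = 2.0, high_margin: float = 5.0,
                       global_threshold: float = 60.0) -> float:
    """
    Relative reward against the pre-RLVR baseline chrF++ score for the same sentence:
    0.5 for beating the baseline by a lower margin, 1.0 for beating it by a higher
    margin or exceeding a global chrF++ threshold.
    """
    score = chrf_pp.sentence_score(hypothesis, [reference]).score
    if score >= baseline + high_margin or score >= global_threshold:
        return 1.0
    if score >= baseline + low_margin:
        return 0.5
    return 0.0
```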
Below, we show a couple of examples of how the model’s performance has improved with our RLVR curriculum.
[Charts: benchmark score improvements over the RLVR curriculum, for reasoning and Indic benchmarks respectively.]
Our findings strongly suggest that partial rewards significantly facilitate learning across various tasks. Extending this approach to domains like mathematical reasoning remains an open and promising area for exploration. We believe this approach holds substantial potential for improving model performance in these more challenging domains.
Section 3: Results of Sarvam-M
We report results across five comprehensive benchmark categories:
• General Knowledge: MMLU, MMLU Pro, ARC-C, TriviaQA, GPQA
• Programming: HumanEval, MBPP, LiveCodeBench
• Mathematics: GSM8K, MATH
• Task and Conversational Adherence: IFEval, MTBench, AlpacaEval, WildBench
• Indic Language Performance: MILU, IndicGenBench
For several benchmarks, we create Indic variations using our translation and transliteration models. These variations are available in two formats:
• Native script (denoted with "-IN" suffix) - e.g., MMLU-IN, ARC-C-IN, TriviaQA-IN, GSM8K-IN
• Romanized script (denoted with "-IN-R" suffix) - e.g., MMLU-IN-R, GSM8K-IN-R
We release these new benchmark datasets as part of our indic-evals collection on Hugging Face.
The MILU benchmark contains both English and Indic queries, which we have separated into MILU-EN and MILU-IN for more granular analysis.
As shown in the introductory table, Sarvam-M consistently demonstrates superior or highly competitive performance across diverse benchmarks, particularly excelling in Indian language tasks, programming, mathematical reasoning, and multilingual capabilities, thereby making it a very capable model.
On Indic-focused benchmarks, Sarvam-M outperforms other models across the board, particularly on IndicGenBench (0.49), MILU-EN (0.83), and significantly on MILU-IN (0.75). It also excels on general knowledge tasks, leading on regionally adapted evaluations such as MMLU-IN (0.79) and ARC-C-IN (0.88), while achieving the highest overall score on ARC-C (0.95).
Beyond Indian language tasks, the model also stands out on reasoning benchmarks. On programming benchmarks, Sarvam-M achieves the highest scores on HumanEval (0.88) and MBPP (0.75), and outperforms other models by a considerable margin on LiveCodeBench (0.44). For mathematical reasoning, it achieves the highest performance on GSM8K-IN (0.92) and GSM8K-IN-R (0.82), while maintaining competitive scores on GSM8K (0.94) and the MATH benchmark (0.81).
3.1 Indic Vibe Check Benchmark
To better evaluate how people would engage with our model, we developed "Indic Vibe Check". We found this benchmark useful for understanding the model's effectiveness across varied conversational contexts, ensuring we are not over-optimizing for standard benchmark performance.
Drawing from Anthropic's Economic Index, we used individual entries as foundational tasks and prompted Gemini-1.5-Pro to generate realistic chat queries that actual users might pose. We employed a streamlined prompt-engineering methodology to create these scenarios across all 11 languages, capturing global usage patterns. Following rigorous quality control, in which we filtered out substandard or erroneous prompts, we created a dataset of approximately 3,000 diverse conversational scenarios. A few examples from the dataset are shown below.
We evaluate generations from different models using our reward scorer and find that Sarvam-M outperforms much larger models such as Llama 4 Scout on this benchmark (see the table below).
| Language | Sarvam M (24B) | Mistral 3 Small (24B) | Gemma 3 (27B) | Llama 4 Scout (17B/109B) | Llama 3.3 (70B) |
|---|---|---|---|---|---|
| Bengali | 8.17 | 7.62 | 7.29 | 7.59 | 7.01 |
| English | 8.35 | 8.32 | 7.85 | 8.17 | 8.20 |
| Gujarati | 8.21 | 7.53 | 7.52 | 7.67 | 6.74 |
| Hindi | 8.30 | 8.10 | 7.82 | 7.69 | 7.53 |
| Kannada | 7.98 | 7.53 | 7.53 | 7.68 | 6.59 |
| Malayalam | 8.19 | 7.50 | 7.46 | 7.68 | 6.96 |
| Marathi | 8.17 | 7.38 | 7.48 | 7.97 | 7.12 |
| Oriya | 7.82 | 3.43 | 6.52 | 6.46 | 5.68 |
| Punjabi | 8.15 | 7.49 | 7.48 | 7.63 | 6.96 |
| Tamil | 7.92 | 7.40 | 7.55 | 7.30 | 6.56 |
| Telugu | 8.05 | 7.39 | 6.95 | 7.52 | 6.87 |
| Average | 8.12 | 7.24 | 7.40 | 7.58 | 6.93 |
Section 4: Inference Optimisation
Deploying an LLM cost-effectively is at least as important as training it effectively. Post-training quantization (PTQ) is a particularly promising approach, allowing us to reduce model size and computational requirements without sacrificing performance. We combine this with grounding in external knowledge bases, which enhances the model's capabilities by providing access to factual, up-to-date information beyond its training data.
4.1 Post-Training Quantisation
We implemented a standard PTQ pipeline to convert our bfloat16 checkpoints to fp8 format, leveraging the framework-level support for fp8 on Hopper series GPUs. While the PTQ process itself is relatively straightforward, our experiments revealed that the calibration dataset significantly influences quantization outcomes. In our initial experiments, we selected a small calibration dataset consisting of just a few hundred examples, sampling them uniformly from our finetuning data. However, the resulting quantized model exhibited unexpectedly poor performance across evaluation metrics.
After conducting several ablation studies, we identified that successful PTQ requires a larger, more carefully sampled set of prompts that closely matches the distribution of the model's intended usage. With this improved calibration approach, the resulting quantized model maintained accuracy comparable to the original model across our benchmark suite. The table below demonstrates the relationship between calibration dataset size and model performance when using the optimal sampling strategy. Checkpoints calibrated with too few samples struggled to finish responses and to produce complete, coherent words.
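A sketch of how such a calibration set might be sampled to match expected usage rather than the uniform finetuning distribution; the tags, weights, and sample count are hypothetical, and the actual quantization call depends on the PTQ framework in use.

```python
import random
from collections import defaultdict

def build_calibration_set(finetuning_data, usage_weights, num_samples: int = 2048, seed: int = 0):
    """
    Sample calibration prompts so that the mix reflects expected production traffic.
    `usage_weights` maps a task/language tag to its expected share of traffic.
    """
    random.seed(seed)
    by_tag = defaultdict(list)
    for example in finetuning_data:
        by_tag[example["tag"]].append(example["prompt"])
    calibration = []
    for tag, weight in usage_weights.items():
        k = int(round(weight * num_samples))
        calibration.extend(random.sample(by_tag[tag], min(k, len(by_tag[tag]))))
    random.shuffle(calibration)
    return calibration

# Hypothetical traffic mix skewed towards Indic chat usage
weights = {"indic_chat": 0.5, "english_chat": 0.3, "code": 0.1, "math": 0.1}
```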
4.2 Deployment Configurations on H100
To optimize Sarvam-M inference on H100 GPUs, we conducted comprehensive profiling across four key axes: (i) data type (BF16, FP8), (ii) model parallelism, (iii) concurrency, and (iv) lookahead decoding. Our findings confirmed that serving in lower precision data types improves efficiency, with the transition from BF16 to FP8 doubling the output tokens/second for Sarvam-M. Lookahead decoding provided further throughput improvements of up to 2×, though with notable trade-offs:
• all requests in a batch must have the same number of accepted tokens
• at higher concurrencies, the overhead of lookahead decoding becomes the limiting factor
For tensor parallelism, our experiments revealed that for models with 20-30B parameters, going beyond TP = 2 yielded diminishing returns in terms of cost per output token. Higher tensor parallelism configurations require more reserved GPUs and introduce additional communication overhead, resulting in non-linear scaling of output tokens/second. After evaluating these trade-offs, we identified the Pareto frontier for cost-effective serving, as shown below.
Based on our analysis, we implemented two distinct optimal configurations for our use case:
• High Concurrency Configuration: Utilizing FP8 quantization and tensor parallelism to achieve ~100 output tokens/second (p50) per user stream at high concurrencies
• Low Concurrency Configuration: Implementing lookahead decoding to reach ~300 output tokens/second (p50) per user stream at lower concurrency levels
These configurations are deployed using our internal inference engine, which consistently outperforms vLLM across all optimization axes.
4.3 Grounding with Wikipedia
Given the relatively compact size of the Sarvam-M model, we anticipated that additional knowledge grounding would enhance its performance. To address this, we developed an efficient Retrieval-Augmented Generation (RAG) infrastructure using Wikipedia. Our approach consists of three key components.
4.3.1 Chunking Strategy Optimization
We evaluated four distinct strategies to chunk the text for creation of a vector database:
• token count-based: chunks text linearly up to a specified number of tokens or paragraph breaks
• recursive: divides documents into progressively smaller chunks until meeting a specified token budget
• semantic: groups document sections based on semantic similarity, potentially bringing together related content from different parts of a document
• semantic double-pass merging (SDPM): similar to semantic chunking but with an additional parameter to specify how far merged chunks can be from each other
To evaluate these strategies, we created a benchmark using randomly selected Wikipedia articles and synthetically generated queries, and measured recall at top-k. Both the recursive and SDPM strategies performed best, with SDPM offering only marginal improvements. Considering the additional computational cost of SDPM in generating embeddings for each document section, we chose the recursive chunking approach, which significantly outperforms the naive token count-based method. An important additional optimization is ensuring that tables in Wikipedia articles remain intact rather than being split across multiple chunks.
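For reference, a minimal sketch of recursive chunking under stated assumptions: whitespace-based token counting as a stand-in for a real tokenizer, and an illustrative 512-token budget.

```python
def recursive_chunk(text: str, max_tokens: int = 512,
                    separators=("\n\n", "\n", ". ", " ")) -> list[str]:
    """Split on the coarsest separator first; recurse with finer separators on any
    piece that still exceeds the token budget."""
    def num_tokens(s: str) -> int:
        return len(s.split())  # rough proxy; use the embedding model's tokenizer in practice

    if num_tokens(text) <= max_tokens or not separators:
        return [text]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = f"{current}{sep}{piece}" if current else piece
        if num_tokens(candidate) <= max_tokens:
            current = candidate          # keep packing pieces into the current chunk
        else:
            if current:
                chunks.append(current)
                current = ""
            if num_tokens(piece) > max_tokens:
                chunks.extend(recursive_chunk(piece, max_tokens, finer))
            else:
                current = piece
    if current:
        chunks.append(current)
    return chunks
```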
4.3.2 Embedding Model Selection
We evaluate multiple models that lead the MTEB leaderboard and shortlist two candidates: bge-m3 and bge-multilingual-gemma2. Using an internal benchmark comprising chit-chat, keyword-based, and factual queries, we find that bge-multilingual-gemma2 surpasses bge-m3 by a substantial margin of 35% on recall@k. However, this performance improvement comes with trade-offs: 3× larger storage requirements for embeddings and increased lookup latency.
To mitigate these costs, we explore various quantization approaches. Binary quantization proves too aggressive, resulting in significant performance degradation, while scalar 8-bit quantization emerges as the best balance between cost-efficiency and performance. We utilize the Milvus vector database to optimize sustained latency under high concurrency.
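A minimal sketch of scalar 8-bit quantization of embedding vectors, independent of any particular vector database; in practice the database's built-in scalar-quantized index types can be used instead.

```python
import numpy as np

def quantize_int8(embeddings: np.ndarray):
    """Scalar 8-bit quantization: map each dimension's float range onto 256 levels.
    Returns quantized vectors plus per-dimension scale/offset needed at query time."""
    mins = embeddings.min(axis=0)
    maxs = embeddings.max(axis=0)
    scales = (maxs - mins) / 255.0
    scales[scales == 0] = 1e-12  # guard against constant dimensions
    quantized = np.round((embeddings - mins) / scales).astype(np.uint8)
    return quantized, mins, scales

def dequantize(quantized: np.ndarray, mins: np.ndarray, scales: np.ndarray) -> np.ndarray:
    """Approximate reconstruction of the original float vectors."""
    return quantized.astype(np.float32) * scales + mins
```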
4.3.3 Performance Evaluation with Wikipedia Grounding
To evaluate the model's performance with Wikipedia lookup, we implement a routing system where, upon receiving a query, the model determines whether a lookup is necessary, and if so, generates the specific query to use. This system functions cross-lingually, allowing the model to generate English Wikipedia queries regardless of the input language.
For accuracy measurement, we employ the SimpleQA Benchmark with Llama 3.3 70B as a judge. Additionally, we create SimpleQA-IN by translating the SimpleQA prompts into 10 Indic languages using Gemini 2.5 Pro. The comparative results are presented below.
We see a large improvement in factual accuracy across both English and Indic language queries. When equipped with the Wikipedia lookup capability, Sarvam-M's performance on the SimpleQA benchmark increases substantially, with correct responses jumping from a mere 5% to 72%, outperforming OpenAI's o3 model (49%). Similarly, on the SimpleQA-IN benchmark covering 10 Indic languages, the KB-based setup achieves 59% correct responses, surpassing both o3-mini (11%) and o3 (47%).
5 Failed Experiments
5.1 Tokenizer Extension
To decrease the fertility scores of Indian languages and improve inference throughput, we extended Mistral Small's vocabulary with additional Indian language tokens. This approach resulted in a significant drop in the model's knowledge, a degradation that persisted even after extensive SFT. This suggests that newly added tokens need to be pre-trained far more extensively.
5.2 Tokenizer Transplant
We also explored soft distillation from larger models, which requires identical vocabularies between teacher and student models. Since the student model (Mistral Small) and potential teachers (Llama 3.3 70B, Deepseek R1, etc.) use different vocabularies, we use the following transplantation approach:
• For tokens existing in both vocabularies, we directly copied the original embeddings from the student model to preserve exact representations.
• For tokens unique to the teacher model, we identified k-nearest neighbors among common tokens between teacher and student, then used the student's embeddings of these neighbors to approximate new token embeddings through barycentric interpolation.
• For special cases like byte tokens, we implemented fallback logic such as prefix matching before resorting to approximation methods.
The merged embeddings maintained the student's embedding dimension while expanding to accommodate the teacher's vocabulary, effectively transplanting the teacher's tokenization capabilities while preserving the student's semantic representation space.
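A sketch of the transplantation step for a single teacher-only token, under our interpretation of barycentric interpolation (least-squares weights over the k nearest shared tokens); the dictionary-based embedding lookups and the choice of k are illustrative.

```python
import numpy as np

def transplant_embedding(token: str, teacher_emb: dict, student_emb: dict, k: int = 8) -> np.ndarray:
    """
    Approximate a student-space embedding for a token that exists only in the teacher
    vocabulary: find its k nearest neighbours among tokens shared by both vocabularies
    (in teacher embedding space), then combine the student embeddings of those
    neighbours using barycentric (least-squares) weights.
    """
    shared = [t for t in teacher_emb if t in student_emb and t != token]
    query = teacher_emb[token]
    # k nearest shared tokens by cosine similarity in the teacher space
    neighbours = sorted(
        shared,
        key=lambda t: -np.dot(query, teacher_emb[t])
        / (np.linalg.norm(query) * np.linalg.norm(teacher_emb[t]) + 1e-8),
    )[:k]
    T = np.stack([teacher_emb[t] for t in neighbours])   # (k, d_teacher)
    S = np.stack([student_emb[t] for t in neighbours])   # (k, d_student)
    # Express the query as a weighted combination of its neighbours
    weights, *_ = np.linalg.lstsq(T.T, query, rcond=None)
    weights = weights / (weights.sum() + 1e-8)
    return weights @ S
```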
With aligned vocabularies, we fine-tuned the student to learn the teacher's log-probability distribution using its top-20 logits, employing a combination of cross-entropy and Kullback-Leibler divergence losses. Although the student model rapidly learned to mimic the teacher distribution, it required longer training and more data to recover the performance lost due to the vocabulary change, and this process did not outperform a simple SFT procedure.
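A sketch of this distillation objective, restricted to the teacher's top-k logits; `alpha`, the temperature, and the shape conventions are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def topk_distillation_loss(student_logits, teacher_topk_logits, teacher_topk_ids,
                           labels, alpha: float = 0.5, temperature: float = 1.0):
    """
    Combine hard-label cross-entropy with a KL term computed only over the teacher's
    top-k token ids (k = 20 in our runs).
    student_logits:      (batch, seq, vocab)
    teacher_topk_logits: (batch, seq, k)
    teacher_topk_ids:    (batch, seq, k)
    labels:              (batch, seq), with -100 marking ignored positions
    """
    ce = F.cross_entropy(student_logits.flatten(0, 1), labels.flatten(), ignore_index=-100)

    # Gather the student's logits at the teacher's top-k vocabulary positions
    student_topk = torch.gather(student_logits, dim=-1, index=teacher_topk_ids)
    teacher_probs = F.softmax(teacher_topk_logits / temperature, dim=-1)
    student_logprobs = F.log_softmax(student_topk / temperature, dim=-1)
    kl = F.kl_div(student_logprobs, teacher_probs, reduction="batchmean")

    return alpha * ce + (1 - alpha) * kl
```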
Both of the above findings reinforced the need to pre-train models from scratch with efficient tokenisers.
5.3 RL with LLM-Based Rewards
We explored using LLM-based verification as a reward signal in RLVR for programming tasks, but this did not yield consistent performance improvements. A key limitation was the non-deterministic nature of these evaluations, which introduced instability in reward attribution. Despite attempts to mitigate this through well-defined scoring rubrics and partial reward schemes, these approaches proved largely ineffective. However, these experiments focused on verification with smaller models given the cost implications, and a similar experiment with larger models remains to be done.
6 Conclusion
In this blog, we introduce Sarvam-M, a large language model with a focus on Indian languages and reasoning tasks like math and programming. The model is built on top of Mistral Small and enhanced through supervised finetuning (SFT), reinforcement learning with verifiable rewards (RLVR), and inference optimization. The accuracy and efficiency gains established the value of the various post-training steps and deployment configurations.
In a subsequent release and blog post, we will explore grounding this model with knowledge bases and web search. We will also continue to explore RLVR experiments in new environments closer to the real-world settings where we deploy the models.