Sarvam AI

Text to Speech for Indian Languages

The most natural AI voice generator for Hindi, Tamil, Telugu, Bengali, and 7 more Indian languages. Convert text to voice with 35+ speakers, sub-250ms streaming, emotion control, and Hinglish code-switching. Try the text to speech converter below or integrate via API.

At a glance

  • What: Text to speech (AI voice generator / speech synthesis) API for 11 Indian languages
  • Model: Bulbul V3 — 35+ voices, 48kHz full-band, emotion control, voice cloning. Top-ranked in Josh Talks blind study (20K+ votes)
  • Latency: Sub-250ms first-byte via WebSocket streaming
  • Features: Hinglish/Tanglish code-switching, Indian name pronunciation, pace/pitch control, 8 audio formats
  • Pricing: Free tier with 1,000 credits, then Rs. 30 per 10K chars. See pricing
  • Try it: Free playground below — no signup needed

Voices

  • Shubh (male)
  • Shreya (female)
  • Manan (male)
  • Ishita (female)

What is text to speech?

Text to speech (TTS) — also called text to voice, speech synthesis, or AI voice generation — is a technology that converts written text into spoken audio using artificial intelligence. A text to speech converter takes any written input and produces natural, human-sounding audio on demand. Modern AI voice generators use deep neural networks to generate speech waveforms directly from text, producing audio that sounds fluid and expressive, with realistic intonation and rhythm.

For Indian languages, text to speech presents challenges that English-centric models were never designed to solve. Hindi alone uses Devanagari script with conjunct consonants, nasalized vowels, and retroflex sounds that have no direct equivalent in English phonetics. Tamil has an agglutinative grammar where a single word can encode what English needs an entire phrase to express. Telugu and Kannada share Dravidian roots but have distinct prosodic patterns. And then there is the reality of how Indians actually communicate: mixing Hindi and English mid-sentence (Hinglish), dropping Tamil and English together (Tanglish), or weaving Bengali with English technical terms (Benglish). Any serious AI voice generator for India must handle this code-switching natively, not as an afterthought.

Sarvam's approach differs from that of global cloud providers, which train primarily on English data and then extend to Indian languages. Bulbul V3, the model behind Sarvam TTS, is trained from the ground up on Indian speech data collected across 11 languages. The model learns the natural rhythms of Hindi conversations, the melodic contours of Tamil narration, and the cadence of Bengali storytelling from real speakers. In an independent blind study by Josh Talks, over 500 annotators cast 20,000+ votes across 11 languages, and Bulbul V3 was the most-preferred model for naturalness, beating ElevenLabs and Cartesia Sonic in both full-band (48 kHz) and telephony (8 kHz) evaluations.

How Sarvam TTS works

Text normalization

Before any audio is generated, the input text goes through a normalization stage that expands abbreviations, formats numbers, and resolves ambiguities. In English, this is relatively straightforward. In Indian languages, it gets complicated fast. Consider a sentence like "Dr. Lal PathLabs ka appointment 18 Jan ko 7:30 AM pe hai." The system needs to know that "Dr." should be spoken as "Doctor" (not "D-R"), that "18 Jan" becomes "athaarah January," that "7:30 AM" becomes "saadhe saat baje subah," and that "PathLabs" is a proper noun that should not be translated.

Sarvam's normalizer handles Indian names, addresses (including PIN codes and landmark references), currency amounts in rupees, phone numbers in the Indian 10-digit format, and mixed-script text where Devanagari and Latin characters appear in the same sentence. This preprocessing step is what makes the difference between robotic output and speech that sounds like a real person reading the text aloud.
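To make the normalization step concrete, here is a minimal rule-based sketch that reproduces the example sentence above. It is a toy illustration, not Sarvam's normalizer: the function name `normalize` and the lookup tables are hypothetical, and the tables contain only the spoken forms quoted in the text.

```python
import re

# Hypothetical lookup tables, limited to the forms quoted in the text above.
ABBREVIATIONS = {"Dr.": "Doctor"}
MONTHS = {"Jan": "January"}
HINDI_NUMBERS = {7: "saat", 18: "athaarah"}


def normalize(text: str) -> str:
    """Expand abbreviations, dates, and half-past times into spoken form."""
    for abbreviation, spoken in ABBREVIATIONS.items():
        text = text.replace(abbreviation, spoken)

    def expand_date(match: re.Match) -> str:
        day, month = int(match.group(1)), match.group(2)
        if day in HINDI_NUMBERS and month in MONTHS:
            return f"{HINDI_NUMBERS[day]} {MONTHS[month]}"
        return match.group(0)  # leave unknown dates untouched

    def expand_time(match: re.Match) -> str:
        hour, minute, period = int(match.group(1)), match.group(2), match.group(3)
        # Only the "half past, morning" pattern from the example is covered.
        if minute == "30" and period == "AM" and hour in HINDI_NUMBERS:
            return f"saadhe {HINDI_NUMBERS[hour]} baje subah"
        return match.group(0)

    text = re.sub(r"\b(\d{1,2}) ([A-Z][a-z]{2})\b", expand_date, text)
    text = re.sub(r"\b(\d{1,2}):(\d{2}) (AM|PM)\b", expand_time, text)
    return text


print(normalize("Dr. Lal PathLabs ka appointment 18 Jan ko 7:30 AM pe hai."))
```

Note that "PathLabs" passes through unchanged because no rule matches it, which is the behaviour the text asks for with proper nouns. A production normalizer needs far broader coverage (PIN codes, phone numbers, currency), which is why hand-written rules like these do not scale.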

Prosody and intonation

Prosody is the rhythm, stress, and intonation pattern of speech. It is what makes a question sound like a question and a statement sound like a statement. It is also what separates a natural-sounding voice from a flat, robotic one. Hindi prosody is fundamentally different from English: stress patterns are more evenly distributed across syllables, and the pitch contour of a Hindi sentence follows a different arc than its English translation. When a TTS model trained on English data generates Hindi, it often applies English stress patterns to Hindi words, producing audio that is technically intelligible but sounds off to native speakers.

Bulbul V3 uses an LLM-based text analysis layer that automatically infers where to emphasize, when to pause, and how to modulate tone and pacing — before any audio is generated. The model has separate prosodic representations for each of the 11 supported languages, so Hindi narration has the rising-falling pattern characteristic of North Indian speech, while Tamil output follows the more syllable-timed rhythm of Dravidian languages. This per-language prosody modeling is why Sarvam achieved the highest listener preference in the Josh Talks blind evaluation — 20,000+ votes from 500+ annotators across 11 languages confirmed that Bulbul V3 sounds more natural than ElevenLabs, Cartesia, and other competitors.

Code-switching

An estimated 350 million Indians speak English as a second language, and the vast majority of them mix it freely with their primary language. A customer support agent might say "Aapka order dispatch ho gaya hai, expected delivery by tomorrow evening." A teacher might explain "Friction ka coefficient zyada matlab zyada resistance, simple." This is not broken grammar. It is normal Indian communication, and any TTS system that cannot handle it will produce jarring, unnatural output every time a language boundary appears.

Sarvam handles code-switching at the model level, not through a pipeline that detects language boundaries and switches between separate engines. When Bulbul V3 encounters a Hinglish sentence, it generates audio in a single pass with smooth transitions between Hindi and English segments. No pauses, no voice quality changes, no accent shifts. The same applies to Tanglish, Benglish, Marathi-English, and other mixed-language patterns. This is particularly valuable for voice agents and IVR systems where callers naturally speak in mixed languages.
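For contrast, the pipeline approach described above can be sketched as a script-based segmenter. This illustrates the technique Sarvam says it does not use; the function names `word_script` and `script_runs` are hypothetical, and the check covers only the Devanagari Unicode block (U+0900 to U+097F).

```python
from typing import List, Tuple


def word_script(word: str) -> str:
    """Classify a word by the script of its characters."""
    if any("\u0900" <= ch <= "\u097f" for ch in word):
        return "devanagari"
    if any(ch.isascii() and ch.isalpha() for ch in word):
        return "latin"
    return "other"  # digits, punctuation, and so on


def script_runs(text: str) -> List[Tuple[str, str]]:
    """Group consecutive words that share a script into runs."""
    runs: List[Tuple[str, str]] = []
    for word in text.split():
        script = word_script(word)
        # Script-neutral words are attached to the run before them.
        if runs and (script == runs[-1][0] or script == "other"):
            previous_script, previous_text = runs[-1]
            runs[-1] = (previous_script, f"{previous_text} {word}")
        else:
            runs.append((script, word))
    return runs


for script, segment in script_runs("मैं Suresh बोल रहा हूँ ABC Finance से."):
    print(script, "|", segment)
```

One short sentence already yields five runs, so a pipeline would hand off between engines four times. Romanized Hinglish such as "Aapka order dispatch ho gaya hai" comes back as a single Latin run, so script detection cannot find the Hindi-English boundaries at all.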

Voice selection and customization

Sarvam offers 35+ distinct voices across all 11 supported languages, sourced from professional voice artists and available at up to 48 kHz full-band quality. Voices are organized into categories (Edtech, Customer Support, Advertising & Announcements, Storytelling & Narration, and Social Media Content) so you can pick a voice that matches your use case rather than browsing a generic list. Each voice has a unique personality: Arjun works well for authoritative banking communications, Meera suits warm customer service interactions, and Ritu brings energy to expressive storytelling.

Beyond voice selection, you can adjust pace (0.5x to 2x), pitch, and emotional expressiveness through simple API parameters.

For enterprises that need a branded voice, Sarvam offers consent-based voice cloning with built-in safeguards: provide a short speech sample (30-60 seconds), give explicit consent, and the system creates a custom voice that maintains Bulbul V3's natural expressiveness. Browse the full catalog at /text-to-speech/voices.
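As a rough sketch of how the pace range above might be enforced before a request is sent, the helper below builds a request payload and rejects out-of-range values locally. The helper `build_tts_request` and the field names `pace` and `pitch` are assumptions made for illustration; check the API reference for the actual parameter names and the values each model accepts.

```python
from typing import Any, Dict, Optional

MIN_PACE, MAX_PACE = 0.5, 2.0  # range stated in the text (0.5x to 2x)


def build_tts_request(
    text: str,
    speaker: str,
    language_code: str = "hi-IN",
    pace: float = 1.0,
    pitch: Optional[float] = None,
) -> Dict[str, Any]:
    """Assemble a request payload, validating pace before any network call."""
    if not MIN_PACE <= pace <= MAX_PACE:
        raise ValueError(f"pace must be between {MIN_PACE} and {MAX_PACE}, got {pace}")
    payload: Dict[str, Any] = {
        "text": text,
        "target_language_code": language_code,
        "model": "bulbul:v3",
        "speaker": speaker,
        "pace": pace,  # assumed field name
    }
    if pitch is not None:
        payload["pitch"] = pitch  # assumed field name
    return payload


print(build_tts_request("Namaste, yeh ek test hai.", speaker="meera", pace=1.25))
```

Validating locally turns a bad value into an immediate error instead of a failed API call.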

Text to speech use cases

TTS is not a single-use technology. It powers everything from customer calls to YouTube channels to government services. Below is how organizations across India are using Sarvam TTS in production.

Voice AI and automation

Voice agents are replacing traditional call center recordings with dynamic, context-aware speech. A banking voice agent can read back a customer's account balance in Hinglish — "Aapka EMI payment of rupees 12,345 due hai by 15th March" — without pre-recording any of those specific sentences. In BFSI deployments, Bulbul V3 handles loan collection calls with financial terminology (EMI, credit records, late charges) across Hindi, Kannada, and other languages. Healthcare voice agents confirm appointments with complex medical terms like "Comprehensive Thyroid Profile with Anti-TPO Antibodies test" without mispronouncing them. Sarvam's sub-250ms streaming latency means callers hear responses without awkward pauses.
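If you want to check a first-byte latency figure in your own environment, one simple approach is to time how long a streaming response takes to deliver its first non-empty audio chunk. The sketch below works with any iterable of byte chunks; the function `time_to_first_chunk` is hypothetical and independent of any particular SDK.

```python
import time
from typing import Iterable, Tuple


def time_to_first_chunk(stream: Iterable[bytes]) -> Tuple[float, bytes]:
    """Return (latency in milliseconds, first non-empty chunk) for a stream."""
    start = time.perf_counter()
    for chunk in stream:
        if chunk:  # skip empty keep-alive chunks
            elapsed_ms = (time.perf_counter() - start) * 1000.0
            return elapsed_ms, chunk
    raise ValueError("stream ended before any audio arrived")


def simulated_stream():
    """Stand-in for a real audio stream; waits 50 ms before the first chunk."""
    time.sleep(0.05)
    yield b"\x00\x01"
    yield b"\x02\x03"


latency_ms, first_chunk = time_to_first_chunk(simulated_stream())
print(f"first chunk after {latency_ms:.0f} ms ({len(first_chunk)} bytes)")
```

For a fair measurement, start the timer before the request is sent, so that connection setup and network round-trip time are included.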

IVR systems use TTS to generate dynamic menu prompts that change based on context. Instead of maintaining thousands of pre-recorded audio files for every possible menu option in every language, telecom and banking companies generate prompts on the fly. A single API call creates "Aapka current balance hai rupees 12,345. PIN change karne ke liye 1 dabayein" in natural Hindi.
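The dynamic prompt above can be produced from a template before it is sent for synthesis. This is a minimal sketch with hypothetical helper names (`format_indian`, `balance_prompt`); it also applies Indian digit grouping, where the last three digits form one group and the remaining digits are grouped in twos.

```python
def format_indian(amount: int) -> str:
    """Format a non-negative integer with Indian digit grouping (12,34,567)."""
    if amount < 0:
        raise ValueError("amount must be non-negative")
    digits = str(amount)
    if len(digits) <= 3:
        return digits
    head, tail = digits[:-3], digits[-3:]
    groups = []
    while len(head) > 2:
        groups.insert(0, head[-2:])
        head = head[:-2]
    groups.insert(0, head)
    return ",".join(groups + [tail])


def balance_prompt(balance_rupees: int) -> str:
    """Build the Hinglish IVR prompt text for a given balance."""
    return (
        f"Aapka current balance hai rupees {format_indian(balance_rupees)}. "
        "PIN change karne ke liye 1 dabayein"
    )


print(balance_prompt(12345))
print(format_indian(1234567))
```

The resulting string is passed to the TTS request as the `text` value, so a single template covers every caller's balance.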

Voice notifications deliver OTP codes, appointment reminders, delivery updates, and payment confirmations as spoken calls. For the hundreds of millions of Indian users who are more comfortable with voice than text, spoken notifications have significantly higher engagement than SMS. TTS makes it possible to personalize every call without recording each message individually.

Content creation

YouTube creators in India are using TTS to produce videos faster. Educational channels, news aggregators, and storytelling accounts generate Hindi, Tamil, or Telugu narration from scripts without needing a recording studio. A creator who publishes daily can write a script and have broadcast-ready audio in seconds.

Podcast production becomes accessible to anyone with a script. Writers, journalists, and educators can turn articles into audio episodes in any Indian language. The AI voices are natural enough for extended listening, which matters for podcast formats where listeners spend 20-40 minutes with a single voice.

Audiobook creation at scale is now feasible for Indian language publishers. Recording a full audiobook traditionally takes weeks and costs lakhs. With TTS, a 300-page book can be converted to audio in hours. Sarvam's expressive voices with emotion control produce audiobooks that listeners actually enjoy, not the flat robotic output that gave early TTS audiobooks a bad reputation.

Voiceover for explainer videos, product demos, corporate presentations, and documentary narration can be generated in 11 languages from a single script. Production teams that previously needed separate voice artists for each language can now localize content in minutes.

Enterprise and education

Dubbing and localization teams use TTS to create first-draft voiceovers for video content that needs to reach multiple Indian language audiences. A marketing video produced in English can be localized to Hindi, Tamil, and Telugu in minutes. Professional dubbing studios use TTS as a reference track; content teams with smaller budgets use it as the final output.

E-learning platforms use TTS to narrate courses in regional languages. India's National Education Policy emphasizes mother-tongue instruction, and TTS makes it economically viable to offer the same course in 11 languages without recording 11 separate voiceover tracks. Students retain more information when they learn in their first language.

Corporate training content reaches a wider workforce when narrated in regional languages. A compliance training module for a bank with branches across India can be generated in Hindi, Marathi, Tamil, and Bengali from a single script. Updates to policies or procedures are reflected instantly without re-recording.

Presentations with embedded narration are more engaging than slide decks alone. Sales teams, trainers, and educators add TTS narration to their slides so the content can be consumed asynchronously without a live presenter.

Advertising teams produce radio spots, digital audio ads, and video ad voiceovers at scale. A national campaign that needs to run in 8 languages can generate all the audio variants from a single script, test different voices and tones, and iterate in hours instead of weeks.

Accessibility

Accessibility is one of the most important applications of text to speech. For visually impaired users, TTS enables access to websites, documents, and digital services. For users with reading difficulties, it provides an alternative way to consume written content. India has over 60 million people with visual impairments and hundreds of millions who are more comfortable with spoken information than written text. Sarvam TTS supports screen reader integration and can narrate content in the user's preferred Indian language at adjustable speeds.

Text to speech in 11 Indian languages

Sarvam TTS supports 11 Indian languages: Hindi, Tamil, Telugu, Bengali, Malayalam, Marathi, Gujarati, Kannada, Punjabi, Odia, and Assamese. Together, these languages cover over 95% of India's population. Each language has multiple voices optimized for that language's phonetic system and prosodic patterns. Click any language below to hear samples and see API integration details for that specific language.

How to convert text to speech with the API

Getting started

Sign up at sarvam.ai/try/tts-api to get your API key. Install the Python SDK with pip install sarvamai or the Node.js SDK with npm install sarvamai. Both SDKs use an OpenAI-compatible interface, so if you have integrated any LLM API before, the pattern will feel familiar. Your first TTS generation takes under 5 minutes from signup to working audio.

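Python
import base64

from sarvamai import SarvamAI

client = SarvamAI(
    api_subscription_key="YOUR_KEY"
)

audio = client.text_to_speech.convert(
    text="Namaste, yeh ek test hai.",
    target_language_code="hi-IN",
    model="bulbul:v3",
    speaker="meera"
)

# Assumption: each entry in `audios` is a base64-encoded string, so decode it
# to raw bytes before writing. Match the file extension to the audio format
# you requested.
with open("output.mp3", "wb") as f:
    f.write(base64.b64decode(audio.audios[0]))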

Node.js
import SarvamAI from "sarvamai";
import { writeFileSync } from "fs";

const client = new SarvamAI({
  apiSubscriptionKey: "YOUR_KEY"
});

const audio = await client.textToSpeech.convert({
  text: "Namaste, yeh ek test hai.",
  targetLanguageCode: "hi-IN",
  model: "bulbul:v3",
  speaker: "meera"
});

// Assumption: each entry in `audios` is a base64-encoded string, so decode it
// to raw bytes before writing.
writeFileSync("output.mp3", Buffer.from(audio.audios[0], "base64"));

The REST API handles batch generation for up to 2,500 characters per request. For real-time applications like voice agents, use the WebSocket streaming API for sub-250ms first-byte latency. Full API reference, code examples, and integration guides are available at docs.sarvam.ai. Explore the developer hub for SDKs, tutorials, and community resources.
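Scripts longer than the per-request limit need to be split before they are sent. The sketch below is one way to do that, breaking at sentence boundaries (including the Devanagari danda) so that no chunk ends mid-sentence unless a single sentence exceeds the limit. The function name `split_for_tts` is hypothetical, and the 2,500-character limit is taken from the paragraph above.

```python
import re
from typing import List

MAX_CHARS = 2500  # per-request limit stated in the text


def split_for_tts(text: str, max_chars: int = MAX_CHARS) -> List[str]:
    """Split text into chunks of at most max_chars, preferring sentence breaks."""
    sentences = re.split(r"(?<=[.!?।])\s+", text.strip())
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        if not sentence:
            continue
        # A single sentence longer than the limit is cut at the limit.
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


script = "Namaste. " * 1000  # roughly 9,000 characters
parts = split_for_tts(script)
print(len(parts), max(len(part) for part in parts))
```

Each chunk can then be synthesized with its own request and the audio segments concatenated in order.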

Pricing: Rs. 30 per 10,000 characters on the standard plan. A free tier with 1,000 credits is included for evaluation. Enterprise volume discounts are available. See full details at /api-pricing.
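For budgeting, the standard rate translates into a simple per-character calculation. The sketch below assumes billing is strictly linear in character count, with no rounding, minimum charge, or volume discount, and it ignores the free credits because the text does not state how credits map to characters. The function name `estimate_cost_rupees` is hypothetical.

```python
from decimal import Decimal

RUPEES_PER_10K_CHARS = Decimal("30")  # standard-plan rate stated in the text


def estimate_cost_rupees(num_chars: int) -> Decimal:
    """Estimate the standard-plan cost in rupees for a character count."""
    if num_chars < 0:
        raise ValueError("num_chars must be non-negative")
    return Decimal(num_chars) / Decimal(10_000) * RUPEES_PER_10K_CHARS


print(estimate_cost_rupees(2_500))      # one maximum-size REST request
print(estimate_cost_rupees(1_000_000))  # one million characters
```

Under these assumptions, a maximum-size REST request of 2,500 characters costs Rs. 7.50 and one million characters cost Rs. 3,000.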

Sarvam's voices carry emotion, handle code-switching between Hindi and English mid-sentence, pronounce Indian names correctly, and read abbreviations and numbers naturally. Listen to the samples below to hear the difference between a model built for Indian languages and one adapted from English.

Hear Indian text to speech quality

Emotion-rich and human-like voices

That was so funny lol! रिया ने जो किया उसके बाद मेरी हँसी रुक ही नहीं रही..

Effortless language switching

Hello… मैं Suresh बोल रहा हूँ ABC Finance से.

Authentic pronunciation of Indian names

Netaji Subhash Marg से Dayanand Road की तरफ,

Natural in abbreviations, acronyms and numbers

Hello! मैं Ankit बोल रहा हूँ Dr. Lal PathLabs से।

Text to speech benchmarks

Bulbul V3 is evaluated on two axes: naturalness (how human it sounds) and robustness (how accurately it renders text). For naturalness, an independent blind study by Josh Talks used 50-70 annotators per language, generating over 20,000 votes from 500+ participants. Bulbul V3 was the most-preferred model in both full-band (48 kHz) and telephony (8 kHz) categories, beating ElevenLabs v3 alpha, ElevenLabs v2.5 flash, and Cartesia Sonic-3. For robustness, Character Error Rate (CER) measures accuracy across Indian-specific domains: numerics, STEM terms, Indian named entities, code-mixing, Romanized text, and abbreviations. Bulbul V3 achieves the lowest CER across every domain. The benchmark dataset is publicly available on HuggingFace.
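Character Error Rate is conventionally defined as the character-level edit distance between a reference and a hypothesis, divided by the length of the reference. For TTS, the hypothesis is usually a transcript of the synthesized audio; the benchmark's exact protocol is not described here, so the sketch below shows only the standard metric, with hypothetical function names.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution_cost = 0 if ref_char == hyp_char else 1
            current.append(
                min(
                    previous[j] + 1,  # deletion
                    current[j - 1] + 1,  # insertion
                    previous[j - 1] + substitution_cost,  # match or substitution
                )
            )
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """CER = edit distance / number of characters in the reference."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


print(character_error_rate("kitten", "sitting"))
```

Python strings count Unicode code points, so for Indic scripts a consonant with a vowel sign counts as two characters. A careful evaluation may prefer to compare grapheme clusters instead.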

Listener preference rate (8 kHz), in percent. Higher is better.

Competitor               Competitor win rate   Tie rate   Bulbul V3 win rate
ElevenLabs Flash V2.5    10.37                 11.68      77.95
ElevenLabs V3 Alpha      28.14                 28.21      43.64
Cartesia Sonic-3         29.43                 30.49      40.08

Text to speech: Sarvam vs Google, AWS, Azure

Feature                               Sarvam          Google Cloud TTS   AWS Polly      Azure Speech
Indian languages supported            11              9                  1 (Hindi)      10
Code-switching (Hinglish, Tanglish)
Indian name pronunciation
Emotion and style control
Data sovereignty (India)
Sub-300ms streaming latency
Voice cloning (consent-based)
Free tier                             1,000 credits   1M chars/mo        5M chars/mo    500K chars/mo

The comparison above focuses on features that matter specifically for Indian language deployments. On raw English TTS quality, all four providers are competitive. The gap opens up on Indian languages. Google and Azure support a reasonable number of Indian languages, but the voices are generated by models trained primarily on English data with Indian language fine-tuning. This shows up in three ways: prosody that follows English patterns instead of native patterns, code-switching that breaks at language boundaries, and Indian names pronounced with English phonetic rules.

Sarvam's advantage is architectural. Bulbul V3 is not an English model adapted for Indian languages. It is a model built on Indian speech data from the start. Beyond quality, Bulbul V3 has the lowest rates of word skips and mispronunciations among tested models — critical for enterprise deployments where a skipped word in a banking notification or a mispronounced medication name is not acceptable. The CER benchmark dataset is publicly available on HuggingFace, covering numerics, STEM terms, Indian named entities, code-mixing, Romanized text, and abbreviations. For enterprises evaluating TTS providers for Indian-market deployments, the data sovereignty difference is also significant: Sarvam processes all data within India, which is a hard requirement for companies regulated by RBI, IRDAI, or SEBI. Full pricing comparison is available on the pricing page.

Indian enterprises in banking, insurance, telecom, and government operate under strict regulatory requirements. RBI mandates data localization for financial data. IRDAI requires audit trails for customer communications. DPDP Act governs how personal data is processed. Sarvam is built to meet these requirements: SOC 2 Type II and ISO 27001 certified, all data processed within India, no cross-border transfers, no training on customer data, and full audit-ready logging for every API call.

Enterprise-ready. Data stays in India.

Compliance, control, and data sovereignty. Not bolted on. Built in from day one.

No training on your data

Your API inputs are never used for model training. Zero data retention after processing unless you explicitly request it.

  • Data deleted after processing by default
  • Opt-in retention with configurable TTL
  • Separate data and model training pipelines
  • Full DPDP compliance

Deploy on your terms

All processing happens within India. No cross-border transfers. For regulated workloads, we support VPC and on-premise deployment.

  • India-only data processing
  • VPC and on-premise options
  • Consent-based voice cloning
  • Content safety filters built in

Security and governance

Every API call is logged and traceable. Role-based access, audit trails, and data residency controls built into the platform.

  • SOC 2 Type II
  • ISO 27001
  • DPDP compliant
  • Role-based access
  • Full audit trail
  • Data residency controls

Text to speech: frequently asked questions

Convert text to speech for free

35+ AI voices, 11 Indian languages, no signup needed