What is text to speech?
Text to speech (TTS), also called text to voice, speech synthesis, or AI voice generation, is a technology that converts written text into spoken audio. A text to speech converter takes any written input and produces natural, human-sounding audio on demand. Modern AI voice generators use deep neural networks to generate speech waveforms from text, producing audio that sounds fluid and expressive, with realistic intonation and rhythm.
For Indian languages, text to speech presents challenges that English-centric models were never designed to solve. Hindi alone combines a Devanagari script full of conjunct consonants with nasalized vowels and retroflex sounds that have no direct equivalent in English phonetics. Tamil has an agglutinative grammar in which a single word can encode what English needs an entire phrase to express. Telugu and Kannada share Dravidian roots but have distinct prosodic patterns. And then there is the reality of how Indians actually communicate: mixing Hindi and English mid-sentence (Hinglish), blending Tamil and English (Tanglish), or weaving English technical terms into Bengali (Benglish). Any serious AI voice generator for India must handle this code-switching natively, not as an afterthought.
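To make the code-switching problem concrete, here is a minimal Python sketch that splits a sentence into runs of the same script using Unicode block ranges. It is an illustration only, not how Bulbul V3 or any production TTS front end works: the function names `script_of`, `split_by_script`, and `is_code_switched` are hypothetical, and the approach only catches mixed-script text. Hinglish typed entirely in Latin letters would look like plain English to it, which is part of why code-switching is hard.

```python
# Unicode block ranges (inclusive) for some of the scripts mentioned above.
SCRIPT_RANGES = {
    "Devanagari": (0x0900, 0x097F),  # Hindi
    "Bengali": (0x0980, 0x09FF),
    "Tamil": (0x0B80, 0x0BFF),
    "Telugu": (0x0C00, 0x0C7F),
    "Kannada": (0x0C80, 0x0CFF),
}


def script_of(ch):
    """Return the script name for one character, or None for digits, punctuation, etc."""
    cp = ord(ch)
    for name, (lo, hi) in SCRIPT_RANGES.items():
        if lo <= cp <= hi:
            return name
    if ch.isascii() and ch.isalpha():
        return "Latin"
    return None


def split_by_script(text):
    """Group whitespace-separated tokens into consecutive runs that share a script."""
    runs = []
    for token in text.split():
        scripts = {s for s in map(script_of, token) if s is not None}
        if len(scripts) == 1:
            label = next(iter(scripts))
        elif scripts:
            label = "Mixed"  # one token containing several scripts
        else:
            label = "Other"  # numbers or punctuation only
        if runs and runs[-1][0] == label:
            runs[-1][1].append(token)
        else:
            runs.append((label, [token]))
    return [(label, " ".join(tokens)) for label, tokens in runs]


def is_code_switched(text):
    """True if the text contains tokens from more than one script."""
    labels = {label for label, _ in split_by_script(text) if label != "Other"}
    return len(labels) > 1


if __name__ == "__main__":
    for label, run in split_by_script("मुझे meeting में late हो गया"):
        print(label, "->", run)
```

A real system would need language identification that works within a single script, plus pronunciation rules for each language, so treat this as a way to see the problem rather than a solution to it.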
Sarvam's approach differs from that of global cloud providers, which train primarily on English data and then extend to Indian languages. Bulbul V3, the model behind Sarvam TTS, is trained from the ground up on Indian speech data collected across 11 languages. The model learns the natural rhythms of Hindi conversations, the melodic contours of Tamil narration, and the cadence of Bengali storytelling from real speakers. In an independent blind study by Josh Talks, over 500 annotators cast 20,000+ votes across 11 languages, and Bulbul V3 was the most-preferred model for naturalness, beating ElevenLabs and Cartesia Sonic in both full-band (48 kHz) and telephony (8 kHz) evaluations.