
Introducing Bulbul V3: Natural. Expressive. Production-ready.
Today we're releasing Bulbul V3, our most capable text-to-speech model, designed to deliver natural, expressive, and production-ready voices for Indian languages.
Voice is how India uses technology. Digital-native platforms onboard gig workers through voice agents, with no forms and no apps to download, just conversation. Students ask AI tutors to explain concepts in their native language. Banks handle customer queries through voice at volumes that would be impossible for human agents alone. Gamers interact with characters who speak back in their language.
These experiences demand Indian voices that are natural, expressive, and production-grade. Indian speech is complex by default. People switch languages mid-sentence. Accents vary by region. Names, abbreviations, and emotions matter as much as words. To work in India, voice has to handle all of this without breaking.
Bulbul V3 was evaluated in an independent third-party blind A/B human listening study across 11 languages. The results show that Bulbul V3 sets a new standard on three dimensions that matter most for real-world speech systems.
- Naturalness, where it achieves high listener preference at 48 kHz and is the most-preferred model against competitors in 8 kHz telephony
- Robustness across real-world input types, validated by low character error rates on challenging inputs such as code-mixing and numerics
- Stability, where it stands out as the most dependable model in its class, with minimal word skips and mispronunciations even in long-form and high-volume use
Together, these advances make Bulbul V3 not just better-sounding, but truly production-ready.
But before we tell you more, try something first.
Listen to the voices below and see if you can tell which one is human and which one is AI. Go with your instinct & then share your score while tagging @SarvamAI!
Naturalness based on Listener Preference
If a voice does not feel right in the first few seconds, whether in its pacing, emphasis, or emotional tone, listeners disengage almost immediately. Bulbul V3 is built on an LLM that analyzes text and infers the prosodic elements of natural speech, such as emphasis, pauses, tone, and pacing. By understanding context and intent rather than processing words as a simple sequence, it generates speech that sounds natural and aligns with the emotional content of what is being said.
The model automatically infers where to emphasize, when to pause, and how to modulate tone and pacing.
To evaluate model performance, Josh Talks conducted an independent third-party study using a blind A/B human listening test across 11 languages. Evaluators compared paired audio samples generated from identical text by our model and by leading competitors, including ElevenLabs v3 alpha, ElevenLabs v2.5 flash, and Cartesia Sonic-3, assessing qualities such as naturalness, stability, and prosody. Annotators also identified key failure modes in each sample.
The study tested two conditions, General full-band and 8 kHz telephony-grade, representing both studio-quality and real-world scenarios.
The study included 50 to 70 annotators per language, generating approximately 2,000 votes per language, resulting in over 20,000 total votes from more than 500 annotators.
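One common way to summarize votes from a blind A/B test like this is a per-pairing preference rate. The sketch below is illustrative only: the vote labels (`"A"`, `"B"`, `"tie"`), the `preference_rates` helper, and the sample tallies are assumptions, not the study's actual aggregation method or data.

```python
from collections import Counter


def preference_rates(votes: list[str]) -> dict[str, float]:
    """Return the share of votes for each outcome in one A/B pairing.

    Each vote is assumed to be "A", "B", or "tie" (hypothetical labels).
    """
    if not votes:
        raise ValueError("at least one vote is required")
    counts = Counter(votes)
    return {label: counts[label] / len(votes) for label in ("A", "B", "tie")}


# Made-up tallies for a single model pairing in one language.
votes = ["A"] * 6 + ["B"] * 3 + ["tie"] * 1
rates = preference_rates(votes)
print(rates)
```

Reporting ties separately, rather than splitting them between the two systems, keeps it visible when listeners could not tell the samples apart.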
General (full-band) evaluations: ElevenLabs v3 alpha leads on audio quality; Bulbul V3 outperforms Cartesia Sonic-3 and all other competitors.
8 kHz (telephony) evaluations: Bulbul V3 is the clear top performer across all competitors.
*The benchmark dataset for General & Telephony evaluations is shared here
Note: The model also introduces a low-latency streaming output mode, enabling audio to be generated and played back in near real time. This is critical for conversational applications, live interactions, and any experience where responsiveness directly impacts user engagement.
Experience the low latencies on the Sarvam Platform
Production-Grade Stability across Indian Languages
Stability captures the model’s ability to deliver error-free, predictable outputs at scale, a key requirement for enterprise deployments.
To measure this, every audio sample was reviewed by humans and tagged across fine-grained error categories, describing the exact failure mode in each utterance on the benchmark.
Key Metrics Evaluated:
- Missing Words - Percentage of evaluated audio samples where one or more words from the given text are not spoken at all.
- Mispronunciations - Percentage of evaluated audio samples with incorrect pronunciation of one or more words, including both minor mispronunciations and severe/cut-off words that affect clarity.
- Extra Content Errors - Percentage of evaluated audio samples where the speech contains extra spoken content not present in the text (additional words, phrases, or sounds).
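As a rough illustration of how these per-sample percentages can be computed from human tags, here is a minimal sketch. The tag names, the `error_rates` helper, and the example annotations are hypothetical and do not reflect the study's actual schema or results.

```python
from collections import Counter

# Hypothetical tag names mirroring the three categories above.
CATEGORIES = ("missing_words", "mispronunciation", "extra_content")


def error_rates(samples: list[set[str]]) -> dict[str, float]:
    """Return the percentage of samples tagged with each error category.

    Each sample is the set of tags assigned to one audio clip, so a clip
    counts at most once per category no matter how many words are affected.
    """
    if not samples:
        return {category: 0.0 for category in CATEGORIES}
    counts = Counter(tag for tags in samples for tag in tags if tag in CATEGORIES)
    return {category: 100.0 * counts[category] / len(samples) for category in CATEGORIES}


# Made-up annotations for four audio clips.
annotations = [
    set(),
    {"mispronunciation"},
    {"missing_words", "mispronunciation"},
    set(),
]
rates = error_rates(annotations)
print(rates)
```

Because each metric is defined per sample rather than per word, a clip with several mispronounced words contributes the same as a clip with one.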
Bulbul V3 shows the lowest rates of word skips and mispronunciations, while maintaining comparable performance on extra-content errors. In practice, this stability leads to fewer production failures and more natural conversations as systems scale.
Designed for Complex Indian Speech
Indian speech is complex by default. People switch languages mid-sentence. Accents vary by region. Names, abbreviations, and emotions matter as much as words. To work in India, voice has to handle all of this without breaking. Robustness, for us, means faithfully preserving meaning and intent even when inputs are messy, ambiguous, or structurally complex.
Bulbul V3 is trained for exactly these challenges. Below is a small preview of its capabilities:
Hinglish (Hindi + English)
Kannada
Hinglish (Hindi + English)
We evaluate robustness by measuring how faithfully the model preserves challenging real-world text (numerics, STEM terms, Indian named entities, code-mixed and Romanized speech, abbreviations, and URLs).
Method (automated): For each input, we generate multiple spoken-form variants and multiple transcription candidates using Gemini 2.5 Pro, compute CER across all pairs, and report the minimum CER. This reduces sensitivity to formatting differences and isolates true content errors.
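The minimum-over-pairs step can be sketched as follows. This is a simplified illustration, not the evaluation pipeline itself: it assumes the spoken-form variants and transcription candidates have already been produced as plain strings, and the helper names and example strings are hypothetical.

```python
from itertools import product


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in characters."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete a reference character
                            curr[j - 1] + 1,      # insert a hypothesis character
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not ref:
        return 0.0 if not hyp else 1.0
    return edit_distance(ref, hyp) / len(ref)


def min_pairwise_cer(spoken_forms: list[str], transcripts: list[str]) -> float:
    """Lowest CER over every (spoken-form variant, transcript candidate) pair."""
    return min(cer(ref, hyp) for ref, hyp in product(spoken_forms, transcripts))


# Made-up variants for one numeric input.
spoken_forms = ["pay two hundred fifty rupees", "pay two hundred and fifty rupees"]
transcripts = ["pay two hundred and fifty rupees", "pay two hundred and fifty rupee"]
score = min_pairwise_cer(spoken_forms, transcripts)
print(score)
```

Taking the minimum means a sample is not penalized when the audio matches any one acceptable way of reading the text, so only genuine content errors remain in the score.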
Result: Bulbul V3 achieves the lowest CER across every Indian-relevant domain, outperforming global TTS systems on numerics, STEM, named entities, code-mixing, Romanized text, and abbreviations.
*The benchmark dataset for category-wise robustness is shared here
In production environments, small errors compound quickly. A wrong digit breaks payment systems. A skipped medicine name erodes trust.
Bulbul V3 minimizes these failure modes, delivering content-accurate, stable speech across the inputs that matter for India-specific use cases.
A New Voice Library Powering Real-World Use Cases
Bulbul V3 introduces 30+ high-quality voices across 11 Indian languages, all sourced from professional voice artists. This ensures voices have the depth, clarity, and emotional range required for immersive experiences, especially in long-form audio and content-heavy use cases.
Experience the voices below
For a complete experience with 35+ voices and 11 languages, check out the Sarvam Platform.
Bulbul V3 supports Voice Cloning
Bulbul V3 supports voice cloning, allowing teams to create custom voices that maintain natural expressiveness and quality. This enables brand-specific voices, consistent character identities, and personalized experiences at scale. We've included a sample below to demonstrate Bulbul V3's capabilities.
Need a custom brand voice at scale?
We offer consent-based voice cloning with built-in safeguards for high-volume enterprise use cases. Reach out at developer@sarvam.ai.
More Languages, More Voices
Bulbul V3 supports 35+ voices across 11 Indian languages and will soon expand to 22 Indian languages, enabling high-quality, natural speech generation across regions, scripts, and accents. Together, these voices and languages bring natural speech to every corner of India.
TTS Language Playground
Audio Samples
Transcript
Assamese
গতিকে ইয়াৰ বাবে চৰকাৰে সমাজৰ দৰিদ্ৰ শ্ৰেণীক সহায় কৰিবলৈ বিভিন্ন আঁচনি আৰম্ভ কৰিছে, যাতে যেতিয়া জৰুৰীকালীন অৱস্থা হয়, মানুহে তেওঁলোকৰ জীৱনৰ সঞ্চয় হেৰুৱাব লগা নহয়।
Experience Bulbul V3 on APIs Today
For folks who want to dive straight into listening to our latest voices and test their limits, we have built an interactive no-code experience within the Sarvam Dashboard. To kick things off, we’re giving builders unlimited API access throughout February 2026. Push the model to its limits, experiment at scale, and start building without friction from day one.
- Get started instantly with our no-code, interactive experience on the Sarvam API Platform: Sarvam Platform (We are providing unlimited access to Bulbul V3 until 28th February 2026 for all users)
- Looking to integrate the Bulbul V3 API into your products or applications? Check out our API dev docs! We have ready-to-go SDKs and code snippets to help you get set up in minutes: Text to Speech - Sarvam API Docs
- Want to enable Bulbul V3 within your Voice Agents on Pipecat / Livekit?
- Building something exciting? Join our Discord Developer Community to stay up to date on new releases, share feedback, and collaborate directly with the Sarvam team: https://discord.com/invite/5rAsykttcs
Bulbul V3 opens up new possibilities across education, healthcare, video intelligence, and beyond. Explore, experiment, and build with it today.
Curious what else we're building? Explore our APIs and start creating.