# Text-to-Speech API for Indian Languages

Sarvam's Text-to-Speech API turns text into voices that feel human, carry emotion, and sound natural in every interaction. Powered by Bulbul V3, it delivers the most accurate text-to-speech for Indian languages with 35+ natural voices across 11 languages.

## Key Metrics

- 2B+ characters generated daily
- 11 Indian languages supported
- 35+ unique voices
- 80,000+ developers on the platform
- Sub-250ms first byte latency with WebSocket streaming

---

## Voice Capabilities

### Emotion-Rich and Human-Like Voices

Bulbul V3 produces expressive, emotionally nuanced speech for a natural listening experience. Its voices handle conversational Hinglish, casual fillers, emotional expression, and code-switching between languages within a single utterance.

### Effortless Language Switching

The model switches languages within the same conversation or phrase. It handles Hindi-English code-switching in customer support calls, Marathi-English in service interactions, and mixed-language content across all supported languages.

### Authentic Pronunciation of Indian Names

Indian names, places, and terms are pronounced correctly in context. Examples include navigation directions with Indian place names (Netaji Subhash Marg, Paranthe Wali Gali), train announcements (Mumbai Central Rajdhani Express), and regional proper nouns.

### Natural Handling of Abbreviations, Acronyms, and Numbers

Abbreviations, acronyms, and numbers are read out clearly and correctly. This covers medical terms (Anti-TPO Antibodies), news acronyms (ISRO, GSAT-30), currency amounts, dates, and technical terminology.

---

## Supported Languages

Sarvam TTS supports 11 Indian languages with native pronunciation:

| Language   | Native Script | Code  |
|------------|---------------|-------|
| Hindi      | हिन्दी          | hi-IN |
| Tamil      | தமிழ்          | ta-IN |
| Bengali    | বাংলা          | bn-IN |
| Telugu     | తెలుగు         | te-IN |
| Kannada    | ಕನ್ನಡ          | kn-IN |
| Malayalam  | മലയാളം        | ml-IN |
| Marathi    | मराठी          | mr-IN |
| Gujarati   | ગુજરાતી        | gu-IN |
| Punjabi    | ਪੰਜਾਬੀ         | pa-IN |
| Odia       | ଓଡ଼ିଆ          | or-IN |
| English    | English (Indian accent) | en-IN |
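
A client can check language codes before sending a request. The sketch below copies the codes from the table above; the helper name `resolve_language` and the case normalization are illustrative assumptions, not part of any Sarvam SDK.

```python
# Language codes copied from the table above.
SUPPORTED_LANGUAGES = {
    "hi-IN": "Hindi",
    "ta-IN": "Tamil",
    "bn-IN": "Bengali",
    "te-IN": "Telugu",
    "kn-IN": "Kannada",
    "ml-IN": "Malayalam",
    "mr-IN": "Marathi",
    "gu-IN": "Gujarati",
    "pa-IN": "Punjabi",
    "or-IN": "Odia",
    "en-IN": "English",
}


def resolve_language(code: str) -> str:
    """Return the language name for a code such as 'hi-IN' (hypothetical helper)."""
    normalized = code.strip()
    parts = normalized.split("-")
    if len(parts) == 2:
        # Accept 'hi-in' or 'HI-IN' by normalizing to 'hi-IN'.
        normalized = f"{parts[0].lower()}-{parts[1].upper()}"
    if normalized not in SUPPORTED_LANGUAGES:
        raise ValueError(f"Unsupported language code: {code!r}")
    return SUPPORTED_LANGUAGES[normalized]


print(resolve_language("ta-in"))
```

Validating locally avoids spending a network round trip on a request that would be rejected anyway.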

---

## Voice Library

35+ distinct speaker voices are available, including: Aditya, Ritu, Priya, Neha, Rahul, Pooja, Rohan, Simran, Kavya, Amit, Dev, Ishita, Shreya, Ratan, Varun, Manan, Sumit, Roopa, Kabir, Aayan, Shubh, Ashutosh, Advait, Amelia, and Sophia.

Featured showcase voices:

- **Ritu** (Hindi) -- Expressive, Emotional
- **Neha** (Telugu) -- Expressive, Emotional
- **Ishita** (Kannada) -- Expressive, Emotional
- **Suhani** (Bengali) -- Expressive, Emotional
- **Mani** -- Narrative, Emotional (suited for long-form storytelling)
- **Sneha** -- Narrative, Emotional
- **Nameer** -- Narrative, Emotional
- **Deba** -- Narrative, Emotional
- **Sehaj** -- Narrative, Emotional

---

## Developer Features

### Low Latency Streaming

WebSocket streaming returns the first audio byte in under 250ms, which suits real-time voice applications. The Streaming API accepts up to 2,500 characters per request and is ideal for voice agents and live applications.
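
Text longer than the 2,500-character limit has to be split before streaming. The following is a minimal sketch of one way to do that, preferring sentence boundaries; the function name and the boundary rule (whitespace after `.`, `!`, `?`, or the Devanagari danda) are assumptions for illustration, not behavior of the API itself.

```python
import re

STREAMING_CHAR_LIMIT = 2500  # per-request limit quoted above


def split_for_streaming(text: str, limit: int = STREAMING_CHAR_LIMIT) -> list[str]:
    """Split text into chunks of at most `limit` characters (hypothetical helper)."""
    # Break on whitespace that follows ., !, ?, or the danda (U+0964).
    sentences = re.split(r"(?<=[.!?\u0964])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        # A single sentence longer than the limit is cut at the limit.
        while len(sentence) > limit:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:limit])
            sentence = sentence[limit:]
        if not sentence:
            continue
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= limit:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


chunks = split_for_streaming("Namaste. " * 600)
print(len(chunks), max(len(c) for c in chunks))
```

Splitting at sentence ends keeps prosody intact within each request; cutting mid-sentence is only a fallback for unusually long sentences.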

### Configurable Controls

Fine-tune voice pace, expressiveness, and tone to match your brand. Parameters include:

- **Pace**: 0.5x to 2x speed
- **Temperature**: 0.01 to 1.0 for fine-tuned output quality
- **Text preprocessing**: Automatically enabled for better handling of numbers, dates, currencies, and mixed-language content
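
The ranges above can be enforced on the client before a request is built. This is a sketch under stated assumptions: the `VoiceControls` class and the payload key names `pace` and `temperature` are hypothetical and may not match the actual request schema.

```python
from dataclasses import dataclass

# Ranges quoted in the list above.
PACE_MIN, PACE_MAX = 0.5, 2.0
TEMPERATURE_MIN, TEMPERATURE_MAX = 0.01, 1.0


@dataclass(frozen=True)
class VoiceControls:
    """Hypothetical container for the two tunable parameters."""

    pace: float
    temperature: float

    def __post_init__(self) -> None:
        if not PACE_MIN <= self.pace <= PACE_MAX:
            raise ValueError(f"pace must be in [{PACE_MIN}, {PACE_MAX}], got {self.pace}")
        if not TEMPERATURE_MIN <= self.temperature <= TEMPERATURE_MAX:
            raise ValueError(
                f"temperature must be in [{TEMPERATURE_MIN}, {TEMPERATURE_MAX}], "
                f"got {self.temperature}"
            )

    def as_payload(self) -> dict:
        # Key names are assumptions, not confirmed API fields.
        return {"pace": self.pace, "temperature": self.temperature}


print(VoiceControls(pace=1.2, temperature=0.5).as_payload())
```

No defaults are set here because this page does not state them.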

### Plug-and-Play Integrations

Deploy a voice agent in under 10 minutes with SDKs for Python and Node.js.

### API Options

Two API types are available:

- **REST API**: Instant audio generation, best for quick conversions up to 500 characters
- **Streaming API**: WebSocket-based, real-time low-latency audio generation for voice agents and live applications, supports up to 2,500 characters per request
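
The choice between the two APIs can be expressed as a small routing rule. The sketch below treats the 500 and 2,500 character figures above as thresholds; the function name, the `realtime` flag, and the returned labels are illustrative assumptions.

```python
REST_CHAR_LIMIT = 500         # "quick conversions up to 500 characters"
STREAMING_CHAR_LIMIT = 2500   # per-request streaming limit


def choose_api(text: str, realtime: bool = False) -> str:
    """Return 'rest', 'streaming', or 'streaming-chunked' (hypothetical labels)."""
    length = len(text)
    if length == 0:
        raise ValueError("text must not be empty")
    if length > STREAMING_CHAR_LIMIT:
        # Too long for one request: split first, then stream each chunk.
        return "streaming-chunked"
    if realtime or length > REST_CHAR_LIMIT:
        return "streaming"
    return "rest"


print(choose_api("Your order has been shipped."))
print(choose_api("Hello!", realtime=True))
```

Short, non-interactive text goes to REST; anything latency-sensitive or longer than 500 characters goes to streaming.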

### Supported Audio Formats

Eight audio formats are supported: MP3, WAV, AAC, OPUS, FLAC (lossless), PCM (LINEAR16), MULAW, and ALAW. Sample rates are configurable: 8kHz, 16kHz, 22.05kHz, or 24kHz.
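
For the uncompressed and companded formats, the sample rate directly determines bandwidth. The sketch below validates a format and sample rate pair and estimates raw bytes per second, assuming mono audio, 16 bits per sample for LINEAR16 PCM, and 8 bits per sample for MULAW and ALAW. The lowercase format identifiers are assumptions, and this page does not say whether every format is available at every sample rate.

```python
AUDIO_FORMATS = {"mp3", "wav", "aac", "opus", "flac", "pcm", "mulaw", "alaw"}
SAMPLE_RATES_HZ = {8000, 16000, 22050, 24000}

# Bytes per sample for formats with a fixed sample size (mono assumed).
BYTES_PER_SAMPLE = {"pcm": 2, "mulaw": 1, "alaw": 1}


def validate_audio_config(fmt: str, sample_rate_hz: int) -> tuple[str, int]:
    """Check a format and sample rate against the lists above (hypothetical helper)."""
    fmt = fmt.lower()
    if fmt not in AUDIO_FORMATS:
        raise ValueError(f"Unsupported format: {fmt!r}")
    if sample_rate_hz not in SAMPLE_RATES_HZ:
        raise ValueError(f"Unsupported sample rate: {sample_rate_hz} Hz")
    return fmt, sample_rate_hz


def raw_bytes_per_second(fmt: str, sample_rate_hz: int) -> int:
    """Raw audio payload per second of mono speech, excluding container headers."""
    fmt, rate = validate_audio_config(fmt, sample_rate_hz)
    if fmt not in BYTES_PER_SAMPLE:
        raise ValueError(f"Size of {fmt!r} depends on the encoder, not just the rate")
    return BYTES_PER_SAMPLE[fmt] * rate


print(raw_bytes_per_second("pcm", 24000))    # 2 bytes x 24,000 samples
print(raw_bytes_per_second("mulaw", 8000))   # 1 byte x 8,000 samples
```

8kHz MULAW or ALAW is the usual telephony choice, while 24kHz PCM costs six times the bandwidth in exchange for more audio detail.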

---

## Benchmarks

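Bulbul V3 delivers the lowest character error rates, outperforming global competitors across every category. In listener preference comparisons at 8kHz, listeners also chose Bulbul V3 more often than each competitor listed below.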

### Listener Preference Rate (8kHz)

| Competitor             | Competitor Win | Tie    | Bulbul V3 Win |
|------------------------|----------------|--------|---------------|
| ElevenLabs Flash V2.5  | 10.37%         | 11.68% | 77.95%        |
| ElevenLabs V3 Alpha    | 28.14%         | 28.21% | 43.64%        |
| Cartesia Sonic-3       | 29.43%         | 30.49% | 40.08%        |
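
Because ties make up a large share of some rows, it can help to look at only the comparisons where listeners expressed a preference. The sketch below derives that figure from the table; the "decisive win share" metric is an illustrative calculation, not one reported in the benchmark.

```python
# Rows copied from the table above: (competitor win %, tie %, Bulbul V3 win %).
PREFERENCE = {
    "ElevenLabs Flash V2.5": (10.37, 11.68, 77.95),
    "ElevenLabs V3 Alpha": (28.14, 28.21, 43.64),
    "Cartesia Sonic-3": (29.43, 30.49, 40.08),
}


def decisive_win_share(competitor_win: float, bulbul_win: float) -> float:
    """Bulbul V3's percentage of the comparisons that did not end in a tie."""
    return 100.0 * bulbul_win / (competitor_win + bulbul_win)


for name, (competitor_win, _tie, bulbul_win) in PREFERENCE.items():
    share = decisive_win_share(competitor_win, bulbul_win)
    print(f"{name}: {share:.1f}% of non-tied comparisons")
```

By this measure Bulbul V3 is preferred in roughly 88%, 61%, and 58% of non-tied comparisons against the three systems, in table order.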

---

## Use Cases

### Dubbing and Localization

Natural voiceovers for multilingual media and public communication.

- Public announcements
- Educational content
- Marketing promos and ads
- Podcast and informational videos

### Voice Agents

Real-time, human-like speech for customer-facing and internal agents.

- Customer support
- Sales and lead qualification
- Edtech tutors
- Social and companion bots

### Enterprise Training and Communications

Clear, consistent voice for structured, informational content.

- Company-wide announcements
- Product walkthroughs
- Employee training and enablement

### Accessibility

Sarvam TTS powers applications that make content accessible to visually impaired users and those who prefer audio consumption, across all 11 Indian languages.

---

## Voice Cloning

Voice cloning is available, with consent and safeguards, for high-volume enterprise use cases. A separate voice artist program lets professional voice actors, content creators, dubbing artists, and anyone with a clear, expressive voice contribute. Voice artists retain full ownership of their voice identity and earn passive income based on usage volume, paid out monthly.

---

## Pricing

**Base plan: Rs. 30 for 10,000 characters.**

- Free trial included, no credit card required
- Get API keys instantly
- Volume discounts available
- Enterprise pricing available
- Flexible pricing plans
- Usage analytics included
- Integration with APIs
- Best for startups

---

## FAQ

**What languages does Text to Speech support?**
The Text to Speech API powered by Bulbul V3 supports 11 Indian languages: Hindi, Bengali, Tamil, Telugu, Gujarati, Kannada, Malayalam, Marathi, Punjabi, Odia, and English (Indian accent). Each language supports multiple speaker voices with different characteristics.

**What voices are available?**
With Bulbul V3, there are 35+ distinct speaker voices available, including Aditya, Ritu, Priya, Neha, Rahul, Pooja, Rohan, Simran, Kavya, Amit, Dev, Ishita, Shreya, Ratan, Varun, Manan, Sumit, Roopa, Kabir, Aayan, Shubh, Ashutosh, Advait, Amelia, and Sophia.

**Can I control voice characteristics like pitch and pace?**
Bulbul V3 provides control over pace (0.5x to 2x speed) and temperature (0.01 to 1.0) for fine-tuned output quality. A separate pitch parameter is not listed among the available controls. Text preprocessing is automatically enabled for better handling of numbers, dates, currencies, and mixed-language content.

**What audio formats are supported?**
The TTS API supports 8 audio formats: MP3, WAV, AAC, OPUS, FLAC (lossless), PCM (LINEAR16), MULAW, and ALAW. Sample rates are configurable at 8kHz, 16kHz, 22.05kHz, or 24kHz, depending on quality requirements.

**What are the API options -- REST vs Streaming?**
Two API types are offered: REST API for instant audio generation (best for quick conversions up to 500 characters), and Streaming API via WebSocket for real-time, low-latency audio generation ideal for voice agents and live applications. Streaming supports up to 2,500 characters per request.