Sarvam AI

Speech to Text for Indian Languages

The most accurate voice to text converter for Hindi, Tamil, Telugu, Bengali, and 18 more Indian languages. Convert audio to text with real-time streaming, speaker diarization, and code-mixing support. Try the speech to text converter below or integrate via API.

At a glance

  • What: Speech to text (voice to text / audio to text) API for 22 Indian languages
  • Model: Saaras v3 — trained on 1M+ hours of real Indian audio, 4-stage training pipeline. Read the technical deep-dive
  • Latency: <150ms time-to-first-token (Fast mode), 4 streaming modes via WebSocket
  • Features: Speaker diarization, code-mixing (Hinglish/Tanglish), word-level timestamps, 4 output modes
  • Pricing: Free credits on signup, then Rs. 1.5/min. See pricing
  • Try it: Free playground below — no signup needed

What is speech to text?

Speech to text (STT) — also called voice to text, audio to text, or automatic speech recognition (ASR) — is a technology that converts spoken audio into written text using artificial intelligence. A speech to text converter takes any audio input (phone calls, meetings, voice notes, podcasts, live broadcasts) and produces an accurate transcript, either in real time or from recordings. Modern voice recognition systems use deep neural networks to handle accents, background noise, multiple speakers, and mid-sentence language switching.

For Indian languages, speech to text presents challenges that English-centric models struggle with. Hindi speakers routinely switch to English and back within the same sentence. Tamil's agglutinative grammar means a single word can encode what English needs an entire phrase to express, making word boundary detection harder. Telugu and Kannada share Dravidian roots but have distinct phonetic systems. And Indian telephony audio is often 8kHz with significant background noise, far from the clean studio recordings that most global ASR models are optimized for. Any serious voice to text converter for India must handle these realities natively, not as edge cases.

Sarvam's approach starts from Indian audio. Saaras v3, the model behind Sarvam's speech to text converter, is trained on over 1 million hours of real Indian speech data through a 4-stage pipeline: large-scale pre-training, supervised fine-tuning, reinforcement learning, and post-training for long-tail errors. The model achieves 19.31% WER on the IndicVoices benchmark across 10 languages — down from 22% in the previous version — and outperforms GPT-4o Transcribe, Gemini 3 Pro, Deepgram Nova3, and Scribe v2 on Indian language accuracy. For Indian-accented English, Sarvam validated on the Svarah benchmark (9.6 hours, 117 speakers across 65 districts in 19 Indian states). Whether you need speech to text for Hindi, speech to text for Tamil, or any of the other 20 supported languages — Saaras v3 is the most accurate option available.
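For readers unfamiliar with the metric quoted above, word error rate (WER) is the word-level edit distance between a reference transcript and the model output, divided by the number of reference words. The following is a minimal sketch of that calculation, not Sarvam's evaluation code; the function name and the sample sentences are illustrative assumptions, and real benchmarks usually apply text normalization before scoring.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Illustrative sentences (not taken from any benchmark).
reference = "mera order ka status kya hai"
hypothesis = "mera order status kya hain"
print(f"WER = {word_error_rate(reference, hypothesis):.2%}")  # WER = 33.33%
```

Here one word is dropped and one is substituted, so two errors over six reference words gives 33.33%.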

Why teams choose Sarvam for speech to text

Streaming-first architecture

First token in under 150ms. Three inference modes: Accurate, Balanced, and Fast.

Built for real speech

Trained on over 1 million hours of real-world audio. Code-mixed speech, noisy telephony, regional accents.

Integrates in minutes

Official Python and Node.js SDKs. Native support for Pipecat and LiveKit.

22 Indian languages, one model

Every scheduled language plus English, handled by a single multilingual model with automatic detection.

More than a transcript

Speaker diarization, word-level timestamps, output format control, and automatic language detection.

How Sarvam speech to text works

Language detection and code-switching

Saaras v3 automatically detects the spoken language without requiring you to specify it upfront. When a speaker switches between Hindi and English mid-sentence, or drops Tamil words into an otherwise English conversation, the model follows along without losing accuracy at language boundaries. This is not a pipeline that detects language segments and routes them to separate engines. It is a single unified model that understands multilingual Indian speech as a continuous stream.

An estimated 350 million Indians speak English as a second language, and the vast majority mix it freely with their primary language. A customer might say "Mera order ka status kya hai, I placed it last Tuesday." A doctor might dictate "Patient ko high fever hai, CBC and LFT karwao." This is normal Indian communication, and Saaras v3 transcribes it accurately because it was trained on exactly this kind of speech.

Noise robustness

Most global ASR models are trained on clean, studio-quality audio and degrade significantly on real-world recordings. Indian production audio is rarely clean: call center recordings at 8kHz with compression artifacts, field interviews with traffic noise, group discussions with overlapping speakers, and mobile phone calls with network dropouts. Saaras v3 is trained on this exact kind of audio, so it maintains accuracy where other models break down.

The model handles ambient noise, crosstalk, varying microphone quality, and poor network connections without requiring pre-processing or noise cancellation on your end. This robustness comes from a 4-stage training pipeline: large-scale pre-training on diverse acoustic conditions, supervised fine-tuning, reinforcement learning, and a post-training stage specifically targeting long-tail errors and output stability. For voice agent deployments where callers are speaking from busy streets, factory floors, or crowded homes, this robustness is not a nice-to-have. It is what determines whether the system works in production or only in demos.
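Before sending recordings to any speech to text service, it can help to know whether a file is narrowband telephony audio (8kHz) or wideband. The sketch below uses only Python's standard `wave` module to inspect a WAV file held in memory. The helper names and the 8000 Hz threshold for "narrowband" are assumptions made for illustration, and nothing here is part of the Sarvam SDK.

```python
import io
import wave


def describe_wav(wav_bytes: bytes) -> dict:
    """Read sample rate, channel count and duration from WAV bytes."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        rate = wav.getframerate()
        frames = wav.getnframes()
        return {
            "sample_rate_hz": rate,
            "channels": wav.getnchannels(),
            "duration_s": frames / rate,
            # Assumption: treat 8 kHz or lower as telephony-style audio.
            "is_narrowband": rate <= 8000,
        }


def make_silent_wav(sample_rate: int, seconds: float) -> bytes:
    """Build a mono 16-bit silent WAV so the example needs no input file."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(sample_rate)
        wav.writeframes(b"\x00\x00" * int(sample_rate * seconds))
    return buffer.getvalue()


info = describe_wav(make_silent_wav(8000, 2.0))
print(info)
```

In practice you would pass the bytes of a real call recording instead of the generated silence.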

Speaker diarization

Speaker diarization answers the question "who said what?" in a multi-speaker recording. Sarvam's Batch API identifies different speakers in your audio and tags each word with a stable speaker ID. This is essential for meeting transcriptions where you need to attribute action items, call center analytics where you need to separate agent and customer utterances, and interview recordings where you need to distinguish between interviewer and interviewee.

Diarization works across all 22 supported languages and handles overlapping speech, speaker turns of varying length, and conversations where speakers have similar voice characteristics. The speaker labels are consistent across the entire recording, so Speaker 1 at minute 2 is the same person as Speaker 1 at minute 45.
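To make the "who said what" idea concrete, here is a small sketch that folds word-level speaker tags into speaker turns. The input shape (a list of dicts with `text`, `start`, `end`, and `speaker` keys) and the speaker labels are hypothetical and chosen only for illustration; check the API reference for the actual response schema.

```python
from itertools import groupby

# Hypothetical word-level output; field names are assumptions.
words = [
    {"text": "Mera", "start": 0.0, "end": 0.3, "speaker": "SPEAKER_1"},
    {"text": "order", "start": 0.3, "end": 0.7, "speaker": "SPEAKER_1"},
    {"text": "kahan", "start": 0.7, "end": 1.0, "speaker": "SPEAKER_1"},
    {"text": "hai", "start": 1.0, "end": 1.2, "speaker": "SPEAKER_1"},
    {"text": "Let", "start": 1.5, "end": 1.7, "speaker": "SPEAKER_2"},
    {"text": "me", "start": 1.7, "end": 1.8, "speaker": "SPEAKER_2"},
    {"text": "check", "start": 1.8, "end": 2.2, "speaker": "SPEAKER_2"},
    {"text": "Thanks", "start": 2.5, "end": 2.9, "speaker": "SPEAKER_1"},
]


def group_into_turns(tagged_words):
    """Merge consecutive words from the same speaker into one turn."""
    turns = []
    for speaker, run in groupby(tagged_words, key=lambda w: w["speaker"]):
        run = list(run)
        turns.append({
            "speaker": speaker,
            "start": run[0]["start"],
            "end": run[-1]["end"],
            "text": " ".join(w["text"] for w in run),
        })
    return turns


turns = group_into_turns(words)
for turn in turns:
    print(f'{turn["start"]:.1f}-{turn["end"]:.1f} {turn["speaker"]}: {turn["text"]}')
```

Because the speaker IDs are stable across the recording, the same grouping works on a full meeting or call without any relabeling step.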

Output flexibility

Sarvam STT provides four output modes to match different use cases. All modes include word-level timestamps by default.

Transcribe

With formatting and number normalization.

मेरा फोन नंबर है 9840950950

Translate

From Indic languages to English.

My phone number is 9840950950

Transliteration

Indian languages written in English letters.

mera phone number hai 9840950950

Verbatim

Preserves fillers and spoken numbers.

मेरा फोन नंबर है नौ आठ चार zero नौ पांच zero नौ पांच zero
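The difference between the Verbatim and Transcribe examples above is number normalization: spoken digits become a digit string. The sketch below imitates that step with a simple lookup table so the relationship between the two outputs is easy to see. It is a naive illustration only; the model does this internally, and the table, function name, and the rule of collapsing only runs of two or more digit words are all assumptions.

```python
# Illustrative lookup of spoken Hindi (and code-mixed English) digit words.
SPOKEN_DIGITS = {
    "zero": "0", "शून्य": "0", "एक": "1", "दो": "2", "तीन": "3",
    "चार": "4", "पांच": "5", "छह": "6", "सात": "7", "आठ": "8", "नौ": "9",
}


def normalize_digit_runs(verbatim: str, min_run: int = 2) -> str:
    """Collapse runs of spoken digit words into a single digit string."""
    output, run = [], []  # run holds (original_token, digit) pairs

    def flush():
        # Short runs stay as words, since a lone "एक" often means "a".
        if len(run) >= min_run:
            output.append("".join(digit for _, digit in run))
        else:
            output.extend(token for token, _ in run)
        run.clear()

    for token in verbatim.split():
        digit = SPOKEN_DIGITS.get(token.lower())
        if digit is None:
            flush()
            output.append(token)
        else:
            run.append((token, digit))
    flush()
    return " ".join(output)


verbatim = "मेरा फोन नंबर है नौ आठ चार zero नौ पांच zero नौ पांच zero"
print(normalize_digit_runs(verbatim))
```

Applied to the Verbatim sample, this yields the same text as the Transcribe sample, with the ten spoken digits joined into 9840950950.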

Voice dictation and typing

Voice dictation — speaking instead of typing — is one of the most natural ways to interact with digital devices. Sarvam's speech recognition engine powers voice typing in Indian languages, letting users dictate messages, emails, notes, and documents in Hindi, Tamil, Telugu, or any of the 22 supported languages. Unlike generic voice dictation tools that force you into English, Sarvam handles dictation in Hindi with correct Devanagari output, adds smart punctuation automatically, and keeps up with natural speaking speed. For users who type faster by talking, or who find keyboards difficult to use, voice-to-text dictation removes the friction between thought and written word.

Multi-speaker conversation, transcribed

Play the audio below and watch the transcript highlight in real time — each speaker gets a distinct color.

Panel Discussion

Every word transcribed. Every speaker identified.

0:00 - 0:14
Speaker 1

એ રીતે તમે આગળ વધો પોતાની ધર્મની માન્યતાઓને ને પોતાના અત્યાર સુધી તમે પ્રેક્ટિસ કરતા આવ્યા છો એને છોડીને તમે એક જ નિયમ પ્રમાણે ચાલો તો આ યોગ્ય નહીં હોય કારણ કે આપણે જ્યારે દેશ બન્યો ત્યારે દરેકને આશ્વાસન આપ્યું હતું

0:14 - 0:27
Speaker 1

કે એમને જેવી રીતે દેશમાં શાંતિથી રહેવું હશે તો એમને છૂટ આપીશું પોતાના ધર્મ પ્રમાણે પોતાની જાતિ પ્રમાણે કે જે તેમના નિયમો હોય કોઈ કુટુંબના હોય કે જે હોય ગામના એ પ્રમાણે એ પ્રેક્ટિસ કરી શકશે

0:28 - 0:38
Speaker 1

તો આ હાલ ડિસ્કશનમાં ચાલે છે ને એના ફાયદા ગેરફાયદાનો આપણે ડિસ્કસ કરવાનો છે કે શું આવું કરવું જોઈએ હાલ સમય આવી ગયો છે કે બધાને ફોર્સફુલી

0:35 - 0:36
Speaker 2

શું આવું કરવું જોઈએ

0:39 - 0:58
Speaker 1

આપણે એક જ નિયમ બનાવી દઈએ ને બધાને કહીએ કે તમે આ જ નિયમ ફોલો કરો ચાહે તમે હિન્દુ હોય તો પણ તમે આ જ નિયમ ફોલો કરો તમારા ધર્મ ને છોડી દો મુસલમાન હતા તો એમના ધર્મના જે નિયમો હોય એ છોડો અને આને જ ફોલો કરો શું આ સમય આવી ગયો છે કે આપણે આવો કાનૂન બનાવવો જોઈએ જેવી રીતે અમેરિકામાં યુરોપમાં છે ત્યાં કોઈ ધર્મ ધર્મ નથી જોવામાં આવતું જ્યારે એ લોકો કોર્ટમાં જાય છે

0:59 - 1:04
Speaker 1

ત્યાં કોર્ટના દેશના કાનૂન પ્રમાણે કેસ લો કરવામાં આવે તો શું આવું ઇન્ડિયામાં કરવું જોઈએ આ એના વિશે છે

1:06 - 1:07
Speaker 2

એક દેશ એક કાયદો

1:10 - 1:14
Speaker 2

આપણે કહી શકીએ કે એક દેશ અને એક કાયદો બધાએ આ જ કાયદો માનવું જોઈએ

1:13 - 1:32
Speaker 1

કાયદો માનવો જોઈએ હા માનવો જોઈએ પછી તમે તમારો ધર્મ ગમે તે કહેતું હોય તમે ગમે તે માન્યતામાં માનતા હોવ તમને યોગ્ય લાગતું હોય પણ ફોર્સ કરવામાં આવશે ગવર્મેન્ટ તરફથી કે હવે તમારે આ જ માનવું પડશે તો આ યોગ્ય છે દરેક સાથે આવી રીતે ફોર્સફુલી એમનું ઇમ્પ્લિમેન્ટ કરવું હાલ શું આ સમય યોગ્ય છે લોકો એટલા મેચ્યોર છે

1:32 - 1:34
Speaker 1

આ હાલ ડિસ્કશન ચાલે છે

1:34 - 1:51
Speaker 2

મારા હિસાબથી આમાં જોવા જાય તો બંનેના નુકસાન છે અને બંનેના જ ફાયદા છે આપણે અગર આને ઇમ્પ્લિમેન્ટ કરીએ એક દેશ એક કાયદો તો આપણા ધાર્મિક સ્વતંત્રતામાં બી આપણે હસ્તક્ષેપ થાય છે આપણા સંસ્કૃતિને બી આપણે નુકસાન થાય છે

1:51 - 2:06
Speaker 2

અને આને અમલ કરવું મુશ્કેલ છે કારણ કે લોકો ધર્મના પ્રતિ ભારતમાં વધારે પડતા છે બધાને પોતાનો ધર્મ પહેલો છે એટલે આને લાગુ કરવું તો મુશ્કેલ છે જ અને લોકો તો

2:04 - 2:11
Speaker 3

અને લોકશાહી શાસનથી ભી ડિફરન્ટ થશે ને કે લોકો માટે લોકોથી અને લોકો વડે ચાલતું શાસન

2:11 - 2:35
Speaker 3

તો એના અગેન્સ્ટના પ્રોટોકોલમાં આપણે જતા રહીશું કારણ કે આ લોકોથી તો નહીં જ ચાલતું હોય ને કારણ કે આમાં તો એક એક તમે રૂલ્સ બનાવી દીધું એન્ડ એ બધાના ઉપર તમે થોપી દેવામાં આવે કે તમે લોકો કરો જ બરાબર શાયદ એમ એ લોકો માટે બેનિફિશિયલ હોય પણ ખરા ના બી હોઈ શકે તો એમના ધાર્મિક રિચ્યુઅલ્સ પ્રમાણે એ કનેક્ટ થતું પણ હોય કે ના પી થતું હોય

2:15 - 2:16
Speaker 2

2:35 - 2:43
Speaker 3

તો આ લોકો માટે તો નહીં ચાલતું રહે તો આ લોકશાહીનો ભી ભંગ થાય એવું લાગી રહ્યું છે આમ આ આનાથી તો

2:44 - 2:59
Speaker 2

પણ આમાં ફાયદો બી એ છે કે આમાં કાયદા આમ સરળ થઈ જાય છે અને આમાં લોકો આમાં ગવર્મેન્ટ એવું વિચારે કે ધર્મના ઉપર નાગરિક નાગરિક તરીકે ઓળખ વધારે થાય અને આમાં સ્ત્રી પુરુષની સમાનતા બી થાય

3:00 - 3:08
Speaker 2

આપણને જેન્ડર ઇક્વાલિટી બી મળે આમાં સ્ત્રી પુરુષને બરાબર બી આવી શકે છે એ એક પોઇન્ટ છે જે મને ગમ્યો આમાં

3:07 - 3:07
Speaker 1

પોઈન્ટ છે

3:08 - 3:22
Speaker 1

મારે એવું માનવું છે કે આની અંદર બધા જે વિષયો છે એક સાથે લઈ લેવા કેમ કે જમીનનો ઇસ્યુ હોય અથવા તો શાદીનો હોય બીજા બધા જે ઇસ્યુઝના હોય એના કરતાં ગવર્મેન્ટે અમુક જે કોમન વિષયો હોય

3:22 - 3:34
Speaker 1

કે જેમાં કોઈ પ્રોબ્લેમ દરેકને વધારે થાય એવું નથી ધર્મની રીતે જેમ કે મારે કઈ પ્રોપર્ટી ખરીદવી છે એના માટેના નિયમો હોય કોઈ બિઝનેસના રિલેટેડ હોય અથવા બીજા કોઈ હોય તો પહેલાં શરૂઆત એમણે આનાથી કરવી જોઈએ
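A diarized transcript like the one above is easy to analyze once the timestamps are parsed. The sketch below takes six of the segments shown (start, end, speaker) and computes talk time per speaker plus any crosstalk between different speakers. The tuple layout and function names are assumptions for illustration, and the timestamps are only as precise as the whole seconds displayed on the page.

```python
from collections import defaultdict

# Segments copied from the transcript above: (start, end, speaker).
SEGMENTS = [
    ("0:28", "0:38", "Speaker 1"),
    ("0:35", "0:36", "Speaker 2"),
    ("0:39", "0:58", "Speaker 1"),
    ("1:06", "1:07", "Speaker 2"),
    ("1:10", "1:14", "Speaker 2"),
    ("1:13", "1:32", "Speaker 1"),
]


def to_seconds(stamp: str) -> int:
    """Convert an "M:SS" timestamp into seconds."""
    minutes, seconds = stamp.split(":")
    return int(minutes) * 60 + int(seconds)


def talk_time(segments) -> dict:
    """Total seconds spoken per speaker."""
    totals = defaultdict(int)
    for start, end, speaker in segments:
        totals[speaker] += to_seconds(end) - to_seconds(start)
    return dict(totals)


def overlaps(segments) -> list:
    """(speaker_a, speaker_b, shared_seconds) for overlapping segments."""
    found = []
    for i, (s1, e1, sp1) in enumerate(segments):
        for s2, e2, sp2 in segments[i + 1:]:
            shared = (min(to_seconds(e1), to_seconds(e2))
                      - max(to_seconds(s1), to_seconds(s2)))
            if sp1 != sp2 and shared > 0:
                found.append((sp1, sp2, shared))
    return found


print(talk_time(SEGMENTS))
print(overlaps(SEGMENTS))
```

For these six segments, Speaker 1 talks for 48 seconds and Speaker 2 for 6, with two one-second stretches of overlapping speech.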

Speech to text use cases

Speech to text powers everything from live voice agents to compliance archives to YouTube captions. Whether you need a simple audio to text converter for recorded files or a real-time voice to text solution for live interactions, below is how organizations across India are using Sarvam in production.

Voice agents

Real-time transcription for live voice agents and customer interactions.

  • Customer support
  • Sales & lead qualification
  • Edtech tutors
  • Social & companion bots

Post-call analytics

Analyze call transcripts to extract insights, sentiment, and actionable intelligence.

  • Sentiment analysis
  • Quality assurance & training
  • Compliance & record keeping
  • Performance metrics & insights

Subtitling & captioning

Generate accurate subtitles for video content across languages.

  • OTT platforms
  • YouTube creators
  • Corporate training
  • Live broadcasts

Beyond the use cases above, speech to text is essential for accessibility — real-time captions for hearing-impaired users, voice dictation for people with motor disabilities, and multimodal input for the hundreds of millions of Indians who are more comfortable speaking than typing. Sarvam supports real-time captioning in 22 Indian languages, making digital content accessible to audiences that global STT providers underserve.
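For the subtitling and captioning use cases, timestamped segments map naturally onto the SubRip (SRT) subtitle format. The sketch below converts a list of (start seconds, end seconds, text) tuples into SRT text. The input shape is an assumption made for illustration rather than the API's response schema, and the two sample segments are taken from the panel discussion transcript earlier on this page.

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as the HH:MM:SS,mmm timestamp used by SRT."""
    millis = round(seconds * 1000)
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def to_srt(segments) -> str:
    """Render (start_s, end_s, text) tuples as numbered SRT cues."""
    cues = []
    for index, (start, end, text) in enumerate(segments, start=1):
        cues.append(
            f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}"
        )
    return "\n\n".join(cues) + "\n"


# Two segments from the panel discussion (1:06-1:07 and 1:32-1:34).
segments = [
    (66.0, 67.0, "એક દેશ એક કાયદો"),
    (92.0, 94.0, "આ હાલ ડિસ્કશન ચાલે છે"),
]
srt_text = to_srt(segments)
print(srt_text)
```

The resulting text can be saved with a `.srt` extension and loaded by most video players and editing tools.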

Speech to text in 22 Indian languages

Sarvam's voice to text converter supports all 22 scheduled Indian languages plus English, covering over 99% of India's population. Whether you need speech to text in Hindi, speech to text in Tamil, voice to text in Telugu, or audio to text in Bengali, every language is handled by a single unified model with automatic language detection and native code-mixing support.

How to convert speech to text with the API

Getting started

You can convert speech to text online using the playground above, or integrate via API for production use. Sign up at sarvam.ai/try/stt-api to get your API key. Install the Python SDK with pip install sarvamai or the Node.js SDK with npm install sarvamai. Your first audio to text conversion takes under 5 minutes from signup to working output.

Python
from sarvamai import SarvamAI

client = SarvamAI(
    api_subscription_key="YOUR_KEY"
)

response = client.speech_to_text.transcribe(
    file_path="audio.wav",
    language="hi-IN",
    model="saaras:v3"
)

print(response.transcript)

Node.js
import SarvamAI from "sarvamai";

const client = new SarvamAI({
  apiSubscriptionKey: "YOUR_KEY"
});

const response = await client.speechToText.transcribe({
  filePath: "audio.wav",
  language: "hi-IN",
  model: "saaras:v3"
});

console.log(response.transcript);

Sarvam offers three API variants. The REST API handles files under 30 seconds with synchronous processing. The Batch API handles files up to 1 hour with speaker diarization and word-level timestamps. The Streaming API provides real-time transcription via WebSocket for voice agents, live captions, and interactive applications. Full API reference, code examples, and integration guides are available at docs.sarvam.ai. Explore the developer hub for SDKs, tutorials, and community resources.
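The limits described above suggest a simple routing rule for picking an API variant. The helper below is a hypothetical sketch of that rule and is not part of the Sarvam SDK: the function name, its parameters, and the decision to send any diarization request to the Batch API are assumptions based only on the description in the preceding paragraph.

```python
from typing import Optional

REST_MAX_SECONDS = 30        # REST: files under 30 seconds
BATCH_MAX_SECONDS = 60 * 60  # Batch: files up to 1 hour


def choose_api(duration_s: Optional[float], *, live: bool = False,
               need_diarization: bool = False) -> str:
    """Return "rest", "batch", or "streaming" for a transcription job."""
    if live:
        # Live audio has no known duration, so it goes to WebSocket streaming.
        return "streaming"
    if duration_s is None:
        raise ValueError("duration_s is required for recorded audio")
    if duration_s > BATCH_MAX_SECONDS:
        raise ValueError("longer than 1 hour: split the file first")
    if need_diarization or duration_s >= REST_MAX_SECONDS:
        return "batch"
    return "rest"


print(choose_api(12))                         # rest
print(choose_api(12, need_diarization=True))  # batch
print(choose_api(1800))                       # batch
print(choose_api(None, live=True))            # streaming
```

A routing helper like this keeps the choice in one place, so limits can be updated if the documented thresholds change.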

Pricing: Rs. 1.5 per minute of audio on the standard plan. Free credits are included on signup for evaluation. Enterprise volume discounts are available. See full details at /api-pricing.
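A rough budget can be worked out from the per-minute rate alone. The sketch below assumes usage is billed pro rata by the second at Rs. 1.5 per minute and ignores free credits and enterprise discounts; the billing granularity is an assumption, so confirm it on the pricing page before relying on the numbers.

```python
from decimal import Decimal, ROUND_HALF_UP

RATE_INR_PER_MINUTE = Decimal("1.5")  # standard plan rate quoted above


def estimate_cost_inr(total_seconds: int) -> Decimal:
    """Estimated cost in rupees, assuming per-second pro rata billing."""
    minutes = Decimal(total_seconds) / Decimal(60)
    cost = minutes * RATE_INR_PER_MINUTE
    return cost.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


# Example workload: 10,000 calls averaging 3 minutes each.
monthly_seconds = 10_000 * 3 * 60
print(f"Rs. {estimate_cost_inr(monthly_seconds)}")  # Rs. 45000.00
```

At this rate, 30,000 minutes of audio comes to Rs. 45,000 before any credits or discounts.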

From audio to action, in one API

Move from speech to structured transcripts with a single integration.

1. Capture audio your way

Stream live from a microphone or upload a finished recording. The same API handles both with first token in milliseconds.

2. Transcribe across 22 Indian languages

One multilingual model, every scheduled language. Code-mixing, accents, and noisy audio handled without per-language routing.

3. Get more than a transcript

Speaker turns, word-level timestamps, and language tags arrive with every response. No add-ons, no extra calls.

4. Ship to production

Drop into your stack with official Python and Node SDKs, plus native Pipecat and LiveKit support. Most teams are live the same day.
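For the live capture path in step 1, audio is typically sent over a streaming connection in small fixed-duration frames rather than as one file. The sketch below shows only that slicing step for raw PCM audio. The 16 kHz, 16-bit mono format and the 20 ms frame length are assumptions chosen for illustration; this page does not specify the Streaming API's wire format, so consult the API reference for the actual requirements.

```python
def frame_size_bytes(sample_rate_hz: int, frame_ms: int,
                     sample_width_bytes: int = 2, channels: int = 1) -> int:
    """Bytes needed to hold frame_ms of raw PCM audio."""
    samples_per_frame = sample_rate_hz * frame_ms // 1000
    return samples_per_frame * sample_width_bytes * channels


def iter_frames(pcm: bytes, frame_bytes: int):
    """Yield consecutive frames; the last one may be shorter."""
    for offset in range(0, len(pcm), frame_bytes):
        yield pcm[offset:offset + frame_bytes]


# One second of silent 16 kHz, 16-bit mono PCM as stand-in microphone data.
one_second = bytes(16000 * 2)
frame_bytes = frame_size_bytes(16000, 20)
frames = list(iter_frames(one_second, frame_bytes))
print(frame_bytes, len(frames))  # 640 50
```

Each frame would then be handed to whatever streaming client you use, such as the official SDK or a voice framework integration.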

Sarvam STT handles the audio that breaks other models: code-mixed conversations, noisy telephony, and multi-speaker discussions. Listen to the samples below and see the transcription quality on real Indian audio.

Hear Indian speech to text accuracy

Seamless code-mixing

HCG MCC Hospital ஒரு ground breaking achievement பண்ணிட்டாங்க.

Telephony-optimized

नमस्कार डब्ल्यू सी बैंक में संपर्क करने के लिए आपका धन्यवाद

Handle noisy audio

अच्छा, मौजूदा सरकार का कामकाज आपको कैसा लगता है?
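One quick way to see code-mixing in a transcript is to count which writing system each word uses. The sketch below does this with Python's standard `unicodedata` module, reading the script name from the first letter of each token. It is a rough heuristic offered for illustration, not how Saaras v3 detects languages, and it cannot tell apart languages that share a script or romanized Hinglish from English.

```python
import unicodedata
from collections import Counter


def script_of(token: str) -> str:
    """Script name of the first letter, e.g. "LATIN" or "TAMIL"."""
    for char in token:
        if char.isalpha():
            # Unicode names start with the script: "TAMIL LETTER PA".
            return unicodedata.name(char, "").split(" ")[0]
    return "OTHER"


def script_counts(text: str) -> Counter:
    """Count whitespace-separated tokens per script."""
    return Counter(script_of(token) for token in text.split())


tamil_english = "HCG MCC Hospital ஒரு ground breaking achievement பண்ணிட்டாங்க."
hindi = "नमस्कार डब्ल्यू सी बैंक में संपर्क करने के लिए आपका धन्यवाद"

print(script_counts(tamil_english))
print(script_counts(hindi))
```

On the code-mixed sample, six of the eight tokens are in Latin script and two are in Tamil script, while the telephony sample is entirely Devanagari.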

Speech to text performance

<250ms Median latency
100M+ Mins transcribed
>99.5% Uptime

Saaras v3 processes over 100 million minutes of audio monthly across production deployments in banking, telecom, healthcare, and government. The sub-250ms median latency enables real-time voice agents and live captioning. Enterprise SLAs guarantee 99.5% uptime with dedicated infrastructure for high-throughput workloads.
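To put the uptime figure in concrete terms, an availability percentage can be converted into a downtime allowance. The arithmetic below assumes a 30-day month as the measurement window, which is an assumption; the actual SLA terms define how uptime is measured.

```python
def downtime_budget_minutes(uptime_percent: float, days: int = 30) -> float:
    """Minutes of downtime allowed in the window at a given uptime."""
    total_minutes = days * 24 * 60
    return total_minutes * (100 - uptime_percent) / 100


budget = downtime_budget_minutes(99.5)
print(f"{budget:.0f} minutes, or {budget / 60:.1f} hours")  # 216 minutes, or 3.6 hours
```

So a 99.5% uptime commitment leaves room for about 3.6 hours of downtime in a 30-day month.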

Speech to text: Sarvam vs Google, AWS, Azure

Feature | Sarvam | Google Cloud STT | AWS Transcribe | Azure Speech
Indian languages supported | 22 | 8 | 1 (Hindi) | 9
Code-switching (Hinglish, Tanglish) | | | |
Speaker diarization | | | |
Noisy telephony (8kHz) optimized | | | |
Data sovereignty (India) | | | |
Real-time streaming | | | |
On-premise deployment | | | |
Free tier | Free credits | 60 min/mo | 60 min/mo | 5 hrs/mo

The comparison above focuses on features that matter specifically for Indian language deployments. In direct benchmarks, Saaras v3 was evaluated against GPT-4o Transcribe, Gemini 3 Pro, Deepgram Nova3, and Scribe v2 on the IndicVoices dataset. The performance gap widens significantly on the 12 less-resourced Indian languages, where competitors either lack coverage entirely or yield heavily degraded transcriptions. Sarvam also uses a rigorous 5-metric evaluation framework (WER, LLM-WER, COMET, Intent Score, Entity Preservation) rather than relying on standard WER alone — because for Indian languages, two transcriptions with identical WER can have very different semantic accuracy.
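The point that identical WER can hide different semantic accuracy is easy to show with a toy example. In the sketch below, two hypotheses each get one word wrong out of six, but only one of them corrupts an entity (the amount). Both metric functions are simplified assumptions: the error rate only handles hypotheses with the same word count as the reference, and the entity preservation score is a hypothetical definition, since this page does not describe how Sarvam computes its metrics.

```python
def substitution_rate(reference: str, hypothesis: str) -> float:
    """WER for the special case where only substitutions occur."""
    ref, hyp = reference.split(), hypothesis.split()
    if len(ref) != len(hyp):
        raise ValueError("this simplified metric needs equal word counts")
    return sum(r != h for r, h in zip(ref, hyp)) / len(ref)


def entity_preservation(entities, hypothesis: str) -> float:
    """Fraction of reference entities found verbatim in the hypothesis."""
    hyp_words = set(hypothesis.split())
    return sum(entity in hyp_words for entity in entities) / len(entities)


reference = "transfer 5000 rupees to Ramesh today"
entities = ["5000", "Ramesh"]           # hand-labelled for the example
hyp_a = "transfer 5000 rupees to Ramesh todays"  # harmless slip
hyp_b = "transfer 6000 rupees to Ramesh today"   # wrong amount

for name, hyp in (("A", hyp_a), ("B", hyp_b)):
    print(name,
          f"error rate={substitution_rate(reference, hyp):.2%}",
          f"entities kept={entity_preservation(entities, hyp):.0%}")
```

Both hypotheses score the same error rate, yet hypothesis B would send the wrong amount, which is the kind of difference an entity-aware metric is meant to surface.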

Sarvam's advantage is architectural. Saaras v3 is not an English model adapted for Indian languages. It is a model built on Indian speech data from the start, covering all 22 scheduled languages in a single unified model. Whether used as a speech to text API, a voice to text converter, or an audio to text converter for batch processing, the result is transcriptions that are accurate on the audio that Indian applications actually produce: noisy, code-mixed, multi-speaker, and dialectally diverse. For enterprises evaluating speech recognition providers for Indian-market deployments, the data sovereignty difference is also significant: Sarvam processes all data within India, which is a hard requirement for companies regulated by RBI, IRDAI, or SEBI. Full pricing comparison is available on the pricing page.

Indian enterprises in banking, insurance, telecom, and government operate under strict regulatory requirements. RBI mandates data localization for financial data. IRDAI requires audit trails for customer communications. The DPDP Act governs how personal data is processed. Sarvam is built to meet these requirements: SOC 2 Type II and ISO 27001 certified, all data processed within India, no cross-border transfers, no training on customer data, and full audit-ready logging for every API call.

Enterprise-ready. Data stays in India.

Compliance, control, and data sovereignty. Not bolted on. Built in from day one.

No training on your data

Your API inputs are never used for model training. Zero data retention after processing unless you explicitly request it.

  • Data deleted after processing by default
  • Opt-in retention with configurable TTL
  • Separate data and model training pipelines
  • Full DPDP compliance

Deploy on your terms

All processing happens within India. No cross-border transfers. For regulated workloads, we support VPC and on-premise deployment.

  • India-only data processing
  • VPC and on-premise options
  • Consent-based voice cloning
  • Content safety filters built in

Security and governance

Every API call is logged and traceable. Role-based access, audit trails, and data residency controls built into the platform.

  • SOC 2 Type II
  • ISO 27001
  • DPDP compliant
  • Role-based access
  • Full audit trail
  • Data residency controls

Speech to text: frequently asked questions

Convert speech to text for free

22 Indian languages, real-time voice to text, no signup needed