Speech to Text for Indian Languages

The most accurate voice to text converter for Hindi, Tamil, Telugu, Bengali, and 18 more Indian languages. Convert audio to text with real-time streaming, speaker diarization, and code-mixing support. Try the speech to text converter below or integrate via API.

At a glance

  • What: Speech to text (voice to text / audio to text) API for 22 Indian languages
  • Model: Saaras v3, trained on 1M+ hours of real Indian audio, 4-stage training pipeline. Read the technical deep-dive
  • Latency: <150ms time-to-first-token (Fast mode), 4 streaming modes via WebSocket
  • Features: Speaker diarization, code-mixing (Hinglish/Tanglish), word-level timestamps, 4 output modes
  • Pricing: Free credits on signup, then Rs. 1.5/min. See pricing
  • Try it: Free playground below, no signup needed

What is speech to text?

Speech to text (STT), also called voice to text, audio to text, or automatic speech recognition (ASR), converts spoken audio into written text using AI. Feed it a phone call, a meeting recording, a voice note, or a live broadcast, and it produces a transcript in real time or after the fact. Modern voice recognition systems use deep neural networks and can handle accents, background noise, multiple speakers, and mid-sentence language switching.

Indian languages are where English-centric models fall apart. Hindi speakers routinely switch to English and back within a single sentence. Tamil is agglutinative, meaning a single word can encode what English needs a full phrase for, which makes word boundary detection much harder. Telugu and Kannada share Dravidian roots but have distinct phonetic systems. And Indian telephony audio is often 8kHz with heavy background noise, nothing like the clean studio recordings global ASR models are optimized for. A voice to text converter that treats these as edge cases will not work in production.

Sarvam started from Indian audio, not English. Saaras v3, the model behind Sarvam's speech to text converter, was trained on over 1 million hours of real Indian speech through a 4-stage pipeline: large-scale pre-training, supervised fine-tuning, reinforcement learning, and post-training for long-tail errors. It achieves 19.31% WER on the IndicVoices benchmark across 10 languages (down from 22% in v2), and outperforms GPT-4o Transcribe, Gemini 3 Pro, Deepgram Nova3, and Scribe v2 on Indian language accuracy. For Indian-accented English, Sarvam validated on the Svarah benchmark (9.6 hours, 117 speakers across 65 districts in 19 Indian states). Whether you need speech to text for Hindi, speech to text for Tamil, or any of the other 20 supported languages, Saaras v3 is the best-performing option on the benchmarks that matter.

Why teams choose Sarvam for speech to text

Streaming-first architecture

Sub-150ms time to first token. Configurable Accurate, Balanced, and Fast modes for every latency requirement.

Code-switching & noise robust

Trained on 1M+ hours of real-world audio. Handles code-mixed speech, noisy telephony, and diverse accents.

Drop-in SDKs

Go live in under 10 minutes with official Python and Node.js SDKs. Pipecat & LiveKit ready.

22 Indian languages plus English

All 22 scheduled languages plus English. Unified multilingual model with automatic language detection.

Beyond raw transcripts

Speaker diarization, word-level timestamps, output format control, and automatic language detection built in.

How Sarvam speech to text works

Language detection and code-switching

Saaras v3 automatically detects the spoken language without requiring you to specify it upfront. When a speaker switches between Hindi and English mid-sentence, or drops Tamil words into an otherwise English conversation, the model follows along without losing accuracy at language boundaries. This is not a pipeline that detects language segments and routes them to separate engines. It is a single unified model that understands multilingual Indian speech as a continuous stream.

An estimated 350 million Indians speak English as a second language, and the vast majority mix it freely with their primary language. A customer might say "Mera order ka status kya hai, I placed it last Tuesday." A doctor might dictate "Patient ko high fever hai, CBC and LFT karwao." This is normal Indian communication, and Saaras v3 transcribes it accurately because it was trained on exactly this kind of speech.

Noise robustness

Most global ASR models are trained on clean, studio-quality audio. Real Indian audio is not clean. Call center recordings come in at 8kHz with compression artifacts. Field interviews have traffic noise. Group discussions have overlapping speakers. Mobile calls have network dropouts. Saaras v3 is trained on exactly this kind of audio, so it holds up where other models degrade.

No pre-processing or noise cancellation needed on your end. The robustness comes from a 4-stage training pipeline: pre-training on diverse acoustic conditions, supervised fine-tuning, reinforcement learning, and a post-training stage that specifically targets long-tail errors and output stability. For voice agent deployments where callers are speaking from busy streets, factory floors, or crowded homes, this is the difference between a system that works in production and one that only works in demos.

Speaker diarization

Speaker diarization answers "who said what?" in a multi-speaker recording. The Batch API identifies speakers and tags each word with a stable speaker ID. You need this for meeting transcriptions (attributing action items), call center analytics (separating agent and customer utterances), and interview recordings (distinguishing interviewer from interviewee).

Diarization works across all 22 supported languages and handles overlapping speech, speaker turns of varying length, and conversations where speakers have similar voice characteristics. The speaker labels are consistent across the entire recording, so Speaker 1 at minute 2 is the same person as Speaker 1 at minute 45.
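
As an illustration of how word-level speaker tags can be consumed downstream, the sketch below merges consecutive words from the same speaker into turns. The `Word` record, its field names, and the sample data are assumptions made for this example; they are not the Batch API's documented response schema.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Word:
    # Hypothetical shape of one diarized word; field names are assumptions.
    text: str
    start: float  # seconds from the start of the recording
    end: float    # seconds from the start of the recording
    speaker: str  # stable speaker ID


@dataclass
class Turn:
    speaker: str
    start: float
    end: float
    text: str


def words_to_turns(words: List[Word]) -> List[Turn]:
    """Merge consecutive words that share a speaker ID into one turn."""
    turns: List[Turn] = []
    for w in words:
        if turns and turns[-1].speaker == w.speaker:
            # Same speaker is still talking: extend the current turn.
            turns[-1].end = w.end
            turns[-1].text += " " + w.text
        else:
            # Speaker changed (or first word): open a new turn.
            turns.append(Turn(w.speaker, w.start, w.end, w.text))
    return turns


# Made-up sample data for the sketch.
words = [
    Word("Mera", 0.0, 0.3, "SPEAKER_1"),
    Word("order", 0.3, 0.7, "SPEAKER_1"),
    Word("Sure", 1.0, 1.3, "SPEAKER_2"),
    Word("checking", 1.3, 1.9, "SPEAKER_2"),
    Word("Thanks", 2.2, 2.6, "SPEAKER_1"),
]

turns = words_to_turns(words)
for t in turns:
    print(f"{t.start:.1f}-{t.end:.1f} {t.speaker}: {t.text}")
```

Because the speaker IDs are stable across the recording, a grouping like this is enough to separate agent and customer utterances without any extra matching step.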

Output flexibility

Sarvam STT provides four output modes to match different use cases. All modes include word-level timestamps by default.

Transcribe

With formatting and number normalization.

मेरा फोन नंबर है 9840950950

Translate

From Indic languages to English.

My phone number is 9840950950

Transliteration

Indian languages written in English letters.

mera phone number hai 9840950950

Verbatim

Preserves fillers and spoken numbers.

मेरा फोन नंबर है नौ आठ चार zero नौ पांच zero नौ पांच zero
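
To make the difference between Verbatim and Transcribe output concrete, here is a toy sketch of number normalization: it collapses a run of spoken digit words into a single digit string. This is only an illustration of the idea, not Sarvam's normalization logic; the `DIGIT_WORDS` table and the function name are hypothetical, and a real system would also need context to tell "एक" as a digit from "एक" as an ordinary word.

```python
# Toy mapping from spoken digit words to digits (illustrative assumption).
DIGIT_WORDS = {
    "zero": "0", "शून्य": "0",
    "एक": "1", "दो": "2", "तीन": "3", "चार": "4", "पांच": "5",
    "छह": "6", "सात": "7", "आठ": "8", "नौ": "9",
}


def normalize_spoken_digits(verbatim: str) -> str:
    """Join consecutive spoken digit words into one numeric token."""
    out = []
    run = []  # digits collected from the current run of digit words
    for token in verbatim.split():
        digit = DIGIT_WORDS.get(token.lower())
        if digit is not None:
            run.append(digit)
            continue
        if run:
            # A non-digit word ends the run: emit the digits as one token.
            out.append("".join(run))
            run = []
        out.append(token)
    if run:
        out.append("".join(run))
    return " ".join(out)


verbatim = "मेरा फोन नंबर है नौ आठ चार zero नौ पांच zero नौ पांच zero"
print(normalize_spoken_digits(verbatim))
```

Run on the Verbatim sample above, the sketch reproduces the Transcribe sample's digit string.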

Voice dictation and typing

Voice dictation (speaking instead of typing) is how a lot of people prefer to interact with devices, especially on mobile. Sarvam's speech recognition powers voice typing in Indian languages: dictate messages, emails, notes, and documents in Hindi, Tamil, Telugu, or any of the 22 supported languages. Unlike generic voice dictation that forces you into English, Sarvam outputs correct Devanagari (or whatever script you are dictating in), adds punctuation automatically, and keeps up with natural speaking speed. If you type faster by talking, or find keyboards difficult, voice dictation closes that gap.

Multi-speaker conversation, transcribed

Play the audio below and watch the transcript highlight in real time. Each speaker gets a distinct color.

Panel Discussion

Every word transcribed. Every speaker identified.

0:00 - 0:14
Speaker 1

એ રીતે તમે આગળ વધો પોતાની ધર્મની માન્યતાઓને ને પોતાના અત્યાર સુધી તમે પ્રેક્ટિસ કરતા આવ્યા છો એને છોડીને તમે એક જ નિયમ પ્રમાણે ચાલો તો આ યોગ્ય નહીં હોય કારણ કે આપણે જ્યારે દેશ બન્યો ત્યારે દરેકને આશ્વાસન આપ્યું હતું

0:14 - 0:27
Speaker 1

કે એમને જેવી રીતે દેશમાં શાંતિથી રહેવું હશે તો એમને છૂટ આપીશું પોતાના ધર્મ પ્રમાણે પોતાની જાતિ પ્રમાણે કે જે તેમના નિયમો હોય કોઈ કુટુંબના હોય કે જે હોય ગામના એ પ્રમાણે એ પ્રેક્ટિસ કરી શકશે

0:28 - 0:38
Speaker 1

તો આ હાલ ડિસ્કશનમાં ચાલે છે ને એના ફાયદા ગેરફાયદાનો આપણે ડિસ્કસ કરવાનો છે કે શું આવું કરવું જોઈએ હાલ સમય આવી ગયો છે કે બધાને ફોર્સફુલી

0:35 - 0:36
Speaker 2

શું આવું કરવું જોઈએ

0:39 - 0:58
Speaker 1

આપણે એક જ નિયમ બનાવી દઈએ ને બધાને કહીએ કે તમે આ જ નિયમ ફોલો કરો ચાહે તમે હિન્દુ હોય તો પણ તમે આ જ નિયમ ફોલો કરો તમારા ધર્મ ને છોડી દો મુસલમાન હતા તો એમના ધર્મના જે નિયમો હોય એ છોડો અને આને જ ફોલો કરો શું આ સમય આવી ગયો છે કે આપણે આવો કાનૂન બનાવવો જોઈએ જેવી રીતે અમેરિકામાં યુરોપમાં છે ત્યાં કોઈ ધર્મ ધર્મ નથી જોવામાં આવતું જ્યારે એ લોકો કોર્ટમાં જાય છે

0:59 - 1:04
Speaker 1

ત્યાં કોર્ટના દેશના કાનૂન પ્રમાણે કેસ લો કરવામાં આવે તો શું આવું ઇન્ડિયામાં કરવું જોઈએ આ એના વિશે છે

1:06 - 1:07
Speaker 2

એક દેશ એક કાયદો

1:10 - 1:14
Speaker 2

આપણે કહી શકીએ કે એક દેશ અને એક કાયદો બધાએ આ જ કાયદો માનવું જોઈએ

1:13 - 1:32
Speaker 1

કાયદો માનવો જોઈએ હા માનવો જોઈએ પછી તમે તમારો ધર્મ ગમે તે કહેતું હોય તમે ગમે તે માન્યતામાં માનતા હોવ તમને યોગ્ય લાગતું હોય પણ ફોર્સ કરવામાં આવશે ગવર્મેન્ટ તરફથી કે હવે તમારે આ જ માનવું પડશે તો આ યોગ્ય છે દરેક સાથે આવી રીતે ફોર્સફુલી એમનું ઇમ્પ્લિમેન્ટ કરવું હાલ શું આ સમય યોગ્ય છે લોકો એટલા મેચ્યોર છે

1:32 - 1:34
Speaker 1

આ હાલ ડિસ્કશન ચાલે છે

1:34 - 1:51
Speaker 2

મારા હિસાબથી આમાં જોવા જાય તો બંનેના નુકસાન છે અને બંનેના જ ફાયદા છે આપણે અગર આને ઇમ્પ્લિમેન્ટ કરીએ એક દેશ એક કાયદો તો આપણા ધાર્મિક સ્વતંત્રતામાં બી આપણે હસ્તક્ષેપ થાય છે આપણા સંસ્કૃતિને બી આપણે નુકસાન થાય છે

1:51 - 2:06
Speaker 2

અને આને અમલ કરવું મુશ્કેલ છે કારણ કે લોકો ધર્મના પ્રતિ ભારતમાં વધારે પડતા છે બધાને પોતાનો ધર્મ પહેલો છે એટલે આને લાગુ કરવું તો મુશ્કેલ છે જ અને લોકો તો

2:04 - 2:11
Speaker 3

અને લોકશાહી શાસનથી ભી ડિફરન્ટ થશે ને કે લોકો માટે લોકોથી અને લોકો વડે ચાલતું શાસન

2:11 - 2:35
Speaker 3

તો એના અગેન્સ્ટના પ્રોટોકોલમાં આપણે જતા રહીશું કારણ કે આ લોકોથી તો નહીં જ ચાલતું હોય ને કારણ કે આમાં તો એક એક તમે રૂલ્સ બનાવી દીધું એન્ડ એ બધાના ઉપર તમે થોપી દેવામાં આવે કે તમે લોકો કરો જ બરાબર શાયદ એમ એ લોકો માટે બેનિફિશિયલ હોય પણ ખરા ના બી હોઈ શકે તો એમના ધાર્મિક રિચ્યુઅલ્સ પ્રમાણે એ કનેક્ટ થતું પણ હોય કે ના પી થતું હોય

2:35 - 2:43
Speaker 3

તો આ લોકો માટે તો નહીં ચાલતું રહે તો આ લોકશાહીનો ભી ભંગ થાય એવું લાગી રહ્યું છે આમ આ આનાથી તો

2:44 - 2:59
Speaker 2

પણ આમાં ફાયદો બી એ છે કે આમાં કાયદા આમ સરળ થઈ જાય છે અને આમાં લોકો આમાં ગવર્મેન્ટ એવું વિચારે કે ધર્મના ઉપર નાગરિક નાગરિક તરીકે ઓળખ વધારે થાય અને આમાં સ્ત્રી પુરુષની સમાનતા બી થાય

3:00 - 3:08
Speaker 2

આપણને જેન્ડર ઇક્વાલિટી બી મળે આમાં સ્ત્રી પુરુષને બરાબર બી આવી શકે છે એ એક પોઇન્ટ છે જે મને ગમ્યો આમાં

3:07 - 3:07
Speaker 1

પોઈન્ટ છે

3:08 - 3:22
Speaker 1

મારે એવું માનવું છે કે આની અંદર બધા જે વિષયો છે એક સાથે લઈ લેવા કેમ કે જમીનનો ઇસ્યુ હોય અથવા તો શાદીનો હોય બીજા બધા જે ઇસ્યુઝના હોય એના કરતાં ગવર્મેન્ટે અમુક જે કોમન વિષયો હોય

3:22 - 3:34
Speaker 1

કે જેમાં કોઈ પ્રોબ્લેમ દરેકને વધારે થાય એવું નથી ધર્મની રીતે જેમ કે મારે કઈ પ્રોપર્ટી ખરીદવી છે એના માટેના નિયમો હોય કોઈ બિઝનેસના રિલેટેડ હોય અથવા બીજા કોઈ હોય તો પહેલાં શરૂઆત એમણે આનાથી કરવી જોઈએ

Speech to text use cases

Speech to text shows up everywhere: voice agents, compliance archives, YouTube captions, meeting notes. Whether you need a batch audio to text converter for recorded files or real-time voice to text for live interactions, here is how organizations across India are using Sarvam in production.

Voice agents

Real-time transcription for live voice agents and customer interactions.

  • Customer support
  • Sales & lead qualification
  • Edtech tutors
  • Social & companion bots

Post-call analytics

Analyze call transcripts to extract insights, sentiment, and actionable intelligence.

  • Sentiment analysis
  • Quality assurance & training
  • Compliance & record keeping
  • Performance metrics & insights

Subtitling & captioning

Generate accurate subtitles for video content across languages.

  • OTT platforms
  • YouTube creators
  • Corporate training
  • Live broadcasts

Beyond the use cases above, speech to text is essential for accessibility: real-time captions for hearing-impaired users, voice dictation for people with motor disabilities, and multimodal input for the hundreds of millions of Indians who are more comfortable speaking than typing. Sarvam supports real-time captioning in 22 Indian languages, making digital content accessible to audiences that global STT providers underserve.

Speech to text in 22 Indian languages

Sarvam's voice to text converter supports all 22 scheduled Indian languages plus English, covering over 99% of India's population. Whether you need speech to text in Hindi, speech to text in Tamil, voice to text in Telugu, or audio to text in Bengali, all languages are handled by a single unified model with automatic language detection and native code-mixing support.

How to convert speech to text with the API

Getting started

You can convert speech to text online using the playground above, or integrate via API for production use. Sign up at sarvam.ai/try/stt-api to get your API key. Install the Python SDK with pip install sarvamai or the Node.js SDK with npm install sarvamai. Your first audio to text conversion takes under 5 minutes from signup to working output.

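from sarvamai import SarvamAI

# Create the client with your API key.
client = SarvamAI(
    api_subscription_key="YOUR_KEY"
)

# Transcribe a local Hindi recording with Saaras v3.
response = client.speech_to_text.transcribe(
    file_path="audio.wav",
    language="hi-IN",
    model="saaras:v3"
)

# Print the transcribed text.
print(response.transcript)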

Sarvam offers three API variants. The REST API handles files under 30 seconds with synchronous processing. The Batch API handles files up to 1 hour with speaker diarization and word-level timestamps. The Streaming API provides real-time transcription via WebSocket for voice agents, live captions, and interactive applications. Full API reference, code examples, and integration guides are available at docs.sarvam.ai. Explore the developer hub for SDKs, tutorials, and community resources.
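
The choice between the three variants can be sketched as a small routing helper. The limits come from the paragraph above (under 30 seconds for REST, up to 1 hour for Batch); the enum, the function name, and the decision to send diarization requests to Batch are assumptions for illustration, not part of the SDK.

```python
from enum import Enum


class ApiVariant(Enum):
    REST = "rest"
    BATCH = "batch"
    STREAMING = "streaming"


REST_LIMIT_SECONDS = 30      # REST handles files under 30 seconds
BATCH_LIMIT_SECONDS = 3600   # Batch handles files up to 1 hour


def choose_variant(
    duration_seconds: float,
    live: bool = False,
    need_diarization: bool = False,
) -> ApiVariant:
    """Pick an API variant from the audio's duration and requirements."""
    if live:
        # Real-time audio (voice agents, live captions) goes over WebSocket.
        return ApiVariant.STREAMING
    if duration_seconds < REST_LIMIT_SECONDS and not need_diarization:
        return ApiVariant.REST
    if duration_seconds <= BATCH_LIMIT_SECONDS:
        # Assumption: diarization is requested through the Batch API.
        return ApiVariant.BATCH
    raise ValueError("Recording exceeds 1 hour; split it before submitting.")


print(choose_variant(12))                         # short clip
print(choose_variant(12, need_diarization=True))  # short clip with speakers
print(choose_variant(1800))                       # 30-minute recording
print(choose_variant(0, live=True))               # live microphone audio
```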

Pricing: Rs. 1.5 per minute of audio on the standard plan. Free credits are included on signup for evaluation. Enterprise volume discounts are available. See full details at /api-pricing.
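
For budgeting, the standard rate translates into a simple estimate. The sketch below assumes billing is strictly proportional to audio duration, and it ignores free credits, per-request rounding, and enterprise discounts, none of which are specified here; the function name is hypothetical.

```python
from decimal import Decimal

RATE_PER_MINUTE = Decimal("1.5")  # Rs. per minute on the standard plan


def estimate_cost(total_seconds: int) -> Decimal:
    """Estimate the cost in rupees for a given amount of audio."""
    minutes = Decimal(total_seconds) / Decimal(60)
    # Round to paise for display; actual billing granularity is an assumption.
    return (minutes * RATE_PER_MINUTE).quantize(Decimal("0.01"))


print(estimate_cost(90))       # a 90-second voice note
print(estimate_cost(600_000))  # 10,000 minutes of call recordings
```

At this rate, 10,000 minutes of audio comes to Rs. 15,000.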

From audio to action, in one API

Move from speech to structured transcripts with a single integration.

1. Capture audio your way

Stream live from a microphone or upload a finished recording. The same API handles both with first token in milliseconds.

2. Transcribe across 22 Indian languages

One multilingual model, every scheduled language. Code-mixing, accents, and noisy audio handled without per-language routing.

3. Get more than a transcript

Speaker turns, word-level timestamps, and language tags arrive with every response. No add-ons, no extra calls.

4. Ship to production

Drop into your stack with official Python and Node SDKs, plus native Pipecat and LiveKit support. Most teams are live the same day.

Hear Indian speech to text accuracy

Sarvam STT handles the audio that breaks other models: code-mixed conversations, noisy telephony, and multi-speaker discussions. Listen to the samples below and see the transcription quality on real Indian audio.

Seamless code-mixing

HCG MCC Hospital ஒரு ground breaking achievement பண்ணிட்டாங்க.

Telephony-optimized

नमस्कार डब्ल्यू सी बैंक में संपर्क करने के लिए आपका धन्यवाद

Handle noisy audio

अच्छा, मौजूदा सरकार का कामकाज आपको कैसा लगता है?

Speech to text performance

<250ms Median latency
100M+ Mins transcribed
>99.5% Uptime

Saaras v3 processes over 100 million minutes of audio monthly across production deployments in banking, telecom, healthcare, and government. The sub-250ms median latency enables real-time voice agents and live captioning. Enterprise SLAs guarantee 99.5% uptime with dedicated infrastructure for high-throughput workloads.
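
To put the uptime figure in operational terms, the sketch below converts an uptime percentage into a downtime budget. The 30-day window and the function name are assumptions for illustration; the actual SLA measurement window is not stated here.

```python
from decimal import Decimal


def downtime_budget_minutes(uptime_percent: str, days: int = 30) -> Decimal:
    """Maximum downtime, in minutes, that still meets the uptime target."""
    total_minutes = Decimal(days) * 24 * 60
    allowed_fraction = (Decimal(100) - Decimal(uptime_percent)) / Decimal(100)
    return total_minutes * allowed_fraction


budget = downtime_budget_minutes("99.5")
print(budget, "minutes, or", budget / 60, "hours")
```

Over a 30-day month, 99.5% uptime leaves room for 216 minutes (3.6 hours) of downtime.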

Speech to text: Sarvam vs Google, AWS, Azure

Feature | Sarvam | Google Cloud STT | AWS Transcribe | Azure Speech
Indian languages supported | 22 | 8 | 1 (Hindi) | 9
Code-switching (Hinglish, Tanglish) | | | |
Speaker diarization | | | |
Noisy telephony (8kHz) optimized | | | |
Data sovereignty (India) | | | |
Real-time streaming | | | |
On-premise deployment | | | |
Free tier | Free credits | 60 min/mo | 60 min/mo | 5 hrs/mo

This comparison focuses on what matters for Indian language deployments. In direct benchmarks, Saaras v3 was evaluated against GPT-4o Transcribe, Gemini 3 Pro, Deepgram Nova3, and Scribe v2 on the IndicVoices dataset. The gap is wide on the 12 less-resourced Indian languages, where competitors either lack coverage entirely or produce heavily degraded transcriptions. Sarvam also uses a 5-metric evaluation framework (WER, LLM-WER, COMET, Intent Score, Entity Preservation) instead of relying on WER alone, because for Indian languages, two transcriptions with identical WER can have very different semantic accuracy.
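
For reference, word error rate is the word-level edit distance (substitutions, deletions, and insertions) divided by the number of words in the reference. The sketch below is a minimal implementation of that standard definition, not Sarvam's evaluation code, and the example sentences are made up. It also shows why WER alone can mislead: two hypotheses with the same score can differ sharply in meaning.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level Levenshtein distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if r == h else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


reference = "send five hundred rupees to ravi"
hyp_minor = "send five hundred rupee to ravi"    # grammatical slip
hyp_major = "send nine hundred rupees to ravi"   # wrong amount

print(word_error_rate(reference, hyp_minor))
print(word_error_rate(reference, hyp_major))
```

Both hypotheses contain exactly one substituted word, so they score the same WER, yet only the second changes the amount being sent. Metrics such as Intent Score and Entity Preservation are aimed at that kind of difference.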

The difference is architectural. Saaras v3 is not an English model adapted for India. It was built on Indian speech data from the start, covering all 22 scheduled languages in a single model. The result is accurate transcription on the audio Indian applications actually produce: noisy, code-mixed, multi-speaker, dialectally diverse. On data sovereignty: Sarvam processes everything within India. For companies regulated by RBI, IRDAI, or SEBI, that is not a feature, it is a requirement. Full pricing comparison on the pricing page.

Enterprise-ready. Data stays in India.

Compliance, control, and data sovereignty. Not bolted on. Built in from day one.

No training on your data

Your API inputs are never used for model training. Zero data retention after processing unless you explicitly request it.

  • Data deleted after processing by default
  • Opt-in retention with configurable TTL
  • Separate data and model training pipelines
  • Full DPDP compliance

Deploy on your terms

All processing happens within India. No cross-border transfers. For regulated workloads, we support VPC and on-premise deployment.

  • India-only data processing
  • VPC and on-premise options
  • Consent-based voice cloning
  • Content safety filters built in

Security and governance

Every API call is logged and traceable. Role-based access, audit trails, and data residency controls built into the platform.

  • SOC 2 Type II
  • ISO 27001
  • DPDP compliant
  • Role-based access
  • Full audit trail
  • Data residency controls

Indian enterprises in banking, insurance, telecom, and government operate under strict regulatory requirements. RBI mandates data localization for financial data. IRDAI requires audit trails for customer communications. The DPDP Act governs how personal data is processed. Sarvam is built to meet these requirements: SOC 2 Type II and ISO 27001 certified, all data processed within India, no cross-border transfers, no training on customer data, and full audit-ready logging for every API call.

Speech to text: frequently asked questions
