How does speech to text work for Indian languages?

Indian language STT needs models trained on the phonetics and speaking patterns specific to languages like Hindi, Tamil, Telugu, and Bengali. Sarvam's Saaras v3 is trained from scratch on over 1 million hours of real Indian audio, not fine-tuned from an English model. It handles retroflex consonants in Hindi, agglutinative grammar in Tamil, tonal distinctions in Punjabi, and nasalized vowels in Bengali correctly. It also handles code-mixed speech (Hinglish, Tanglish) natively, because that is how most Indians actually speak.

Which Indian languages does Sarvam STT support?

Sarvam STT supports all 22 scheduled Indian languages: Hindi, Bengali, Tamil, Telugu, Marathi, Gujarati, Kannada, Malayalam, Punjabi, Odia, Assamese, Urdu, Sanskrit, Nepali, Dogri, Bodo, Konkani, Maithili, Sindhi, Kashmiri, Manipuri, and Santali, plus English (Indian accent). All languages are handled by a single unified model with automatic language detection. You can explore language-specific details at /apis/speech-to-text/hindi, /apis/speech-to-text/tamil, and so on.

Is Sarvam speech to text free?

Yes, you can try Sarvam STT for free. New accounts receive free credits on signup to test transcription across languages and audio types. For production use, pricing is usage-based at Rs. 1.5 per minute of audio, with volume discounts available for high-throughput workloads. Enterprise plans include dedicated support, SLA guarantees, and on-premise deployment options. See the full pricing breakdown at /api-pricing.

How do I add speech to text to my Python app?

Install the Sarvam Python SDK with 'pip install sarvamai', then initialize the client with your API key. A basic speech-to-text transcription takes 5 lines of code: create the client, call client.speech_to_text.transcribe() with your audio file path, language code, and model name, then read the transcript from the response. The SDK supports both synchronous REST calls for short audio and WebSocket streaming for real-time applications. Full documentation and code examples are available at docs.sarvam.ai.

What is the best speech to text API for Hindi?

For Hindi, Sarvam's Saaras v3 outperforms Google Cloud Speech-to-Text, AWS Transcribe, and Azure Speech in accuracy tests on real-world Indian audio. The gap is clearest in code-switching (Hinglish works without accuracy drops), noisy audio (trained on real telephony and field recordings, not clean studio audio), and speaker diarization (correct speaker attribution in multi-party conversations). Sarvam also processes all data within India, which matters for companies under RBI data localization rules.

Does Sarvam STT support speaker diarization?

Yes. The Batch API identifies different speakers in your audio and tags each word with a speaker ID, so you can reconstruct who said what. This is useful for meeting transcriptions, interview recordings, call center analytics, and anything else with multiple speakers. Diarization works across all 22 supported languages and handles overlapping speech.

How does Sarvam STT handle Hinglish and code-mixed speech?

Saaras v3 handles Hinglish (and Tanglish, Benglish, and other mixed-language patterns) natively. When a speaker says 'Meeting tomorrow hai, please agenda share kar do by evening', the model transcribes both the Hindi and English segments accurately in a single pass. It does not split processing between separate language engines or lose accuracy at language boundaries. This works because the model was trained on real Indian conversational audio where code-switching is the norm.

What audio formats does the STT API support?

The Sarvam STT API accepts 10+ audio formats: WAV, MP3, FLAC, OGG, Opus, AAC, M4A, AMR, WMA, and WebM. Supported sample rates range from 8kHz (telephony) up to 48kHz. Audio is automatically resampled when needed, so you can send whatever format your pipeline produces. For the Streaming API, WAV and raw PCM formats at 16kHz sample rate are supported.
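Although the API resamples automatically, a local pre-flight check can catch sample-rate surprises before upload. The sketch below uses only Python's standard wave module and covers WAV files only. The helper names are illustrative and are not part of any Sarvam SDK; the limits are the ones quoted above.

```python
import io
import wave

MIN_RATE_HZ = 8_000         # telephony, lower bound quoted above
MAX_RATE_HZ = 48_000        # upper bound quoted above
STREAMING_RATE_HZ = 16_000  # rate quoted above for the Streaming API


def wav_sample_rate(source):
    """Return the sample rate of a WAV file (a path or a binary file object)."""
    with wave.open(source, "rb") as wav:
        return wav.getframerate()


def check_rate(rate_hz, streaming=False):
    """True if the rate fits the ranges described above."""
    if streaming:
        return rate_hz == STREAMING_RATE_HZ
    return MIN_RATE_HZ <= rate_hz <= MAX_RATE_HZ


def make_silent_wav(rate_hz, seconds=1):
    """Build an in-memory mono 16-bit WAV of silence, used here as demo input."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wav.setframerate(rate_hz)
        wav.writeframes(b"\x00\x00" * rate_hz * seconds)
    buffer.seek(0)
    return buffer


rate = wav_sample_rate(make_silent_wav(8_000))
print(rate, check_rate(rate), check_rate(rate, streaming=True))
```

An 8kHz telephony file passes the general check but not the streaming one, so it would need resampling to 16kHz on your side before being sent over the Streaming API.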

What is the latency for real-time STT streaming?

Sarvam offers four streaming modes: Simulated Streaming (VAD + Saaras v3, available now via WebSocket), Realtime Accurate (lowest WER), Realtime Balanced (reduced latency with strong accuracy), and Realtime Fast (less than 150ms time-to-first-token). The Fast mode is designed for voice agents where every millisecond matters. The encoder is trained from scratch with causal attention on audio of varying chunk sizes, so a single model operates across all latency settings without needing separate variants. Partial transcripts arrive incrementally without disruptive rewrites or lag.

How does Sarvam compare to Google speech to text?

Google Cloud Speech-to-Text supports several Indian languages, but the models are adapted from Google's English-first architecture. Sarvam's Saaras v3 is trained specifically on Indian audio data. The practical differences: Sarvam handles code-mixed speech without accuracy drops, maintains performance on noisy 8kHz telephony audio, and provides more accurate transcription for Indian accents and regional dialects. Sarvam also keeps all data within India (important for RBI-regulated companies) and offers on-premise deployment for air-gapped environments.

What output formats does the STT API provide?

Sarvam STT provides four output modes: Transcription (native script with formatting and number normalization), Translation (Indic-to-English translation of the spoken content), Transliteration (Indian languages written in Roman/English letters), and Verbatim (preserves fillers, hesitations, and spoken-form numbers). All modes include word-level timestamps. Speaker diarization labels are available in Batch API responses.
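Word-level timestamps are what make subtitle generation straightforward. The sketch below turns a list of timed words into SRT cues. The word schema ("word", "start", "end", with times in seconds) and the sample timings are assumptions for illustration, not the documented response format; adapt the field names to what the API actually returns.

```python
def srt_timestamp(seconds):
    """Format seconds as HH:MM:SS,mmm, the timestamp format used by SRT."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def words_to_srt(words, max_words=7):
    """Group timed words into numbered SRT cues of at most max_words words."""
    cues = []
    for i in range(0, len(words), max_words):
        chunk = words[i:i + max_words]
        start = srt_timestamp(chunk[0]["start"])
        end = srt_timestamp(chunk[-1]["end"])
        text = " ".join(w["word"] for w in chunk)
        cues.append(f"{len(cues) + 1}\n{start} --> {end}\n{text}\n")
    return "\n".join(cues)


# Hypothetical word-level output (field names and timings are assumed).
sample_words = [
    {"word": "mera", "start": 0.00, "end": 0.30},
    {"word": "phone", "start": 0.30, "end": 0.65},
    {"word": "number", "start": 0.65, "end": 1.10},
    {"word": "hai", "start": 1.10, "end": 1.25},
]

print(words_to_srt(sample_words, max_words=2))
```

Splitting on a fixed word count keeps the example short. A production captioner would more likely break cues at pauses or punctuation.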

Can Sarvam STT run on-premise?

Yes. Sarvam offers managed cloud, single-tenant VPC deployment, and on-premise installations. For organizations with data residency, compliance, or air-gapped network requirements, on-premise deployment ensures your audio never leaves your infrastructure. The on-premise version includes the same Saaras v3 model with full feature parity. Contact the sales team at /contact for deployment options.

How accurate is Sarvam STT on noisy audio?

Saaras v3 is trained on real-world audio with background noise, crosstalk, poor connections, and low-bitrate telephony (8kHz). It maintains accuracy across challenging acoustic conditions without requiring pre-processing or noise cancellation on your end. This is particularly important for call center analytics, field recordings, and voice agent deployments where audio quality is unpredictable. The model handles ambient noise, overlapping speakers, and varying microphone quality.

How much does speech to text cost?

Sarvam STT pricing starts at Rs. 1.5 per minute of audio on the standard plan. Free credits are included on signup for evaluation. Enterprise plans include volume discounts, dedicated support, SLA guarantees, and on-premise deployment options. For comparison, Google Cloud Speech-to-Text charges $0.006-$0.009 per 15 seconds (approximately $0.024-$0.036 per minute) for Indian languages. Full pricing details are at /api-pricing.
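The arithmetic behind these figures can be sketched in a few lines. The rates below are the ones quoted above; the function names are illustrative, and no currency conversion is attempted because no exchange rate is given here.

```python
SARVAM_INR_PER_MIN = 1.5             # standard plan rate quoted above
GOOGLE_USD_PER_15S = (0.006, 0.009)  # low/high range quoted above


def sarvam_cost_inr(minutes):
    """Standard-plan cost in rupees, before any volume discount."""
    return minutes * SARVAM_INR_PER_MIN


def google_cost_usd(minutes):
    """(low, high) cost in US dollars; a minute is four 15-second blocks."""
    low, high = GOOGLE_USD_PER_15S
    blocks = minutes * 4
    return (low * blocks, high * blocks)


minutes = 10_000
low, high = google_cost_usd(minutes)
print(f"Sarvam: Rs. {sarvam_cost_inr(minutes):,.0f}")
print(f"Google: ${low:,.0f} to ${high:,.0f}")
```

Free signup credits and enterprise volume discounts are not modeled, so treat the result as an upper bound on the standard plan.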

How do I convert speech to text online for free?

You can convert speech to text online for free using the playground on this page. Click 'Try Free' above, record audio from your microphone or upload an audio file, select your language, and get an instant transcript. No account or credit card is required for the online converter. For API access with programmatic integration, sign up at dashboard.sarvam.ai to get free credits.

What is the best voice to text converter for Hindi?

Sarvam's Saaras v3 is the most accurate voice to text converter for Hindi, trained specifically on Indian audio data including Hinglish code-switching. Unlike Google's or Apple's built-in voice typing, which applies English phonetic rules to Hindi, Sarvam understands Devanagari script natively, handles Hindi-English mixing mid-sentence, and correctly transcribes Indian names, addresses, and abbreviations. Try it free on the playground above.

Can I use Sarvam as an audio to text converter for long recordings?

Yes. Sarvam's Batch API handles audio files up to 1 hour in length, making it suitable as an audio to text converter for meetings, interviews, podcasts, lectures, and call recordings. The batch endpoint includes speaker diarization, word-level timestamps, and supports all major audio formats (MP3, WAV, FLAC, OGG, M4A, and more). For files longer than 1 hour, split the audio and process segments in parallel.
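For recordings over the 1-hour limit, the first step is deciding where to cut. This minimal sketch computes segment boundaries in seconds; the function name is illustrative, and the actual cutting and uploading are left to whatever audio tooling you already use.

```python
MAX_SEGMENT_S = 60 * 60  # 1-hour Batch API limit stated above


def segment_bounds(total_seconds, max_segment=MAX_SEGMENT_S):
    """Return (start, end) offsets in seconds that cover the whole recording."""
    if total_seconds <= 0:
        return []
    bounds = []
    start = 0
    while start < total_seconds:
        end = min(start + max_segment, total_seconds)
        bounds.append((start, end))
        start = end
    return bounds


# A 2 h 30 min recording (9000 s) becomes three segments.
print(segment_bounds(9000))
```

Fixed offsets can cut through a word, so cutting at a nearby pause is usually the safer choice. It is also reasonable to assume that speaker labels will not line up automatically across separately processed segments.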

Does Sarvam support voice dictation in Indian languages?

Yes. The streaming API enables real-time voice dictation in all 22 supported Indian languages. Speak in Hindi, Tamil, Telugu, or any supported language and see text appear as you talk, with sub-250ms latency. Punctuation is added automatically. It works well for voice typing in messaging apps, note-taking, form filling, and anywhere users prefer speaking over typing.

How do I convert speech to text in Python?

Install the Sarvam SDK with 'pip install sarvamai'. Create a client with SarvamAI(api_subscription_key='YOUR_KEY'). Call client.speech_to_text.transcribe(file_path='audio.wav', language='hi-IN', model='saaras:v3'). The response contains response.transcript (full text) and response.words (word-level timestamps). The entire integration takes under 5 minutes. See docs.sarvam.ai for streaming and batch API examples.

Speech to Text for Indian Languages

What is speech to text?

Speech to text (STT), also called automatic speech recognition (ASR), converts spoken audio into written text using AI. You give it an audio file or a live microphone stream, and it gives you a transcript. Modern systems handle accents, background noise, and multiple speakers well enough that manual transcription is mostly unnecessary. The technology is behind voice agents, call center analytics, subtitles, meeting notes, and accessibility tools.

The most accurate voice to text converter for Hindi, Tamil, Telugu, Bengali, and 18 more Indian languages. Convert audio to text with real-time streaming, speaker diarization, and code-mixing support. Try the speech to text converter below or integrate via API.

Updated May 2026 · 15 min read

At a glance

• What: Speech to text (voice to text / audio to text) API for 22 Indian languages
• Model: Saaras v3, trained on 1M+ hours of real Indian audio with a 4-stage training pipeline
• Latency: <150ms time-to-first-token (Fast mode), 4 streaming modes via WebSocket
• Features: Speaker diarization, code-mixing (Hinglish/Tanglish), word-level timestamps, 4 output modes
• Pricing: Free credits on signup, then Rs. 1.5/min. See /api-pricing
• Try it: Free online playground, no signup needed

What is speech to text?

Speech to text (STT), also called voice to text, audio to text, or automatic speech recognition (ASR), converts spoken audio into written text using AI. Feed it a phone call, a meeting recording, a voice note, or a live broadcast, and it produces a transcript in real time or after the fact. Modern voice recognition systems use deep neural networks and can handle accents, background noise, multiple speakers, and mid-sentence language switching.

Indian languages are where English-centric models fall apart. Hindi speakers routinely switch to English and back within a single sentence. Tamil is agglutinative, meaning a single word can encode what English needs a full phrase for, which makes word boundary detection much harder. Telugu and Kannada share Dravidian roots but have distinct phonetic systems. And Indian telephony audio is often 8kHz with heavy background noise, nothing like the clean studio recordings global ASR models are optimized for. A voice to text converter that treats these as edge cases will not work in production.

Sarvam started from Indian audio, not English. Saaras v3, the model behind Sarvam's speech to text converter, was trained on over 1 million hours of real Indian speech through a 4-stage pipeline: large-scale pre-training, supervised fine-tuning, reinforcement learning, and post-training for long-tail errors. It achieves 19.31% WER on the IndicVoices benchmark across 10 languages (down from 22% in v2), and outperforms GPT-4o Transcribe, Gemini 3 Pro, Deepgram Nova3, and Scribe v2 on Indian language accuracy. For Indian-accented English, Sarvam validated on the Svarah benchmark (9.6 hours, 117 speakers across 65 districts in 19 Indian states). Whether you need speech to text for Hindi, speech to text for Tamil, or any of the other 20 supported languages, Saaras v3 is the best-performing option on these benchmarks.
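For readers unfamiliar with the metric, word error rate (WER) is the word-level edit distance between a reference transcript and the model output, divided by the number of reference words. The sketch below is a plain textbook implementation, not Sarvam's evaluation code. Published benchmark numbers typically also involve text normalization, which is omitted here.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


reference = "mera order ka status kya hai"
hypothesis = "mera order status kya hain"
print(f"WER: {word_error_rate(reference, hypothesis):.2%}")
```

Here one word is dropped ("ka") and one is substituted ("hai" becomes "hain"), so two errors over six reference words give a WER of about 33%.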

Why teams choose Sarvam for speech to text

Streaming-first architecture

Sub-150ms time to first token. Configurable Accurate, Balanced, and Fast modes for every latency requirement.

Code-switching & noise robust

Trained on 1M+ hours of real-world audio. Handles code-mixed speech, noisy telephony, and diverse accents.

Drop-in SDKs

Go live in under 10 minutes with official Python and Node.js SDKs. Pipecat & LiveKit ready.

23 languages

All 22 scheduled languages plus English. Unified multilingual model with automatic language detection.

Beyond raw transcripts

Speaker diarization, word-level timestamps, output format control, and automatic language detection built in.

How Sarvam speech to text works

Language detection and code-switching

Saaras v3 automatically detects the spoken language without requiring you to specify it upfront. When a speaker switches between Hindi and English mid-sentence, or drops Tamil words into an otherwise English conversation, the model follows along without losing accuracy at language boundaries. This is not a pipeline that detects language segments and routes them to separate engines. It is a single unified model that understands multilingual Indian speech as a continuous stream.

An estimated 350 million Indians speak English as a second language, and the vast majority mix it freely with their primary language. A customer might say "Mera order ka status kya hai, I placed it last Tuesday." A doctor might dictate "Patient ko high fever hai, CBC and LFT karwao." This is normal Indian communication, and Saaras v3 transcribes it accurately because it was trained on exactly this kind of speech.

Noise robustness

Most global ASR models are trained on clean, studio-quality audio. Real Indian audio is not clean. Call center recordings come in at 8kHz with compression artifacts. Field interviews have traffic noise. Group discussions have overlapping speakers. Mobile calls have network dropouts. Saaras v3 is trained on exactly this kind of audio, so it holds up where other models degrade.

No pre-processing or noise cancellation is needed on your end. The robustness comes from a 4-stage training pipeline: pre-training on diverse acoustic conditions, supervised fine-tuning, reinforcement learning, and a post-training stage that specifically targets long-tail errors and output stability. For voice agent deployments where callers are speaking from busy streets, factory floors, or crowded homes, this is the difference between a system that works in production and one that only works in demos.

Speaker diarization

Speaker diarization answers "who said what?" in a multi-speaker recording. The Batch API identifies speakers and tags each word with a stable speaker ID. You need this for meeting transcriptions (attributing action items), call center analytics (separating agent and customer utterances), and interview recordings (distinguishing interviewer from interviewee).

Diarization works across all 22 supported languages and handles overlapping speech, speaker turns of varying length, and conversations where speakers have similar voice characteristics. The speaker labels are consistent across the entire recording, so Speaker 1 at minute 2 is the same person as Speaker 1 at minute 45.
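Once each word carries a speaker ID, rebuilding "who said what" is a matter of merging consecutive words from the same speaker into turns. The sketch below shows one way to do that. The word schema ("word", "speaker", "start", "end") and the label format are assumptions for illustration, not the documented Batch API response.

```python
from itertools import groupby

# Hypothetical word-level output; field names and speaker labels are assumed.
words = [
    {"word": "Hello", "speaker": "SPEAKER_1", "start": 0.0, "end": 0.4},
    {"word": "there", "speaker": "SPEAKER_1", "start": 0.4, "end": 0.8},
    {"word": "Hi", "speaker": "SPEAKER_2", "start": 1.0, "end": 1.2},
    {"word": "again", "speaker": "SPEAKER_1", "start": 1.5, "end": 1.9},
]


def speaker_turns(words):
    """Merge consecutive words from the same speaker into turns."""
    turns = []
    for speaker, group in groupby(words, key=lambda w: w["speaker"]):
        group = list(group)
        turns.append({
            "speaker": speaker,
            "start": group[0]["start"],
            "end": group[-1]["end"],
            "text": " ".join(w["word"] for w in group),
        })
    return turns


for turn in speaker_turns(words):
    print(f'{turn["start"]:.1f}-{turn["end"]:.1f} {turn["speaker"]}: {turn["text"]}')
```

Because groupby only merges adjacent items, a speaker who returns later in the conversation gets a new turn, which matches how a meeting or call transcript is usually laid out. Overlapping speech would need extra handling that this sketch leaves out.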

Output flexibility

Sarvam STT provides four output modes to match different use cases. All modes include word-level timestamps by default.

Transcribe

With formatting and number normalization.

मेरा फोन नंबर है 9840950950

Translate

From Indic languages to English.

My phone number is 9840950950

Transliteration

Indian languages written in English letters.

mera phone number hai 9840950950

Verbatim

Preserves fillers and spoken numbers.

मेरा फोन नंबर है नौ आठ चार zero नौ पांच zero नौ पांच zero

Voice dictation and typing

Voice dictation (speaking instead of typing) is how a lot of people prefer to interact with devices, especially on mobile. Sarvam's speech recognition powers voice typing in Indian languages: dictate messages, emails, notes, and documents in Hindi, Tamil, Telugu, or any of the 22 supported languages. Unlike generic voice dictation that forces you into English, Sarvam outputs correct Devanagari (or whatever script you are dictating in), adds punctuation automatically, and keeps up with natural speaking speed. If you type faster by talking, or find keyboards difficult, voice dictation closes that gap.

Multi-speaker conversation, transcribed

Play the audio below and watch the transcript highlight in real time. Each speaker gets a distinct color.

Panel Discussion

Every word transcribed. Every speaker identified.

0:00 - 0:14

Speaker 1

એ રીતે તમે આગળ વધો પોતાની ધર્મની માન્યતાઓને ને પોતાના અત્યાર સુધી તમે પ્રેક્ટિસ કરતા આવ્યા છો એને છોડીને તમે એક જ નિયમ પ્રમાણે ચાલો તો આ યોગ્ય નહીં હોય કારણ કે આપણે જ્યારે દેશ બન્યો ત્યારે દરેકને આશ્વાસન આપ્યું હતું

0:14 - 0:27

Speaker 1

કે એમને જેવી રીતે દેશમાં શાંતિથી રહેવું હશે તો એમને છૂટ આપીશું પોતાના ધર્મ પ્રમાણે પોતાની જાતિ પ્રમાણે કે જે તેમના નિયમો હોય કોઈ કુટુંબના હોય કે જે હોય ગામના એ પ્રમાણે એ પ્રેક્ટિસ કરી શકશે

0:28 - 0:38

Speaker 1

તો આ હાલ ડિસ્કશનમાં ચાલે છે ને એના ફાયદા ગેરફાયદાનો આપણે ડિસ્કસ કરવાનો છે કે શું આવું કરવું જોઈએ હાલ સમય આવી ગયો છે કે બધાને ફોર્સફુલી

0:35 - 0:36

Speaker 2

શું આવું કરવું જોઈએ

0:39 - 0:58

Speaker 1

આપણે એક જ નિયમ બનાવી દઈએ ને બધાને કહીએ કે તમે આ જ નિયમ ફોલો કરો ચાહે તમે હિન્દુ હોય તો પણ તમે આ જ નિયમ ફોલો કરો તમારા ધર્મ ને છોડી દો મુસલમાન હતા તો એમના ધર્મના જે નિયમો હોય એ છોડો અને આને જ ફોલો કરો શું આ સમય આવી ગયો છે કે આપણે આવો કાનૂન બનાવવો જોઈએ જેવી રીતે અમેરિકામાં યુરોપમાં છે ત્યાં કોઈ ધર્મ ધર્મ નથી જોવામાં આવતું જ્યારે એ લોકો કોર્ટમાં જાય છે

0:59 - 1:04

Speaker 1

ત્યાં કોર્ટના દેશના કાનૂન પ્રમાણે કેસ લો કરવામાં આવે તો શું આવું ઇન્ડિયામાં કરવું જોઈએ આ એના વિશે છે

1:06 - 1:07

Speaker 2

એક દેશ એક કાયદો

1:10 - 1:14

Speaker 2

આપણે કહી શકીએ કે એક દેશ અને એક કાયદો બધાએ આ જ કાયદો માનવું જોઈએ

1:13 - 1:32

Speaker 1

કાયદો માનવો જોઈએ હા માનવો જોઈએ પછી તમે તમારો ધર્મ ગમે તે કહેતું હોય તમે ગમે તે માન્યતામાં માનતા હોવ તમને યોગ્ય લાગતું હોય પણ ફોર્સ કરવામાં આવશે ગવર્મેન્ટ તરફથી કે હવે તમારે આ જ માનવું પડશે તો આ યોગ્ય છે દરેક સાથે આવી રીતે ફોર્સફુલી એમનું ઇમ્પ્લિમેન્ટ કરવું હાલ શું આ સમય યોગ્ય છે લોકો એટલા મેચ્યોર છે

1:32 - 1:34

Speaker 1

આ હાલ ડિસ્કશન ચાલે છે

1:34 - 1:51

Speaker 2

મારા હિસાબથી આમાં જોવા જાય તો બંનેના નુકસાન છે અને બંનેના જ ફાયદા છે આપણે અગર આને ઇમ્પ્લિમેન્ટ કરીએ એક દેશ એક કાયદો તો આપણા ધાર્મિક સ્વતંત્રતામાં બી આપણે હસ્તક્ષેપ થાય છે આપણા સંસ્કૃતિને બી આપણે નુકસાન થાય છે

1:51 - 2:06

Speaker 2

અને આને અમલ કરવું મુશ્કેલ છે કારણ કે લોકો ધર્મના પ્રતિ ભારતમાં વધારે પડતા છે બધાને પોતાનો ધર્મ પહેલો છે એટલે આને લાગુ કરવું તો મુશ્કેલ છે જ અને લોકો તો

2:04 - 2:11

Speaker 3

અને લોકશાહી શાસનથી ભી ડિફરન્ટ થશે ને કે લોકો માટે લોકોથી અને લોકો વડે ચાલતું શાસન

2:11 - 2:35

Speaker 3

તો એના અગેન્સ્ટના પ્રોટોકોલમાં આપણે જતા રહીશું કારણ કે આ લોકોથી તો નહીં જ ચાલતું હોય ને કારણ કે આમાં તો એક એક તમે રૂલ્સ બનાવી દીધું એન્ડ એ બધાના ઉપર તમે થોપી દેવામાં આવે કે તમે લોકો કરો જ બરાબર શાયદ એમ એ લોકો માટે બેનિફિશિયલ હોય પણ ખરા ના બી હોઈ શકે તો એમના ધાર્મિક રિચ્યુઅલ્સ પ્રમાણે એ કનેક્ટ થતું પણ હોય કે ના પી થતું હોય

2:15 - 2:16

Speaker 2

અ

2:35 - 2:43

Speaker 3

તો આ લોકો માટે તો નહીં ચાલતું રહે તો આ લોકશાહીનો ભી ભંગ થાય એવું લાગી રહ્યું છે આમ આ આનાથી તો

2:44 - 2:59

Speaker 2

પણ આમાં ફાયદો બી એ છે કે આમાં કાયદા આમ સરળ થઈ જાય છે અને આમાં લોકો આમાં ગવર્મેન્ટ એવું વિચારે કે ધર્મના ઉપર નાગરિક નાગરિક તરીકે ઓળખ વધારે થાય અને આમાં સ્ત્રી પુરુષની સમાનતા બી થાય

3:00 - 3:08

Speaker 2

આપણને જેન્ડર ઇક્વાલિટી બી મળે આમાં સ્ત્રી પુરુષને બરાબર બી આવી શકે છે એ એક પોઇન્ટ છે જે મને ગમ્યો આમાં

3:07 - 3:07

Speaker 1

પોઈન્ટ છે

3:08 - 3:22

Speaker 1

મારે એવું માનવું છે કે આની અંદર બધા જે વિષયો છે એક સાથે લઈ લેવા કેમ કે જમીનનો ઇસ્યુ હોય અથવા તો શાદીનો હોય બીજા બધા જે ઇસ્યુઝના હોય એના કરતાં ગવર્મેન્ટે અમુક જે કોમન વિષયો હોય

3:22 - 3:34

Speaker 1

કે જેમાં કોઈ પ્રોબ્લેમ દરેકને વધારે થાય એવું નથી ધર્મની રીતે જેમ કે મારે કઈ પ્રોપર્ટી ખરીદવી છે એના માટેના નિયમો હોય કોઈ બિઝનેસના રિલેટેડ હોય અથવા બીજા કોઈ હોય તો પહેલાં શરૂઆત એમણે આનાથી કરવી જોઈએ

Speech to text use cases

Speech to text shows up everywhere: voice agents, compliance archives, YouTube captions, meeting notes. Whether you need a batch audio to text converter for recorded files or real-time voice to text for live interactions, here is how organizations across India are using Sarvam in production.

Voice agents

Real-time transcription for live voice agents and customer interactions.

Customer support

Sales & lead qualification

Edtech tutors

Social & companion bots

Post-call analytics

Analyze call transcripts to extract insights, sentiment, and actionable intelligence.

Sentiment analysis

Quality assurance & training

Compliance & record keeping

Performance metrics & insights

Subtitling & captioning

Generate accurate subtitles for video content across languages.

OTT platforms

YouTube creators

Corporate training

Live broadcasts

Beyond the use cases above, speech to text is essential for accessibility: real-time captions for hearing-impaired users, voice dictation for people with motor disabilities, and multimodal input for the hundreds of millions of Indians who are more comfortable speaking than typing. Sarvam supports real-time captioning in 22 Indian languages, making digital content accessible to audiences that global STT providers underserve.

Speech to text in 22 Indian languages

Sarvam's voice to text converter supports all 22 scheduled Indian languages plus English, covering over 99% of India's population. Whether you need speech to text in Hindi, speech to text in Tamil, voice to text in Telugu, or audio to text in Bengali, all languages are handled by a single unified model with automatic language detection and native code-mixing support. Click any language below to see transcription samples and API details.

ગુજરાતી · Gujarati · gu-IN

ಕನ್ನಡ · Kannada · kn-IN

മലയാളം · Malayalam · ml-IN

অসমীয়া · Assamese · as-IN

اردو · Urdu · ur-IN

संस्कृतम् · Sanskrit · sa-IN

नेपाली · Nepali · ne-IN

डोगरी · Dogri · doi-IN

बड़ो · Bodo · brx-IN

ਪੰਜਾਬੀ · Punjabi · pa-IN

ଓଡ଼ିଆ · Odia · or-IN

कोंकणी · Konkani · kok-IN

मैथिली · Maithili · mai-IN

سنڌي · Sindhi · sd-IN

कॉशुर · Kashmiri · ks-IN

মৈতৈলোন্ · Manipuri · mni-IN

ᱥᱟᱱᱛᱟᱲᱤ · Santali · sat-IN

How to convert speech to text with the API

Getting started

You can convert speech to text online using the playground above, or integrate via API for production use. Sign up at sarvam.ai/try/stt-api to get your API key. Install the Python SDK with pip install sarvamai or the Node.js SDK with npm install sarvamai. Your first audio to text conversion takes under 5 minutes from signup to working output.

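from sarvamai import SarvamAI

# Create the client with your API subscription key.
client = SarvamAI(
    api_subscription_key="YOUR_KEY"
)

# Transcribe a local Hindi audio file with Saaras v3.
response = client.speech_to_text.transcribe(
    file_path="audio.wav",
    language="hi-IN",
    model="saaras:v3"
)

# Print the full transcript text.
print(response.transcript)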

Sarvam offers three API variants. The REST API handles files under 30 seconds with synchronous processing. The Batch API handles files up to 1 hour with speaker diarization and word-level timestamps. The Streaming API provides real-time transcription via WebSocket for voice agents, live captions, and interactive applications. Full API reference, code examples, and integration guides are available at docs.sarvam.ai. Explore the developer hub for SDKs, tutorials, and community resources.
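The choice between the three variants can be written down as a small decision helper. This is a sketch based only on the limits described above; the function name and return labels are illustrative, not SDK identifiers.

```python
REST_MAX_S = 30        # REST: files under 30 seconds
BATCH_MAX_S = 60 * 60  # Batch: files up to 1 hour


def choose_api(duration_s=None, live=False, need_diarization=False):
    """Pick an API variant from the limits described above."""
    if live:
        return "streaming"  # real-time audio goes over WebSocket
    if duration_s is None:
        raise ValueError("duration_s is required for recorded audio")
    if duration_s > BATCH_MAX_S:
        return "split-then-batch"  # cut into segments of at most 1 hour first
    if duration_s < REST_MAX_S and not need_diarization:
        return "rest"
    return "batch"  # longer files, or any file that needs speaker labels


print(choose_api(12))                         # short clip
print(choose_api(12, need_diarization=True))  # short clip with speaker labels
print(choose_api(45 * 60))                    # 45-minute meeting
print(choose_api(live=True))                  # voice agent
```

Routing short clips that need diarization to the Batch API reflects the statement that speaker labels are available in Batch responses.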

Pricing: Rs. 1.5 per minute of audio on the standard plan. Free credits are included on signup for evaluation. Enterprise volume discounts are available. See full details at /api-pricing.

From audio to action, in one API

Move from speech to structured transcripts with a single integration.

Capture audio your way

Stream live from a microphone or upload a finished recording. The same API handles both with first token in milliseconds.

Transcribe across 22 Indian languages

One multilingual model, every scheduled language. Code-mixing, accents, and noisy audio handled without per-language routing.

Get more than a transcript

Speaker turns, word-level timestamps, and language tags arrive with every response. No add-ons, no extra calls.

Ship to production

Drop into your stack with official Python and Node SDKs, plus native Pipecat and LiveKit support. Most teams are live the same day.

Hear Indian speech to text accuracy

Sarvam STT handles the audio that breaks other models: code-mixed conversations, noisy telephony, and multi-speaker discussions. Listen to the samples below and see the transcription quality on real Indian audio.

Seamless code-mixing

Understands when speakers switch between Hindi, English, and regional languages mid-sentence.

HCG MCC Hospital ஒரு ground breaking achievement பண்ணிட்டாங்க.

Telephony-optimized

Handles real call center audio: 8kHz, background noise, multiple speakers.

नमस्कार डब्ल्यू सी बैंक में संपर्क करने के लिए आपका धन्यवाद

Handle noisy audio

Background noise, cross-talk, poor connections. Our models maintain accuracy even in challenging acoustic conditions.

अच्छा, मौजूदा सरकार का कामकाज आपको कैसा लगता है?

Speech to text performance

<250ms Median latency

100M+ Mins transcribed

>99.5% Uptime

Saaras v3 processes over 100 million minutes of audio monthly across production deployments in banking, telecom, healthcare, and government. The sub-250ms median latency enables real-time voice agents and live captioning. Enterprise SLAs guarantee 99.5% uptime with dedicated infrastructure for high-throughput workloads.

Speech to text: Sarvam vs Google, AWS, Azure

Feature	Sarvam	Google Cloud STT	AWS Transcribe	Azure Speech
Indian languages supported	22	8	1 (Hindi)	9
Code-switching (Hinglish, Tanglish)
Speaker diarization
Noisy telephony (8kHz) optimized
Data sovereignty (India)
Real-time streaming
On-premise deployment
Free tier	Free credits	60 min/mo	60 min/mo	5 hrs/mo

This comparison focuses on what matters for Indian language deployments. In direct benchmarks, Saaras v3 was evaluated against GPT-4o Transcribe, Gemini 3 Pro, Deepgram Nova3, and Scribe v2 on the IndicVoices dataset. The gap is wide on the 12 less-resourced Indian languages, where competitors either lack coverage entirely or produce heavily degraded transcriptions. Sarvam also uses a 5-metric evaluation framework (WER, LLM-WER, COMET, Intent Score, Entity Preservation) instead of relying on WER alone, because for Indian languages, two transcriptions with identical WER can have very different semantic accuracy.

The difference is architectural. Saaras v3 is not an English model adapted for India. It was built on Indian speech data from the start, covering all 22 scheduled languages in a single model. The result is accurate transcription on the audio Indian applications actually produce: noisy, code-mixed, multi-speaker, dialectally diverse. On data sovereignty: Sarvam processes everything within India. For companies regulated by RBI, IRDAI, or SEBI, that is a requirement, not a feature. A full pricing comparison is available at /api-pricing.

Enterprise-ready. Data stays in India.

Compliance, control, and data sovereignty. Not bolted on. Built in from day one.

No training on your data

Your API inputs are never used for model training. Zero data retention after processing unless you explicitly request it.

Data deleted after processing by default
Opt-in retention with configurable TTL
Separate data and model training pipelines
Full DPDP compliance

Deploy on your terms

All processing happens within India. No cross-border transfers. For regulated workloads, we support VPC and on-premise deployment.

India-only data processing
VPC and on-premise options
Consent-based voice cloning
Content safety filters built in

Security and governance

Every API call is logged and traceable. Role-based access, audit trails, and data residency controls built into the platform.

SOC 2 Type II · ISO 27001 · DPDP compliant · Role-based access · Full audit trail · Data residency controls

Indian enterprises in banking, insurance, telecom, and government operate under strict regulatory requirements. RBI mandates data localization for financial data. IRDAI requires audit trails for customer communications. DPDP Act governs how personal data is processed. Sarvam is built to meet these requirements: SOC 2 Type II and ISO 27001 certified, all data processed within India, no cross-border transfers, no training on customer data, and full audit-ready logging for every API call.

Speech to text: frequently asked questions

Convert speech to text for free

22 Indian languages, real-time voice to text, no signup needed

Try Free