Question 1

How do I get started with the speech to text API?

Accepted Answer

Install the Sarvam SDK (pip install sarvamai for Python, npm install sarvamai for Node.js), get an API key from dashboard.sarvam.ai, and send an audio file to the speech-to-text endpoint with a language code and model name. You get back a transcript, word-level timestamps, and optional speaker labels.
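One way to keep the key from dashboard.sarvam.ai out of your source code is to read it from an environment variable before creating a client. This is a minimal sketch: the variable name SARVAM_API_KEY and the helper load_api_key are assumptions made for illustration, not something the SDK is known to prescribe.

```python
import os


def load_api_key(env=None, var_name="SARVAM_API_KEY"):
    """Return the API key from the environment, or raise a clear error.

    SARVAM_API_KEY is a hypothetical variable name chosen for this sketch.
    """
    env = os.environ if env is None else env
    key = env.get(var_name, "").strip()
    if not key:
        raise RuntimeError(
            f"Set {var_name} to the API key from dashboard.sarvam.ai"
        )
    return key
```

Failing early with a named variable tends to be easier to debug than an authentication error returned by the first API call.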

Question 2

How do I transcribe audio in Python with the Sarvam SDK?

Accepted Answer

Import SarvamAI, create a client with your API key, then call client.speech_to_text.transcribe() with file_path, language (e.g. "hi-IN"), and model ("saaras:v3"). The response includes the full transcript and a words array with start time and text for each token.
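The call described above might look roughly like the sketch below. The keyword names file_path, language, and model are copied from this answer and have not been checked against the SDK reference. The constructor keyword api_subscription_key and the dict shape of each words entry ("start", "text") are also assumptions, so confirm them against the official docs before relying on this.

```python
def transcribe_file(api_key, file_path, language="hi-IN", model="saaras:v3"):
    """Send one audio file for transcription and return the SDK response.

    Keyword names follow the answer above; api_subscription_key is an
    assumed constructor keyword.
    """
    # Imported lazily so the helper below works without the SDK installed.
    from sarvamai import SarvamAI

    client = SarvamAI(api_subscription_key=api_key)
    return client.speech_to_text.transcribe(
        file_path=file_path,
        language=language,
        model=model,
    )


def format_words(words):
    """Render a words array as 'start  text' lines.

    Assumes each entry is a dict with 'start' (seconds) and 'text' keys.
    """
    return [f"{w['start']:7.2f}s  {w['text']}" for w in words]
```

format_words only touches plain dicts, so it can be tried on a hand-written sample before any audio is sent.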

Question 3

Which speech recognition API works best for Indian languages?

Accepted Answer

Saaras v3 is built specifically for Indian audio. It covers all 22 scheduled languages plus English in a single unified model, handles code-switching between languages mid-sentence, and is trained on over 1 million hours of real-world Indian audio spanning telephony, field recordings, and studio-quality input.

Question 4

Does the API support speaker diarization?

Accepted Answer

Yes. Enable speaker diarization in your API call to get stable speaker labels across conversational turns. Each word in the response is tagged with a speaker ID, so you can reconstruct who said what. This works in both batch and streaming modes.
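Reconstructing who said what from per-word speaker IDs can be sketched as follows. The field names "speaker" and "text" are hypothetical; the actual response schema may name them differently.

```python
def group_turns(words):
    """Merge consecutive words from the same speaker into (speaker, text) turns."""
    turns = []
    for w in words:
        if turns and turns[-1][0] == w["speaker"]:
            # Same speaker as the previous word: extend the current turn.
            turns[-1][1].append(w["text"])
        else:
            # Speaker changed (or first word): start a new turn.
            turns.append((w["speaker"], [w["text"]]))
    return [(speaker, " ".join(tokens)) for speaker, tokens in turns]
```

Each time the speaker ID changes a new turn starts, so a speaker who talks twice appears as two separate turns in order.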

Question 5

What is automatic speech recognition and how accurate is Saaras?

Accepted Answer

Automatic speech recognition (ASR) converts spoken audio into text. Saaras v3 is optimized for Indian languages and accents, trained on diverse acoustic conditions including noisy telephony, ambient sound, and crosstalk. It offers three streaming modes (Fast, Balanced, Accurate) so you can pick the right tradeoff between latency and accuracy for your use case.

Question 6

How does the voice recognition API handle noisy audio?

Accepted Answer

Saaras v3 is trained on real-world audio with background noise, crosstalk, poor connections, and narrowband telephony (8 kHz sample rate). It maintains accuracy across challenging acoustic conditions without requiring pre-processing or noise cancellation on your end.

Question 7

Is there a free tier for the speech to text API?

Accepted Answer

Yes. New accounts get free credits on signup to test transcription across languages and audio types. Once the free credits are used, pricing is usage-based per minute of audio. Volume discounts are available for high-throughput workloads.

Question 8

What audio formats does the STT API accept?

Accepted Answer

The API accepts WAV, MP3, FLAC, OGG, WebM, and raw PCM audio. Supported sample rates range from 8 kHz (telephony) to 48 kHz. Audio is automatically resampled when needed, so you can send any of these formats as your pipeline produces them.
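A local pre-flight check against the formats and sample rates listed above could look like the sketch below. It uses only the Python standard library. The mapping from formats to file extensions (in particular ".pcm" for raw PCM) is an assumption, and the sample rate is only inspected for WAV files because the stdlib wave module cannot read the other containers.

```python
import wave
from pathlib import Path

# Extensions assumed to correspond to the formats named in the answer.
ACCEPTED_EXTENSIONS = {".wav", ".mp3", ".flac", ".ogg", ".webm", ".pcm"}
MIN_RATE_HZ, MAX_RATE_HZ = 8_000, 48_000


def check_audio(path):
    """Return a list of problems found before upload; empty means none found."""
    path = Path(path)
    problems = []
    ext = path.suffix.lower()
    if ext not in ACCEPTED_EXTENSIONS:
        problems.append(f"unsupported extension: {ext or '(none)'}")
    elif ext == ".wav":
        # The stdlib wave module reads PCM WAV headers only.
        with wave.open(str(path), "rb") as wav:
            rate = wav.getframerate()
        if not MIN_RATE_HZ <= rate <= MAX_RATE_HZ:
            problems.append(f"sample rate {rate} Hz outside 8-48 kHz")
    return problems
```

An empty list only means this check found nothing wrong; the API remains the authority on what it accepts.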

Sarvam Speech to Text API

Production-grade automatic speech recognition

Streaming-first architecture

Code-switching & noise robust

Drop-in SDKs

23 Indian languages

Beyond raw transcripts

Powering real-world audio experiences

Seamless code-mixing

Telephony-optimized

Handle noisy audio

Developer-first platform

Battle-tested at scale

Works with your stack

Enterprise-ready. Data stays in India.

No training on your data

Deploy on your terms

Security and governance

Simple, transparent pricing
