Sarvam AI

Evaluating Indian Language ASR

A practical guide to layered Indic ASR evaluation: LLM-WER and LLM-CER, Intent and Entity scores, COMET, and open-source evaluation frameworks.


Introduction

Measuring how well a speech recognition system performs in Indian languages is harder than it looks. The standard metrics weren't built for them, and that mismatch quietly distorts how Indic ASR systems get evaluated.

Word Error Rate (WER), Character Error Rate (CER), and BLEU were developed primarily for English. They work well when every word has a single accepted spelling, when languages don't mix mid-sentence, and when the gap between formal and colloquial usage is narrow. Indian languages don't fit that description. Colloquial and formal registers coexist and are equally understood by speakers. English loanwords appear in both Indic and Latin script, sometimes within the same utterance. Numbers have multiple valid written forms. Applying these metrics without adjustment can make an Indic ASR system look significantly worse or better than it actually performs in practice.

This is a harder problem than it first appears. It isn't just that the metrics are imperfect. The deeper issue is that WER and CER penalize surface-level differences in character sequences without any understanding of whether two transcriptions mean the same thing. When a model correctly transcribes a spoken word but renders it in a different but equally valid script or spelling, the metric counts that as an error. The transcript is right. The score is wrong.

This blog describes a layered evaluation approach that addresses these gaps directly. It explains what WER and CER measure and where they break down, then introduces four LLM-based metrics - LLM-WER, LLM-CER, Intent Score, and Entity Preservation Score. Together, they give a more accurate and complete picture of how an Indic ASR system is actually performing. The blog also introduces two open-source evaluation frameworks you can drop into an existing pipeline.

Throughout, we draw on examples from Saaras V3, Sarvam's speech recognition API for 22 Indian languages. Saaras V3 supports five output modes - transcription, translation, verbatim output, transliteration, and code-mix - which makes it a useful concrete anchor for the broader evaluation principles discussed. The open-source frameworks below can be adapted to evaluate any Indic ASR system.


We don't think this approach is the final word on Indic ASR evaluation. The field is still developing, and the right set of metrics will likely evolve as the systems themselves do. But we believe this layered framework is meaningfully closer to what evaluation for Indian languages actually requires, and the tools exist today to start using it.

Working Example: Saaras V3 by Sarvam

This section covers Saaras V3's output modes and delivery options. The metric discussions that follow reference these modes directly, so it helps to have them in view before diving in. If you're evaluating a different ASR system, the same principles apply: substitute your system's equivalent modes and endpoints where relevant.

Output Modes

To make the differences between modes concrete, all five examples below are drawn from the same spoken Hindi sentence:

Input sentence (spoken Hindi): मुझे कल सुबह नौ बजे doctor के पास जाना है

Note the code-mixed English word 'doctor' and the spoken number 'नौ'.

| Mode | What it returns | Example output from the sentence above | Primary evaluation metric |
|---|---|---|---|
| Transcribe | Normalised transcript with numbers, punctuation, and formatting | मुझे कल सुबह 9 बजे डॉक्टर के पास जाना है। ('नौ' → '9' spoken number normalised to digit; 'doctor' → 'डॉक्टर' loanword transliterated to Devanagari; full stop added) | LLM-WER / LLM-CER |
| Translate | English translation of the spoken input | I need to go to the doctor tomorrow morning at 9. | Intent Score + Entity Score + COMET |
| Verbatim | Exact word-for-word output, no normalisation | मुझे कल सुबह नौ बजे डॉक्टर के पास जाना है | Standard WER (strict) |
| Translit | Indic script converted to the Latin alphabet | Mujhe kal subah nau baje doctor ke paas jaana hai. | LLM-CER |
| Codemix | Mixed output preserving Indic and English tokens | मुझे कल सुबह 9 बजे doctor के पास जाना है। ('नौ' → '9' number normalised; 'doctor' stays in Latin script, preserving the code-mixed nature of the speech) | LLM-WER + Entity Score |

API Delivery Options

Saaras V3 is available through three delivery methods. The right choice depends on your file length, latency requirements, and integration architecture.

| API | Best for | Limits | Response type | Latency |
|---|---|---|---|---|
| REST API | Single short files, webhook integrations, synchronous pipelines | Up to 30 seconds per file | Synchronous; result returned in the same HTTP response | 2-5 seconds |
| Batch API | Long recordings, bulk jobs, overnight pipelines, call centre archives | Up to 60 minutes per file; up to 20 files per request | Asynchronous; returns a job ID, poll for results | Minutes to hours depending on queue |
| WebSocket Streaming | Real-time voice assistants, live captions, interactive conversational bots | Continuous audio stream; no fixed file size limit | Real-time partial results as audio arrives | Sub-second to first word |

Metrics Overview

An overview of all five metrics covered in this blog, organised by category.

  • Traditional Metrics: String-matching, n-gram overlap, and learned similarity, designed primarily for English.
    • WER / CER (Section 3): Word/Character Error Rate. Edit-distance metric; fast, well-understood, and the standard benchmark baseline.
    • BLEU (Section 7.2): N-gram overlap for translation quality. Included as context; superseded by COMET for modern use.
    • COMET (Section 6): Neural translation metric trained on human judgements. Better than BLEU for measuring translation fluency.
  • Gen-AI Metrics: LLM-based metrics that measure meaning, not just character distance.
    • LLM-WER / LLM-CER (Section 5): WER/CER rescored by an LLM judge. Segments that are semantically or phonetically equivalent are no longer counted as errors.
    • Intent Score (Section 7): Binary score (0 or 1). An LLM judges whether the core meaning of the utterance is preserved.
    • Entity Preservation Score (Section 8): Float between 0 and 1. The fraction of named entities (names, places, numbers, dates) that appear correctly in the transcription.

Indian languages have multiple valid forms for the same spoken word, colloquial and formal registers, code-mixed English loanwords, numbers written in digits or spelled out. Traditional metrics treat all of these as errors. Section 4 shows six concrete failure classes in detail, and the Gen-AI metrics in Sections 5 - 8 are designed to handle each of them.

Metric 1: Standard WER and CER

WER and CER are the traditional baselines for ASR evaluation. They are well-understood, fast to compute and genuinely useful in specific circumstances.

Definitions

Word Error Rate (WER) measures the edit distance between the ASR output and a reference transcript at the word level, counting substitutions (S), deletions (D), and insertions (I), divided by the total reference word count (N):

WER = (S + D + I) / N
Word Error Rate (conceptual formula)

Character Error Rate (CER) applies the same formula at the character level. It is preferred over WER for agglutinative languages (Malayalam, Kannada, Telugu) where a single word token can be very long, making word-level edit distance disproportionately punishing.
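As a concrete reference point, the formula above can be implemented directly as a dynamic-programming edit distance. This is a minimal illustrative sketch; production pipelines typically use an established library such as jiwer instead.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: (S + D + I) / N via word-level edit distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i                      # deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j                      # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1   # substitution
            dp[i][j] = min(dp[i - 1][j] + 1,              # deletion
                           dp[i][j - 1] + 1,              # insertion
                           dp[i - 1][j - 1] + cost)
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

def cer(reference: str, hypothesis: str) -> float:
    """Same algorithm at the character level (spaces ignored)."""
    return wer(" ".join(reference.replace(" ", "")),
               " ".join(hypothesis.replace(" ", "")))
```

Note that the raw score is unbounded above: insertions alone can push WER past 100%, which is exactly the hallucination problem discussed later.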

When WER and CER Are Still Useful

  • mode="verbatim": when exact transcription is the requirement, standard WER is the correct strict measure
  • Benchmarking against published numbers on datasets like Vistaar or IndicVoices, which report raw WER
  • As a complementary baseline alongside LLM-based metrics; always report both

A note on evaluation strategy: For Indian languages, WER and CER work best as one layer in a multi-metric strategy, not as a standalone quality gate. Section 4 walks through the specific scenarios where LLM-based metrics give a more accurate picture.

Understanding the limits of Standard Metrics for Indian Languages

WER and CER were designed for English, which has a near one-to-one relationship between spoken words and written tokens, fixed spelling conventions, and no code-mixing in standard speech. Indian languages have more flexibility built in. That flexibility is a feature, not a problem. The six scenarios below illustrate where standard metrics misread a correct transcription as an error, and why LLM-based metrics give a fairer read.

Colloquial Variants: When 'Wrong' Is Right

Every Indian language has a formal written register and a colloquial spoken register. Native speakers use both interchangeably and understand both perfectly. WER treats any deviation from the reference form as an error, regardless of whether meaning is preserved.

Here is a concrete walkthrough for Tamil: how the reference, ASR output, metric verdicts, and real-world impact line up.

What the speaker said - A casual, colloquial Tamil sentence spoken naturally
Reference (formal) - அவர்கள் ஒன்றாக வேலை செய்கிறார்கள் (avargal: "they work together")
Example ASR output - அவுங்க ஒண்ணா வேலை செய்றாங்க (avunga: colloquial form, identical meaning)
Standard WER verdict - ❌ 4 out of 5 words flagged as errors. WER = 80%.
Reality - ✅ A native Tamil speaker hears this as a perfect transcription.
Business impact - If you set a WER threshold of 20% for your Tamil voice bot, you will reject a high-quality ASR output and may never ship.

Code-Mixing: The Script Mismatch Trap

Hundreds of millions of Indians code-mix naturally, switching between their native language and English mid-sentence. When the ASR and the reference annotator make different but equally valid choices about how to write an English word, WER registers it as an error.

Here is a concrete walkthrough for Hindi (code-mixed with English): how the reference, ASR output, metric verdicts, and real-world impact line up.

Reference - वह doctor के पास गया (English word kept in Latin script)
Example ASR output - वह डॉक्टर के पास गया (same word transliterated to Devanagari)
Standard WER verdict - ❌ Flags 'doctor' as a substitution. WER = 20% on this sentence.
Reality - ✅ Identical meaning. Both spellings are correct.
Business impact - A customer service bot for a bank handles code-mixed Hindi all day. Every loanword ('account', 'balance', 'transfer', 'nominee') is a potential false WER error. The model looks 15% worse than it is.

Short Helper Words: Exponential Penalties

Hindi, Bengali, and Marathi rely on short helper words, typically 2 to 3 characters. WER's division-by-word-count formula produces skewed scores when these words have minor deviations.

Here is a concrete walkthrough for Hindi: how the reference, ASR output, metric verdicts, and real-world impact line up.

Reference - नहीं ("no": 2 characters)
Example ASR output (echo repeat) - नहीं नहीं (word echoed once due to audio processing)
Standard WER verdict - ❌ WER = 300%. The model appears catastrophically wrong.
Second reference - है ("is": a 2-character helper word)
Example ASR output (diacritic drift) - हई (phonetically near-identical, minor diacritic shift)
Standard WER verdict - ❌ WER = 100% on this word; a complete failure for a single diacritic.
Reality - ✅ A Hindi speaker hears no meaningful difference in either case.

Agglutinative Languages: Suffix Substitutions

Malayalam, Kannada, and Telugu build long compound words by chaining morphemes. A minor suffix substitution, grammatically trivial, creates a large CER penalty because the entire token is affected.

Here is a concrete walkthrough for Malayalam: how the reference, ASR output, metric verdicts, and real-world impact line up.

Reference - വിദ്യാർത്ഥികളുമായി സംസാരിച്ചു ("spoke with the students"; suffix: ഉമായി)
Example ASR output - വിദ്യാർത്ഥികളൊടു സംസാരിച്ചു (same meaning, grammatically valid alternate suffix ഒടു)
Standard CER verdict - ❌ Multi-character penalty on a long token. CER = ~30% despite no meaningful difference.
Reality - ✅ Both forms are grammatically valid. Native speakers consider them equivalent.

Numeric Format Variability

The same number can appear in at least three valid written forms in an Indic language. WER treats all three as completely unrelated tokens.

Here is a concrete walkthrough for Hindi (numbers): how the reference, ASR output, metric verdicts, and real-world impact line up.

All three mean exactly "500" - पांच सौ (spoken) | 500 (Arabic numerals) | ५०० (Devanagari numerals)
Standard WER verdict - ❌ Treats all three as unrelated. If reference has '500' and output has 'पांच सौ', WER = 100% on this segment.
Dates follow the same pattern - '२५ जनवरी २०२४' vs '25-01-2024' vs '25 Jan 2024' are all correct, all penalised by WER when mixed.
LLM-WER verdict - ✅ Normalises numeric forms before scoring; All three treated as correct.

Semantic Drift Blindness: The Danger in Translate Mode

WER is also blind to single-word changes that completely reverse meaning. This is the most dangerous failure mode for Saaras in translate mode.

The following walkthrough uses a spoken Hindi utterance and shows how the ASR output, WER, and Intent Score can disagree when intent flips.

Reference - मैं कल स्कूल जाना चाहता हूं ("I want to go to school tomorrow")
Example ASR output - मैं कल स्कूल नहीं जाना चाहता हूं (one word inserted; intent completely reversed)
Standard WER verdict - ❌ Minor penalty (~14%). The translation appears almost correct.
Intent Score verdict - ✅ Score: 0 (FAIL). Correctly identifies this as a fundamental failure.

Key insight - WER and CER measure the distance between two strings. The LLM-based metrics in Sections 5-8 measure the distance between two meanings. For Indian languages, those two distances can be very different. Combining both perspectives gives the most accurate picture of ASR quality.

Metric 2: LLM-Based WER and CER

LLM-WER and LLM-CER address the six failure classes from Section 4 by replacing rigid string-matching with a semantic and phonetic judgement made by an LLM. Segments that are meaning-equivalent or phonetically similar are treated as correct. Only genuinely wrong segments count as errors.


Three-Step Methodology

  1. Identify differences: Run standard WER/CER alignment to find the exact segments where the ASR output and reference do not match.
  2. Consult the LLM: For each mismatched segment, the LLM is asked two questions: are these phrases semantically equivalent (do they carry the same meaning?), and are they phonetically similar (would they sound the same when spoken?). The LLM defaults to non-equivalence in any ambiguous case, confirming a match only with high certainty.
  3. Score intelligently: Segments the LLM judges as equivalent are treated as correct. The final LLM-WER/CER is computed only over genuinely erroneous segments.

Hallucination capping - When Saaras echoes a word (नहीं नहीं where the reference has नहीं), raw WER = 300%. LLM-WER caps each token's contribution at a maximum of 1.0, preventing a single hallucination from dominating your aggregate score.
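The align-judge-rescore loop can be sketched in a few lines of Python. This is an illustrative stand-in, not the llm_wer implementation: `judge_equivalent` is a hypothetical placeholder for the real LLM call, and `difflib` alignment approximates the WER alignment step.

```python
import difflib

def judge_equivalent(ref_seg: str, hyp_seg: str) -> bool:
    """Hypothetical stand-in for the LLM judge (temperature=0 in the real
    pipeline). Defaults to False - non-equivalence - for anything unknown."""
    known_pairs = {("डॉक्टर", "doctor"), ("500", "पांच सौ")}  # toy lookup table
    return (ref_seg, hyp_seg) in known_pairs or (hyp_seg, ref_seg) in known_pairs

def llm_wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    errors = 0.0
    for op, i1, i2, j1, j2 in difflib.SequenceMatcher(a=ref, b=hyp).get_opcodes():
        if op == "equal":
            continue
        ref_seg, hyp_seg = " ".join(ref[i1:i2]), " ".join(hyp[j1:j2])
        if ref_seg and hyp_seg and judge_equivalent(ref_seg, hyp_seg):
            continue  # judged equivalent: not counted as an error
        # Cap the segment's contribution at its reference-token count
        # (minimum 1), so a hallucinated repeat cannot add unbounded errors.
        errors += min(max(i2 - i1, j2 - j1), max(i2 - i1, 1))
    return min(errors / max(len(ref), 1), 1.0)
```

With this capping, a single echoed word contributes at most 1.0 to the error count, so the echo example scores 100% rather than 300%.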

Standard WER vs. LLM-WER: Side-by-Side

| Failure class | ASR output vs reference | Standard WER | LLM-WER |
|---|---|---|---|
| Colloquial Tamil | avunga vs avargal | ❌ Error | ✅ Equivalent |
| Code-mixed Hindi | डॉक्टर vs doctor | ❌ Script mismatch | ✅ Same word |
| Diacritic drift | हई vs है | ❌ 100% on word | ⚠️ Soft penalty |
| Number format | पांच सौ vs 500 | ❌ Mismatch | ✅ Normalised as equal |
| Hallucination repeat | नहीं नहीं vs नहीं | ❌ 300% | ✅ Capped at 1.0 |
| Malayalam suffix | ഉമായി vs ഒടു (same meaning) | ❌ High CER on long token | ✅ Semantic equivalence confirmed |

Installation and Usage

llm_wer

  • Input: a CSV/JSONL file with columns ground_truth, asr_output, language, and optionally context.
  • The framework uses Gemini via Vertex AI as the LLM judge.

See the repository README for full configuration and output format details.

Metric 3: COMET Score

COMET is a neural translation quality metric trained on human translation judgements. Unlike BLEU, which counts n-gram overlaps, COMET encodes the source sentence, the reference translation, and the hypothesis using a multilingual transformer, producing a learned similarity score that correlates far better with human judgement.

Use COMET as a secondary metric alongside Intent Score when evaluating ASR outputs in translate mode. It is particularly useful for ranking model versions continuously: higher COMET means better quality. Intent Score is better at catching catastrophic semantic failures.

The following compares a reference English translation (Can you come tomorrow?) with model output and metric scores.

Saaras V3 output - Are you available to come tomorrow? (Different phrasing, same meaning)
Standard BLEU - ❌ Low score: n-gram mismatch on 'are you available' vs 'can you'. Penalises a perfectly valid rephrasing.
COMET (wmt22-comet-da) - ✅ ~0.91: recognises semantic equivalence. Correct verdict.

Usage Guidelines

  • Recommended model: Unbabel/wmt22-comet-da (reference-based) or cometkiwi (reference-free, no ground truth needed).
  • Always record the exact COMET model version. Scores are not cross-model comparable.
  • For lower-resource Indic languages not well-represented in COMET training data, weight Intent Score more heavily than COMET.
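A sketch of scoring translate-mode output with the unbabel-comet package. The `make_comet_batch` helper is illustrative; the src/mt/ref triplet format and the `download_model`/`load_from_checkpoint` entry points follow the COMET library's documented usage, but verify them against the version you install.

```python
def make_comet_batch(sources, hypotheses, references):
    """Assemble COMET input triplets from parallel lists:
    src = source utterance, mt = model output, ref = reference translation."""
    assert len(sources) == len(hypotheses) == len(references)
    return [{"src": s, "mt": m, "ref": r}
            for s, m, r in zip(sources, hypotheses, references)]

if __name__ == "__main__":
    data = make_comet_batch(
        ["मुझे कल सुबह नौ बजे doctor के पास जाना है"],
        ["I need to go to the doctor tomorrow morning at 9."],
        ["I need to go to the doctor at 9 tomorrow morning."],
    )
    # Heavy model download; run only in a full evaluation environment:
    # from comet import download_model, load_from_checkpoint
    # model = load_from_checkpoint(download_model("Unbabel/wmt22-comet-da"))
    # print(model.predict(data, batch_size=8, gpus=0).system_score)
```

Recording the exact checkpoint name alongside every score (as the guidelines above require) is what keeps the numbers comparable across runs.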

Metric 4: Intent Score

Intent Score is the primary metric for evaluating Saaras V3 in translate mode. It answers the single most important question: did Saaras correctly capture what the speaker was trying to say?


Scoring: Binary (0 or 1)

Intent Score is a binary judgment made by an LLM judge.

| Score | Meaning | The LLM judge's criterion |
|---|---|---|
| 1 ✅ PASS | The core message is intact | Minor spelling variations, equivalent phrasing, and synonym use are all acceptable |
| 0 ❌ FAIL | The meaning has changed | Subject or object altered, action reversed, statement converted to a question, or key information added or removed |

Why BLEU Fails for translate mode: Four Concrete Failures

Failure 1: Valid paraphrasing is penalised

The following compares a reference (Can you come tomorrow?) with model output and metric scores.

  • Example ASR output - Are you available to come tomorrow?
  • Standard BLEU - ❌ Low score; different n-grams, different word choice. Yet meaning is perfectly preserved.
  • Intent Score - ✅ Score: 1 (PASS). Core message intact.

Failure 2: Code-mixed input has no clean reference

Code-mixed Hindi input with no single canonical translation - how BLEU variance contrasts with Intent Score.

  • Example ASR output - I won't go to the office tomorrow
  • BLEU verdict - ❌ Highly sensitive to reference wording. Score varies wildly depending on the reference annotator's phrasing.
  • Intent Score verdict - ✅ Score: 1 (PASS); captures the meaning accurately.

Failure 3: Semantic drift is invisible to BLEU

The following compares a reference (I want to go to school) with model output and metric scores.

  • Example ASR output - I do not want to go to school (one word inserted, intent completely reversed)
  • Standard BLEU - ❌ Minor penalty (~14%). The output appears almost correct.
  • Intent Score - ✅ Score: 0 (FAIL). Correctly identifies this as a fundamental failure.

Failure 4: Many valid translations exist for a single utterance

Translate-mode phrasing variance: reference vs ASR output, BLEU, and Intent Score.

  • Reference - I'm hungry, is there anything to eat?
  • Example ASR output - I'm feeling hungry, do you have any food?
  • BLEU verdict - ❌ Low score - 'anything to eat' ≠ 'any food' as n-grams.
  • Intent Score verdict - ✅ Score: 1 (PASS) - same meaning, correct evaluation.

Recommended threshold - Flag any utterance with Intent Score = 0 for manual review. For aggregate reporting, track the percentage of utterances scoring 1 across your evaluation set. This is your Intent Pass Rate.
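The pass-rate bookkeeping is straightforward; a minimal sketch (function names are illustrative, not part of the llm_intent_entity API):

```python
def intent_pass_rate(scores):
    """Fraction of utterances with Intent Score = 1.
    scores: list of 0/1 Intent Scores, one per utterance."""
    if not scores:
        return 0.0
    return sum(scores) / len(scores)

def flag_for_review(rows):
    """Return the IDs of utterances that failed the intent check.
    rows: list of (utterance_id, intent_score) pairs."""
    return [uid for uid, score in rows if score == 0]
```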

Metric 5: Entity Preservation Score

Intent Score tells you whether the general meaning is preserved. Entity Preservation Score tells you whether specific named entities are transcribed correctly. These are the pieces of information that carry the most weight in domain applications. A single wrong entity can be a catastrophic failure regardless of overall intent.


Source note: The scoring methodology below is based on the sarvamai/llm_intent_entity GitHub README. The LLM judges entity preservation using prompt_template.txt in that repository.

Why Entity Score Matters: Three Production Failures

Banking voice bot

  • Spoken: 'Transfer ₹5,000 to account 9876543210'
  • ASR output: 'Transfer ₹5,000 to account 9876543220'
  • Intent Score = 1 (PASS - intent preserved). Entity Score = 0.5 (FAIL - account number wrong). Transaction fails.

Navigation app

  • Spoken: 'वसंत विहार जाना है'
  • ASR output: 'I need to go to Vasant Kunj'
  • Intent Score = 1 (PASS - intent of navigation preserved). Entity Score = 0.0 (FAIL - wrong destination). User arrives at the wrong place.

Medical transcription

  • Spoken: 'Give 500mg of Metformin twice daily'
  • ASR output: 'Give 250mg of Metformin twice daily'
  • Intent Score = 1 (PASS). Entity Score = 0.5 (FAIL - dosage halved). Patient safety issue.

How the score is computed

Entity Preservation Score is a float between 0 and 1. It represents the fraction of key entities in the ground truth that were correctly preserved in the ASR output.

  • The LLM identifies key entities from the ground truth: person names, place names, organisation names, dates, times, numeric quantities, and object names.
  • It then checks each entity against the ASR output, penalising for missing entities (present in reference, absent in output) and substituted entities (present but wrong).
  • If the ground truth contains no entities, for example "hello, how are you?", the score is automatically 1.0.
| Reference | Saaras V3 Output | Entity Score | What failed |
|---|---|---|---|
| Call John at 3pm | Call John at 3pm | 1.0 ✅ | Nothing |
| Call John at 3pm | Call at 3pm | 0.5 ⚠️ | 'John' missing |
| Transfer ₹5,000 to acc 9876543210 | Transfer ₹5,000 to acc 9876543220 | 0.5 ⚠️ | Account number wrong |
| Hello, how are you? | Hello, how are you doing? | 1.0 ✅ | No entities; auto 1.0 |
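The scoring arithmetic can be sketched as follows. In the actual framework the LLM both extracts entities and judges matches (via prompt_template.txt); here the entity list is assumed pre-extracted, and exact substring matching stands in for the LLM's judgement.

```python
def entity_preservation_score(reference_entities, asr_output: str) -> float:
    """Fraction of reference entities preserved in the ASR output.
    Simplified sketch: exact substring matching replaces the LLM judge."""
    if not reference_entities:
        return 1.0  # no entities in the ground truth: automatic 1.0
    preserved = sum(1 for entity in reference_entities if entity in asr_output)
    return preserved / len(reference_entities)
```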

Intent Score + Entity Score Together

Neither metric alone gives the full picture. Use both.

| Scenario | Intent Score | Entity Score | Diagnosis |
|---|---|---|---|
| Fluent output, entity dropped | 1 (PASS) | 0.5 ❌ | Entity-level failure hidden behind passing intent |
| Negation flipped, entity present | 0 (FAIL) | 1.0 ✅ | Semantic failure: entity preserved but meaning reversed |
| Both correct | 1 (PASS) | 1.0 ✅ | Pass |
| Both failed | 0 (FAIL) | 0.0 ❌ | Critical failure: escalate for model review |

Recommended threshold - Entity Score ≥ 0.9 for general use cases. Entity Score ≥ 0.95 for banking, medical, or navigation applications where a wrong entity has direct downstream consequences.

Open-Source Evaluation Frameworks

Sarvam has open-sourced two repositories that implement the full evaluation stack described in this blog. Both are freely available on GitHub and work with any Indic ASR system, not only Saaras V3. Both use Gemini via Google Vertex AI as the LLM judge. The methodology is compatible with other capable multilingual LLMs.

llm_wer: LLM-Based WER and CER


  • Implements the three-step LLM-WER/CER methodology
  • Handles all six Indic language failure classes described in Section 4
  • Outputs per-utterance LLM-WER/CER scores with LLM explanation logs

Installation:

```bash
git clone https://github.com/sarvamai/llm_wer
cd llm_wer
uv venv --python 3.12 && source .venv/bin/activate
uv pip sync uv.lock && uv pip install -e .
```
Install and set up the llm_wer environment

llm_intent_entity - Intent Score and Entity Score


  • Implements binary Intent Score (0 or 1) per utterance
  • Implements Entity Preservation Score (0.0 to 1.0 float) per utterance
  • Outputs CSV with scores and LLM explanation text for every row
  • Hash-based caching layer to avoid re-evaluating identical inputs
  • Optional Google Sheets export for team-level reporting

Installation:

```bash
git clone https://github.com/sarvamai/llm_intent_entity
cd llm_intent_entity
uv venv --python 3.12 && source .venv/bin/activate
uv pip sync uv.lock && uv pip install -e .
```
Install llm_intent_entity

Input columns required: ground_truth, asr_output, language, audio_file, context (optional). Supported languages: Hindi, Bengali, Tamil, Telugu, Gujarati, Kannada, Malayalam, Marathi, Odia, Punjabi, English, and other Indic languages.

Running an evaluation:

```python
from llm_intent_entity import evaluate

results = evaluate(
    dataset_path="eval_set.csv",
    reference_col_name="ground_truth",
    predicted_col_name="asr_output",
    language_col_name="language",
    context_col_name="context",
    ignore_cache=False,
    gemini_location="us-central1"
)
```
Run evaluation with llm_intent_entity

Ensuring Reproducible Results

LLM-based evaluation is only useful if results are consistent across runs. The most common cause of variability is forgetting to control the LLM's randomness. Here is the complete checklist.

Set temperature = 0

The single most important reproducibility setting. Setting temperature=0 switches the LLM to greedy decoding - the highest-probability token is always selected, making output deterministic for the same input. This one setting eliminates the large majority of run-to-run variability in LLM-based evaluation.

Pin the model version

  • Specify the exact model version: e.g., gemini-1.5-pro-002, not gemini-1.5-pro-latest
  • Even with temperature=0, provider-side model weight updates can shift scores. Pinning ensures a stable comparison baseline.
  • Run a fixed 50-sample calibration set any time you change the judge model or version. If scores shift more than 1 percentage point, treat the new run as a new benchmark series.
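The calibration check above can be scripted in a few lines. This sketch assumes binary Intent Scores for the fixed calibration set under both judge configurations; the function name and 1-percentage-point default are illustrative.

```python
from statistics import mean

def calibration_drift(baseline_scores, new_scores, tolerance_pp: float = 1.0):
    """Compare mean scores (expressed in percentage points) of a fixed
    calibration set across two judge configurations. Returns the drift
    and whether it exceeds the tolerance, in which case the new run should
    be treated as a new benchmark series."""
    drift = abs(mean(new_scores) - mean(baseline_scores)) * 100
    return drift, drift > tolerance_pp
```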

Use seed parameters where available

  • OpenAI (GPT-4o): pass seed=42 alongside temperature=0
  • Gemini (Vertex AI): pass seed in generation_config when available
  • Anthropic (Claude): temperature=0 is sufficient

Version-control prompts and use structured output

  • Store all evaluation prompts in the same repository as your evaluation code
  • Record a hash of (prompt + model version) alongside every result set
  • Require structured JSON output from the LLM judge (e.g., {"pass": true/false, "reason": "..."}) to constrain the output format and reduce parsing variability
  • Include 2-3 labelled few-shot examples per language in the system prompt to anchor the model's judgment scale
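Both the prompt fingerprint and the structured-output parse fit in a short helper module. This is a stdlib-only sketch; the function names and 16-character hash truncation are illustrative choices, not part of any framework.

```python
import hashlib
import json

def prompt_fingerprint(prompt: str, model_version: str) -> str:
    """Hash of (prompt + model version), logged alongside every result set
    so score changes can be traced to prompt or judge changes."""
    payload = f"{model_version}\n{prompt}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:16]

def parse_judgement(raw: str):
    """Parse the judge's structured JSON output ({"pass": ..., "reason": ...}),
    rejecting malformed rows rather than guessing."""
    data = json.loads(raw)
    if not isinstance(data.get("pass"), bool) or "reason" not in data:
        raise ValueError(f"malformed judgement: {raw!r}")
    return data["pass"], data["reason"]
```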

Normalise text before evaluation

  • Strip leading and trailing whitespace
  • Apply Unicode NFC normalisation to handle diacritic encoding variants
  • Pick one canonical numeric form (Arabic digits or Indic numerals) and apply it consistently to both reference and hypothesis before scoring
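The standard library's `unicodedata` covers the first two items directly. The digit mapping below is a simplified assumption covering Devanagari only; other Indic scripts would need their own translation tables.

```python
import unicodedata

# Simplified canonical-digit rule: map Devanagari numerals to Arabic digits.
DEVANAGARI_DIGITS = str.maketrans("०१२३४५६७८९", "0123456789")

def normalise(text: str) -> str:
    """Apply the checklist: strip whitespace, NFC-normalise diacritic
    encodings, and map numerals to one canonical form. Run this on both
    the reference and the hypothesis before scoring."""
    text = text.strip()
    text = unicodedata.normalize("NFC", text)
    return text.translate(DEVANAGARI_DIGITS)
```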

Reproducibility checklist

  • Set temperature=0 on all LLM judge calls
  • Pin the LLM model version (not latest)
  • Set seed parameter where supported (e.g., seed=42 for OpenAI)
  • Version-control all evaluation prompts; log prompt hash with every result
  • Apply Unicode NFC normalisation to all text before evaluation
  • Use structured JSON output in judge prompts
  • Include few-shot examples (2-3 per language) in the system prompt
  • Run 50-sample calibration set when changing judge model or version

End-to-End Evaluation Workflow

The workflow below applies to any Indic ASR system. The table maps each output type to the right metric stack. The pipeline that follows walks through the full sequence, from data preparation to final reporting.

Metric Selection by ASR Output Type

| ASR output type / use case | Primary metric | Secondary metrics |
|---|---|---|
| Normalised transcription | LLM-WER | Standard WER (baseline), Entity Score |
| Verbatim / exact transcription | Standard WER (strict) | CER for Dravidian languages |
| Translation to English | Intent Score | COMET, Entity Score |
| Transliteration | LLM-CER | Standard CER (baseline) |
| Code-mixed output | LLM-WER | Entity Score |
| Domain-critical (banking, medical, navigation) | Entity Score | Intent Score, LLM-WER |
| Model regression testing | WER/CER delta | LLM-WER delta on adversarial set |

Step-by-Step Pipeline

  1. Prepare evaluation data: collect audio with human-verified reference transcriptions or translations. Apply Unicode NFC normalisation to all text.
  2. Run ASR inference: capture raw model output for the output type you are evaluating. Avoid post-processing before scoring.
  3. Compute baseline WER/CER: apply text normalisation, then compute standard WER/CER as your regression baseline.
  4. Run LLM-WER/CER: use the llm_wer framework with temperature=0 and a pinned model version.
  5. Run Intent Score and Entity Score: use the llm_intent_entity framework with the same LLM configuration.
  6. Run COMET (translation output only): use Unbabel/wmt22-comet-da.
  7. Aggregate and report: compute mean and median scores per language and per domain. Flag utterances with Intent Score = 0 or Entity Score below 0.8 for manual review.
  8. Run calibration check: re-score a fixed 50-sample calibration set. If any score deviates more than one percentage point from the baseline, investigate model or prompt drift before publishing results.
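The aggregation step can be sketched as follows. The row schema, key names, and function name are illustrative; the 0.8 review threshold for Entity Score follows the pipeline description above.

```python
from collections import defaultdict
from statistics import mean, median

def aggregate(rows):
    """rows: list of dicts with 'language', 'intent' (0/1), and
    'entity' (0.0-1.0) keys. Returns a per-language summary plus the
    indices of rows needing manual review."""
    by_lang = defaultdict(list)
    review = []
    for i, row in enumerate(rows):
        by_lang[row["language"]].append(row)
        if row["intent"] == 0 or row["entity"] < 0.8:
            review.append(i)   # flag for manual review
    summary = {
        lang: {
            "intent_pass_rate": mean(r["intent"] for r in rs),
            "entity_median": median(r["entity"] for r in rs),
        }
        for lang, rs in by_lang.items()
    }
    return summary, review
```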

Quick Reference: Metric Summary

| Metric | Best for | Key limitation | Threshold | LLM needed? |
|---|---|---|---|---|
| WER / CER | Verbatim mode; regression deltas | Penalises valid Indic variants | Track as delta | No |
| LLM-WER / LLM-CER | transcribe, translit, codemix | Adds cost and latency | < 15% general | Yes |
| COMET | translate - continuous ranking | Weaker for low-resource languages | 0.80 | No (neural) |
| Intent Score | translate - meaning capture | Doesn't catch entity errors | 100% pass rate | Yes |
| Entity Score | Banking, medical, navigation | Needs context column for domain entities | ≥ 0.90 | Yes |

Packages and Repositories

All packages and source repositories referenced in this blog, in one place.

| Metric | Category | Install | GitHub Repository |
|---|---|---|---|
| LLM-WER / LLM-CER | Gen-AI | pip install openai | llm_wer on GitHub |
| Intent Score | Gen-AI | pip install openai | llm_intent_entity on GitHub |
| Entity Score | Gen-AI | pip install openai | llm_intent_entity on GitHub |

Conclusion

Languages like Hindi and Tamil allow for a high degree of morphological variation. This flexibility makes them expressive, but it also introduces challenges for evaluation. Standard string matching approaches often fail to capture correctness in these settings. A transcription can be accurate in meaning and still produce a high WER. A translation can diverge semantically while continuing to score well on BLEU. In both cases, the issue is not that the metric is incorrect, but that it is measuring a different property than intended.

For this reason, no single metric is sufficient. WER and CER remain important because they are reproducible, interpretable, and comparable across systems. However, they primarily capture surface-level similarity. LLM-based evaluation methods are better suited to assessing semantic preservation, but they require careful grounding. Without a reference signal, these systems can drift over time. In practice, both types of evaluation are necessary, and each should be interpreted in the context of what it is designed to measure.

The choice of additional metrics should depend on the application. Intent Score is useful in settings where preserving meaning is critical, such as conversational systems and voice interfaces. Entity Score becomes more important when outputs are consumed by downstream systems, including databases, medical workflows, or financial processes. COMET is appropriate for evaluating translation quality specifically. These metrics are not interchangeable. Each captures a distinct class of failure.

There are also practical considerations in implementation. Running LLM-based evaluators at temperature 0 improves consistency. Pinning the evaluation model version is equally important. Without these controls, results may shift in ways that are difficult to detect and reproduce.

Both evaluation frameworks described here are open source, and the align-judge-rescore approach can be applied across Indic ASR systems. The objective is to establish a shared baseline. When a team reports a given WER on Hindi, that number should be interpretable in a consistent way across organizations. While this level of standardization is not yet common, the necessary tools are already available.

The five-metric checklist

  • WER / CER: always report as your baseline
  • LLM-WER / LLM-CER: the semantically-aware complement to standard WER
  • COMET: for translate mode evaluation
  • Intent Score: when meaning preservation is the quality gate
  • Entity Score: when names, numbers, and places must be exact


Curious what else we're building? Explore our APIs and start creating.