# Akshar: Intelligent Document Digitisation Platform

Akshar is Sarvam's document digitisation platform that reads, understands, and extracts knowledge from real-world documents -- scanned archives, handwritten notes, ancient scripts, complex tables, and dense Indic scripts. Built on the Sarvam Vision foundation model, Akshar functions as an intelligence layer that moves beyond passive text extraction to active reasoning, grounded understanding, and automated error correction.

## Why Existing Solutions Fall Short

### Legacy OCR Limitations

Classical OCR engines (Tesseract, EasyOCR, Google Cloud Vision) rely on character-level, bottom-up recognition without understanding page structure. This causes:

- Multi-column pages read linearly, producing garbled output
- Layout structure lost before extraction begins
- Indic scripts misinterpreted -- complex conjuncts and diacritics (matras) frequently misread
- No distinction between body text and marginalia

### Multimodal LLM Limitations

Modern vision-language models have improved document understanding but still suffer from:

- Low accuracy on complex documents (historical newspapers, publications with embedded charts)
- Probabilistic outputs lacking auditability
- Reliance on manual prompt tuning, which leads to factual inconsistencies
- No built-in error correction workflow

### Unsolved Last-Mile Problems

- Zero-shot parsing of complex layouts without training data
- Capturing semantic relationships between page elements
- Visual grounding (mapping extracted text to exact source locations)
- Automated error localisation and correction at scale, with human-in-the-loop review

## Core Capabilities

### Layout Understanding

Akshar understands semantic blocks -- headers, paragraphs, footnotes, tables, images -- rather than processing text line by line. Multi-column pages and mixed-format documents are parsed with structure preserved.

### Reading Order Detection

Intelligent reading order detection ensures content flows correctly even in complex layouts with sidebars, footnotes, captions, and multi-column arrangements.
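
To illustrate why reading order matters, the sketch below contrasts a layout-blind sort with a simple two-column ordering. This is a deliberately naive heuristic for illustration only, not how Akshar or Sarvam Vision determines reading order; the `TextBlock` type, the coordinates, and the fixed page midline are all assumptions.

```python
from typing import NamedTuple


class TextBlock(NamedTuple):
    """Hypothetical block: text plus the top-left corner of its box."""
    text: str
    x: float  # left edge, in page units
    y: float  # top edge, in page units


def linear_order(blocks):
    """Top-to-bottom, then left-to-right: what a layout-blind reader does."""
    return [b.text for b in sorted(blocks, key=lambda b: (b.y, b.x))]


def two_column_order(blocks, page_width):
    """Read the whole left column first, then the whole right column."""
    midline = page_width / 2
    left = sorted((b for b in blocks if b.x < midline), key=lambda b: b.y)
    right = sorted((b for b in blocks if b.x >= midline), key=lambda b: b.y)
    return [b.text for b in left + right]


# A two-column page: L1 and L2 form the left column, R1 and R2 the right.
page = [
    TextBlock("L1", x=50, y=100),
    TextBlock("R1", x=350, y=100),
    TextBlock("L2", x=50, y=200),
    TextBlock("R2", x=350, y=200),
]

print(linear_order(page))           # ['L1', 'R1', 'L2', 'R2'] (columns interleaved)
print(two_column_order(page, 600))  # ['L1', 'L2', 'R1', 'R2']
```

A fixed midline breaks down as soon as a page has sidebars, full-width headings, or captions, which is why those layouts call for learned reading order detection.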

### Text Extraction

Handles 22 Indian languages and English, including multilingual pages in a single pass, with accurate recognition of complex conjuncts, diacritics, and dense Indic scripts.

### Structured Output

Delivers output in three formats with layout and reading order preserved:

- **HTML** -- structured markup retaining document hierarchy
- **JSON** -- page-level structured data with block-level metadata
- **Markdown** -- clean text with formatting preserved

Output is delivered as a ZIP archive. A JSON file with structured page-level data is always included alongside the chosen format.
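
As a rough sketch of how a downstream script might consume such an archive, the example below separates the JSON data from the HTML or Markdown documents using only the Python standard library. The file names and the JSON shape used in the demo are assumptions; the actual archive layout should be checked against the API documentation.

```python
import io
import json
import zipfile


def read_output_archive(zip_bytes: bytes) -> dict:
    """Split an output archive into JSON data and HTML/Markdown documents.

    Entries are classified by file extension only; the names inside a real
    archive are not specified here and are treated as unknown.
    """
    result = {"json": {}, "documents": {}}
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in archive.namelist():
            lower = name.lower()
            if lower.endswith(".json"):
                result["json"][name] = json.loads(archive.read(name).decode("utf-8"))
            elif lower.endswith((".html", ".md")):
                result["documents"][name] = archive.read(name).decode("utf-8")
            # Anything else (for example images) is left untouched.
    return result


def build_demo_archive() -> bytes:
    """Build a small in-memory archive with made-up contents for the demo."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("output.md", "# शीर्षक\n\nपहला अनुच्छेद।\n")
        # Hypothetical JSON shape, for illustration only.
        archive.writestr("output.json", json.dumps({"pages": [{"page": 1, "blocks": []}]}))
    return buffer.getvalue()


parsed = read_output_archive(build_demo_archive())
print(sorted(parsed["json"]), sorted(parsed["documents"]))
```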

### Visual Grounding

Pinpoints exact coordinates of text and elements in documents. Each extracted block is mapped to its bounding box on the source page, enabling click-to-scroll navigation between source and output.
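
A minimal sketch of the lookup behind click-to-scroll navigation is shown below: given a click position on the source page, find the extracted block whose bounding box contains it. The `Block` fields and the pixel coordinate convention are assumptions, not the documented output schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Block:
    """Hypothetical extracted block with a box in page pixels."""
    block_id: str
    text: str
    x0: float  # left
    y0: float  # top
    x1: float  # right
    y1: float  # bottom

    def contains(self, x: float, y: float) -> bool:
        return self.x0 <= x <= self.x1 and self.y0 <= y <= self.y1

    @property
    def area(self) -> float:
        return (self.x1 - self.x0) * (self.y1 - self.y0)


def block_at(blocks: list[Block], x: float, y: float) -> Optional[Block]:
    """Return the smallest block containing the point, or None if no block does."""
    hits = [b for b in blocks if b.contains(x, y)]
    return min(hits, key=lambda b: b.area) if hits else None


blocks = [
    Block("table-1", "(whole table)", 40, 300, 560, 500),
    Block("cell-1-2", "1,200", 300, 340, 420, 370),
    Block("para-1", "Opening paragraph", 40, 100, 560, 180),
]

clicked = block_at(blocks, 350, 350)
print(clicked.block_id if clicked else None)  # cell-1-2
```

Preferring the smallest containing box means a click inside a table cell resolves to the cell rather than to the enclosing table.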

### Table Extraction

Intelligent table detection and conversion to structured formats. Handles complex tables with merged cells, multi-line entries, and nested headers common in Indian government and financial documents.
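
One common way to make merged cells usable downstream is to expand them into a regular grid, so every row has the same number of columns. The sketch below shows that step on a hypothetical cell representation (`row`, `col`, `rowspan`, `colspan`); it is an illustration of the idea, not Akshar's internal table format.

```python
from typing import NamedTuple


class Cell(NamedTuple):
    """Hypothetical table cell anchored at (row, col), possibly merged."""
    row: int
    col: int
    text: str
    rowspan: int = 1
    colspan: int = 1


def expand_to_grid(cells, n_rows, n_cols):
    """Copy each merged cell's text into every grid position it covers."""
    grid = [["" for _ in range(n_cols)] for _ in range(n_rows)]
    for cell in cells:
        for r in range(cell.row, cell.row + cell.rowspan):
            for c in range(cell.col, cell.col + cell.colspan):
                grid[r][c] = cell.text
    return grid


# A nested header: "Item" spans two rows, "Amount" spans two columns.
cells = [
    Cell(0, 0, "Item", rowspan=2),
    Cell(0, 1, "Amount", colspan=2),
    Cell(1, 1, "Debit"),
    Cell(1, 2, "Credit"),
    Cell(2, 0, "Rent"),
    Cell(2, 1, "1200"),
    Cell(2, 2, ""),
]

grid = expand_to_grid(cells, n_rows=3, n_cols=3)
for row in grid:
    print(row)
```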

## Review and Correction Workflow

Unlike one-shot OCR tools, Akshar provides a full review workflow before export:

### Agent-Driven Corrections

AI agents identify uncertainties and probable errors automatically. Agents propose reviewable suggestions that users can accept or reject, keeping human trust and auditability front and centre.
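
The accept-or-reject flow can be pictured as a small state machine with an audit trail. The sketch below is one possible shape for it; the class names, fields, and status values are hypothetical and do not reflect Akshar's actual data model.

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    PENDING = "pending"
    ACCEPTED = "accepted"
    REJECTED = "rejected"


@dataclass
class Suggestion:
    """Hypothetical agent proposal for a single extracted block."""
    block_id: str
    original: str
    proposed: str
    reason: str
    status: Status = Status.PENDING


@dataclass
class ReviewSession:
    blocks: dict  # block_id -> current text
    audit_log: list = field(default_factory=list)

    def resolve(self, suggestion: Suggestion, accept: bool, reviewer: str) -> None:
        """Apply or discard a suggestion and record who decided what."""
        if suggestion.status is not Status.PENDING:
            raise ValueError("suggestion already resolved")
        if accept:
            self.blocks[suggestion.block_id] = suggestion.proposed
        suggestion.status = Status.ACCEPTED if accept else Status.REJECTED
        self.audit_log.append((reviewer, suggestion.block_id, suggestion.status.value))


session = ReviewSession(blocks={"p1-b3": "कुल राशी"})
fix = Suggestion("p1-b3", original="कुल राशी", proposed="कुल राशि", reason="likely matra misread")
session.resolve(fix, accept=True, reviewer="editor-1")
print(session.blocks["p1-b3"], session.audit_log)
```

Text changes only when a reviewer accepts, and every decision is logged, which is the property that keeps the workflow auditable.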

### Visual Grounding Review

Click any extracted block to see its exact location on the source document. Verify extraction accuracy with side-by-side comparison.

### Manual Editing

Fix text, relabel blocks, and restructure layout with the source document displayed alongside. Interact with agents via chat to fix issues across pages or within specific blocks.

## Agent-Powered Pipeline

Akshar uses an agent-based workflow for end-to-end document processing:

1. **Start with extraction** -- Documents are processed through Sarvam Vision's harness modules and vision-language model (VLM). Agents identify and fix common errors in the baseline extraction.
2. **Apply instructions automatically** -- Instructions provided at upload time are executed autonomously across every page, consistently.
3. **Proofread with context** -- Issues can be reviewed and corrected across the document or within specific sections through a simple interface.
4. **Export structured output** -- Final output delivered in HTML, JSON, or Markdown with full layout preservation.

All agent actions are proposed as reviewable suggestions. Human oversight is built into the process, not bolted on.

## Language Support

23 languages with native script understanding:

| Language | Script | Code |
|---|---|---|
| Hindi | Devanagari | hi-IN |
| Bengali | Bangla | bn-IN |
| Tamil | Tamil | ta-IN |
| Telugu | Telugu | te-IN |
| Marathi | Devanagari | mr-IN |
| Gujarati | Gujarati | gu-IN |
| Kannada | Kannada | kn-IN |
| Malayalam | Malayalam | ml-IN |
| Assamese | Bangla | as-IN |
| Urdu | Nastaliq | ur-IN |
| Sanskrit | Devanagari | sa-IN |
| Nepali | Devanagari | ne-IN |
| Dogri | Devanagari | doi-IN |
| Bodo | Devanagari | brx-IN |
| Punjabi | Gurmukhi | pa-IN |
| Odia | Odia | od-IN |
| Konkani | Devanagari | kok-IN |
| Maithili | Devanagari | mai-IN |
| Sindhi | Arabic | sd-IN |
| Kashmiri | Arabic | ks-IN |
| Manipuri | Meitei | mni-IN |
| Santali | Ol Chiki | sat-IN |
| English | Latin | en-IN |

The table covers all 22 languages listed in the Eighth Schedule of the Indian Constitution, plus English. Multilingual pages are handled in a single pass.
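
For client-side validation, the table above can be carried as a small lookup. The sketch below copies the rows as listed; the helper names are illustrative, and how a language code is actually passed to the API is not shown here.

```python
from collections import defaultdict

# (language, script, code) rows copied from the table above.
LANGUAGES = [
    ("Hindi", "Devanagari", "hi-IN"),
    ("Bengali", "Bangla", "bn-IN"),
    ("Tamil", "Tamil", "ta-IN"),
    ("Telugu", "Telugu", "te-IN"),
    ("Marathi", "Devanagari", "mr-IN"),
    ("Gujarati", "Gujarati", "gu-IN"),
    ("Kannada", "Kannada", "kn-IN"),
    ("Malayalam", "Malayalam", "ml-IN"),
    ("Assamese", "Bangla", "as-IN"),
    ("Urdu", "Nastaliq", "ur-IN"),
    ("Sanskrit", "Devanagari", "sa-IN"),
    ("Nepali", "Devanagari", "ne-IN"),
    ("Dogri", "Devanagari", "doi-IN"),
    ("Bodo", "Devanagari", "brx-IN"),
    ("Punjabi", "Gurmukhi", "pa-IN"),
    ("Odia", "Odia", "od-IN"),
    ("Konkani", "Devanagari", "kok-IN"),
    ("Maithili", "Devanagari", "mai-IN"),
    ("Sindhi", "Arabic", "sd-IN"),
    ("Kashmiri", "Arabic", "ks-IN"),
    ("Manipuri", "Meitei", "mni-IN"),
    ("Santali", "Ol Chiki", "sat-IN"),
    ("English", "Latin", "en-IN"),
]

CODE_TO_LANGUAGE = {code: name for name, _script, code in LANGUAGES}


def is_supported_code(code: str) -> bool:
    """Check a language code against the table (case-sensitive)."""
    return code in CODE_TO_LANGUAGE


def languages_by_script() -> dict:
    """Group language names by the script listed for them."""
    groups = defaultdict(list)
    for name, script, _code in LANGUAGES:
        groups[script].append(name)
    return dict(groups)


print(len(LANGUAGES), is_supported_code("ta-IN"), is_supported_code("fr-FR"))
print(languages_by_script()["Devanagari"])
```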

## Supported Input Formats

- PDF documents (single and multi-page)
- PNG images
- JPG/JPEG images
- ZIP archives (batch processing with automatic page handling)
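
A pre-upload check against this list might look like the sketch below. It filters by file extension only, which is an assumption made for simplicity; a file with a supported extension can still be rejected for other reasons not covered here.

```python
from pathlib import Path

# Extensions derived from the supported formats listed above.
SUPPORTED_EXTENSIONS = {".pdf", ".png", ".jpg", ".jpeg", ".zip"}


def is_supported(filename: str) -> bool:
    """Return True if the file extension is in the supported set."""
    return Path(filename).suffix.lower() in SUPPORTED_EXTENSIONS


def partition_uploads(filenames):
    """Split candidate files into (accepted, rejected) lists."""
    accepted = [f for f in filenames if is_supported(f)]
    rejected = [f for f in filenames if not is_supported(f)]
    return accepted, rejected


accepted, rejected = partition_uploads(
    ["ledger.PDF", "scan_001.jpeg", "notes.docx", "batch.zip", "README"]
)
print(accepted)  # ['ledger.PDF', 'scan_001.jpeg', 'batch.zip']
print(rejected)  # ['notes.docx', 'README']
```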

## Document Types and Industry Use Cases

Akshar handles a wide range of document types across industries:

### Healthcare

Health checkups, blood test reports, prescriptions (general and dental). Extract structured data from medical records for digitisation and analysis.

### Financial Services

Financial statements, invoices, regulatory filings. Accurate table extraction for structured financial data.

### Legal

Contracts, court filings, legal records. Preserve document structure for searchable legal archives.

### Insurance

Claims documents, policy records, assessment reports.

### Education and Research

Manuscripts, newspapers, textbooks, primary sources. Extract text from academic repositories and research archives.

### Publishing and Archives

Scanned books, backlists, and out-of-print titles converted into accessible e-books. Historical document preservation.

### Government

Government records, archives, and public documents. Large-scale digitisation of administrative records across Indian languages.

## Benchmark Performance

Sarvam Vision, the foundation model powering Akshar, achieves:

- Best-in-class scores on global benchmarks including olmOCR-Bench and OmniDocBench for English
- Leading accuracy on the Sarvam Indic OCR Bench for Indian languages
- Outperforms frontier models (Gemini, GPT-series, Claude-series) on Indic OCR tasks
- 84.3% OCR accuracy on Indian documents (per independent benchmark reporting)

The best document intelligence models reach 90-95% accuracy; Akshar's agent-driven correction loop targets the remaining gap, aiming for near-100% accuracy in production use.
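
To put the 90-95% range in perspective, the back-of-envelope sketch below estimates how many errors remain on a page. It assumes accuracy is measured per character and uses a hypothetical page of 2,000 characters; the benchmarks cited above may define accuracy differently.

```python
def expected_errors(units: int, accuracy: float) -> float:
    """Expected number of wrong units at a given accuracy (0.0 to 1.0)."""
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be between 0 and 1")
    return units * (1.0 - accuracy)


CHARS_PER_PAGE = 2000  # hypothetical page length, for illustration only

for accuracy in (0.90, 0.95):
    errors = round(expected_errors(CHARS_PER_PAGE, accuracy))
    print(f"{accuracy:.0%} accuracy -> about {errors} errors per page")
# 90% accuracy -> about 200 errors per page
# 95% accuracy -> about 100 errors per page
```

Under these assumptions, even the upper end of the range leaves on the order of a hundred errors per page, which is the gap the review and correction workflow is meant to close.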

## API and Developer Access

### Document Intelligence API

Developers can add digitisation capabilities to their products via the Akshar API:

- **Endpoint**: Available through the Sarvam API platform
- **API documentation**: docs.sarvam.ai/api-reference-docs/document-intelligence
- **Features**: Job management, progress tracking, error handling, batch processing
- **Authentication**: API keys available at dashboard.sarvam.ai
- **Free tier**: 100 free credits on signup, no credit card required

### Key API Features

- Asynchronous processing for large documents
- Batch processing of multi-page documents and ZIP archives
- Page-level structured data in every response
- Choice of HTML or Markdown as the output format, with structured JSON always included
- Enterprise-ready scalability
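
Asynchronous processing usually means submitting a job and then polling until it finishes. The sketch below shows a generic polling loop with exponential backoff; it does not call the Sarvam API. The `get_status` callable and the status strings are hypothetical placeholders for whatever the real job-status call returns.

```python
import time
from typing import Callable

# Hypothetical status names; check the API documentation for the real ones.
TERMINAL_STATES = {"completed", "failed"}


def wait_for_job(
    get_status: Callable[[], str],
    *,
    poll_interval: float = 2.0,
    max_interval: float = 30.0,
    timeout: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll get_status() until a terminal state is reached or the timeout passes."""
    deadline = time.monotonic() + timeout
    interval = poll_interval
    while True:
        status = get_status()
        if status in TERMINAL_STATES:
            return status
        if time.monotonic() + interval > deadline:
            raise TimeoutError(f"job still '{status}' after {timeout} seconds")
        sleep(interval)
        # Back off so long-running jobs are not polled too aggressively.
        interval = min(interval * 2, max_interval)


# Demo with a fake job that finishes on the third poll; sleeping is stubbed out.
fake_statuses = iter(["pending", "running", "completed"])
final = wait_for_job(lambda: next(fake_statuses), sleep=lambda _seconds: None)
print(final)  # completed
```

Passing the status check in as a callable keeps the loop independent of any particular client library, so the same function works with a raw HTTP call or an SDK wrapper.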

## Platform Access

- **Akshar workbench**: akshar.sarvam.ai -- full document intelligence platform with visual review, agent corrections, and export
- **API playground**: dashboard.sarvam.ai/vision -- try Document Intelligence APIs directly
- **API documentation**: docs.sarvam.ai

## What Makes Akshar Unique

1. **Built for Indian documents** -- Native handling of 22 Indian languages with script-level accuracy, not retrofitted support on top of English OCR.
2. **Layout-aware semantic extraction** -- Understands document structure (headers, paragraphs, tables, footnotes) rather than processing raw text streams.
3. **Visual grounding** -- Every extracted block maps to exact coordinates on the source document for verification.
4. **Agent-driven error correction** -- AI agents identify and propose fixes for extraction errors, enabling experts to review hundreds of pages in the time it takes to manually transcribe one.
5. **Human-in-the-loop by design** -- All agent suggestions are reviewable. Auditability and trust are built into the core workflow.
6. **Structured output preservation** -- HTML, JSON, and Markdown output retains reading order, layout hierarchy, and semantic structure.
7. **Foundation model advantage** -- Powered by Sarvam Vision, a 3B-parameter vision-language model purpose-built for document intelligence across English and Indic languages.
