Sarvam AI
Ekatra Foundation

Returning Gujarati Literature to the People

How Ekatra used Sarvam AI to turn scanned Gujarati books into readable, searchable text.


Education

"A culture is preserved only when people can read, debate, and build upon its ideas. Literature must be alive to be preserved. Not mummified for a museum.

We came to Sarvam with a hard problem: not just reading old Gujarati text, but understanding it. Archaic grammar, extinct words, linotype-era fonts, and a century of typographic conventions that no off-the-shelf OCR system was ever built to handle.

What we are building together is not a digitization pipeline. It is an act of cultural restitution. Fifty thousand books. Ten million pages. Returned to the people they belong to: readable, searchable, and alive."

Raj Mashruwala

Co-founder, Ekatra Foundation

Ekatra Foundation's Background

The Ekatra Foundation was established with a precise and personal mission: to preserve the Gujarati literature of the nineteenth and twentieth centuries, a body of work produced during one of the most consequential periods in Indian social and political history.

Gujarati literature from this era did not merely document reform. It participated in it. Writers, poets, and thinkers used the language to argue for change, to record resistance, and to imagine new possibilities. That literature is national history.

To Ekatra, preservation means more than storage. For a text to be read, searched, debated, and built upon, it needs to exist as Unicode: as actual, editable, structured text. Scanned images locked in a digital vault are not literature that circulates. A culture is preserved only when its ideas are alive and accessible to the people they belong to.


The Challenge

The problem was not that these books had been forgotten. Many had been scanned. Government and private institutions had built digital archives. But turning scans into usable, formatted Unicode text requires very high-quality Indic OCR, and that technology, for Gujarati, did not exist at the required level.

Ekatra's early partnerships with large technology companies to build Gujarati OCR systems ran into hard realities: scarce training data, low global prioritization of Gujarati, and teams that lacked the linguistic depth to handle the specific character of nineteenth- and twentieth-century text. That text combines archaic vocabulary, extinct orthographic conventions, and linotype-era fonts that bear only a passing resemblance to contemporary typefaces.

The result was approximately ninety percent accuracy. That figure sounds reasonable until you work through what it means in practice: roughly one error per line. At that error rate, proofreading costs as much as retyping the entire book. And skilled proofreaders for older Gujarati text are scarce.

So for their first thousand books, Ekatra retyped everything from scratch. That approach was not scalable to fifty thousand.
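The ninety-percent figure can be worked through as simple arithmetic. The sketch below is illustrative only: the source does not say whether accuracy was measured per character or per word, so it assumes per-word accuracy and about ten words per line (`WORDS_PER_LINE` is an assumption, not a figure from Ekatra), and it treats errors as independent.

```python
def expected_errors_per_line(accuracy: float, units_per_line: int) -> float:
    """Expected number of wrong units on one line, given per-unit accuracy."""
    return (1.0 - accuracy) * units_per_line


def chance_line_is_clean(accuracy: float, units_per_line: int) -> float:
    """Probability that a whole line has no error, assuming independent errors."""
    return accuracy ** units_per_line


WORDS_PER_LINE = 10  # assumption for illustration, not from the source

errors = expected_errors_per_line(0.90, WORDS_PER_LINE)
clean = chance_line_is_clean(0.90, WORDS_PER_LINE)

print(f"expected errors per line: {errors:.1f}")
print(f"chance a line is error-free: {clean:.0%}")
```

Under these assumptions a proofreader has to stop on most lines, which is why correcting the output costs about as much as retyping it.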


The Solution

The priority was a system that could handle the full complexity of Gujarati literary heritage while making large-scale processing economically viable. That meant handling not just modern, well-formed text, but a century's worth of typographic conventions, archaic grammar, and mixed-script usage.

Ekatra focused on building an AI-powered document pipeline that could:

  • Process scanned books end-to-end, from raw image to fully formatted ePub
  • Handle archaic vocabulary, extinct words, and linotype-era fonts
  • Understand mixed-script conventions unique to historical Gujarati publishing
  • Reduce proofreading effort without sacrificing fidelity to the original text
  • Scale to fifty thousand books without proportional increases in cost

Ekatra partnered with Sarvam, alongside Navjivan, one of Gujarat's most respected publishers, and Cygnet Infotech. The collaboration focused on the real linguistic and typographic complexities of the archive.

The technical solution is a pipeline of specialized models.

The first stage handles structure: the system identifies titles, chapters, page boundaries, columns, paragraphs, footnotes, and figures before attempting any text recognition. Layout understanding precedes character recognition.
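One way to picture "layout before recognition" is as a list of typed blocks that is put into reading order before any text is read. The sketch below is a hypothetical data model, not Sarvam's implementation: the `Block` fields, the block kinds, and the ordering rule (page, then column, then vertical position, with footnotes last on each page) are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    kind: str     # e.g. "title", "paragraph", "footnote", "figure"
    page: int
    column: int   # 0-based column index on the page
    top: float    # distance from the top of the page, any consistent unit


def reading_order(blocks: list[Block]) -> list[Block]:
    """Sort by page, then column, then vertical position; footnotes go last on each page."""
    return sorted(blocks, key=lambda b: (b.page, b.kind == "footnote", b.column, b.top))


def blocks_to_recognise(blocks: list[Block]) -> list[Block]:
    """Figures carry no running text, so only the other blocks go on to recognition."""
    return [b for b in reading_order(blocks) if b.kind != "figure"]


# Blocks as a layout detector might emit them, in arbitrary order.
detected = [
    Block("footnote", 1, 0, 900.0),
    Block("paragraph", 1, 1, 120.0),
    Block("title", 1, 0, 40.0),
    Block("figure", 1, 0, 400.0),
    Block("paragraph", 1, 0, 120.0),
]

kinds = [b.kind for b in blocks_to_recognise(detected)]
print(kinds)
```

Keeping the block type alongside the text is what later lets a footnote be rendered as a footnote rather than as a stray line in the middle of a paragraph.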

The second stage handles text within each structural block, in context. One example illustrates why this depth matters: a century ago, when italic and bold fonts were unavailable, Navjivan used Devanagari script within Gujarati text to mark emphasis. A naive OCR system reads this as Hindi. Sarvam's system understands that these Devanagari characters are functioning as emphasis markers in a Gujarati context, and preserves the intent correctly in the final output.
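The Devanagari-as-emphasis convention can be illustrated with Unicode block ranges alone. This is a minimal sketch, not the method Sarvam uses: it assumes recognition has already produced Unicode text, classifies each word by the first Devanagari (U+0900 to U+097F) or Gujarati (U+0A80 to U+0AFF) character it contains, and wraps runs of Devanagari words in `<em>`. The function names and the choice of `<em>` markup are assumptions for illustration.

```python
from html import escape
from itertools import groupby

# The danda and double danda sit in the Devanagari block but are shared
# punctuation across Indic scripts, so they are not evidence of a script.
SHARED_PUNCTUATION = {0x0964, 0x0965}


def script_of(ch: str) -> str:
    cp = ord(ch)
    if cp in SHARED_PUNCTUATION:
        return "neutral"
    if 0x0900 <= cp <= 0x097F:
        return "devanagari"
    if 0x0A80 <= cp <= 0x0AFF:
        return "gujarati"
    return "neutral"  # spaces, digits, Latin letters, other punctuation


def word_script(word: str) -> str:
    """Script of the first script-bearing character in the word."""
    for ch in word:
        script = script_of(ch)
        if script != "neutral":
            return script
    return "neutral"


def mark_emphasis(line: str) -> str:
    """Wrap each run of Devanagari words in <em>, leaving other words as they are."""
    pieces = []
    for is_dev, group in groupby(line.split(), key=lambda w: word_script(w) == "devanagari"):
        text = escape(" ".join(group))
        pieces.append(f"<em>{text}</em>" if is_dev else text)
    return " ".join(pieces)


sample = "આ सत्य છે"  # Gujarati sentence with one Devanagari word
marked = mark_emphasis(sample)
print(marked)
```

A rule this simple would also flag genuine Hindi quotations as emphasis, which is why the passage above stresses reading each block in context. Whether emphasised words are kept in Devanagari or rendered in Gujarati script in the final book is an editorial decision the source does not specify.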

The third component is a proofreading workbench that changes the nature of human review. Instead of correcting every character, proofreaders can issue high-level instructions, dramatically reducing the labor required and lowering the skill threshold for effective review.
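The source does not describe the workbench's instruction format, so the following is only a hypothetical illustration of the idea: a single book-wide instruction stands in for many individual corrections. The function name and the placeholder page text are invented for the example.

```python
def apply_replacement(pages: list[str], wrong: str, right: str) -> tuple[list[str], int]:
    """Replace every occurrence of `wrong` with `right`; return the new pages and the change count."""
    count = sum(page.count(wrong) for page in pages)
    return [page.replace(wrong, right) for page in pages], count


# Placeholder text standing in for recognised pages of one book.
pages = ["alpha beto gamma", "beto delta", "epsilon"]

# One instruction: "wherever the system read 'beto', it should be 'beta'".
fixed, changes = apply_replacement(pages, "beto", "beta")
print(changes, fixed)
```

The reviewer makes one judgment about a recurring misreading, and the tool applies it everywhere, instead of the reviewer finding and fixing each occurrence by hand.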

The output is a fully formatted ePub, usable for online reading, reprinting, text overlay on original scanned pages, or full-text search indexing. From a high-quality scan, a truly preserved book is created.

The Impact

In the past year, Ekatra has moved from one error per line to one error every ten pages for mainstream books. That shift is not incremental. It is the difference between a process that is economically unworkable and one that scales.

  • Accuracy: Character recognition and layout detection have reached a level that makes large-scale processing practical
  • Cost: Processing costs are expected to reach approximately ten rupees per page, and to continue declining
  • Scale: The fifty-thousand-book goal is now achievable within a two-year operational plan
  • Access: Literature previously locked in degrading physical archives is becoming readable, searchable, and shareable
  • Proofreader productivity: Human reviewers focus on high-level judgment rather than character-by-character correction
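The figures quoted in this case study can be combined into a rough sense of scale. The arithmetic below uses only numbers stated above (fifty thousand books, ten million pages, about ten rupees per page, a two-year plan). It assumes the target per-page cost applies to every page and that work is spread evenly over every calendar day, neither of which the source claims.

```python
BOOKS = 50_000
PAGES = 10_000_000
RUPEES_PER_PAGE = 10   # the expected cost quoted above, not a measured one
YEARS = 2

pages_per_book = PAGES / BOOKS               # average book length implied by the totals
total_cost = PAGES * RUPEES_PER_PAGE         # rupees, if the target cost held for all pages
total_cost_crore = total_cost / 10_000_000   # 1 crore = 10 million
books_per_year = BOOKS / YEARS
pages_per_day = PAGES / (YEARS * 365)

print(f"average pages per book: {pages_per_book:.0f}")
print(f"total processing cost: {total_cost_crore:.0f} crore rupees")
print(f"books per year: {books_per_year:.0f}")
print(f"pages per day: {pages_per_day:,.0f}")
```

These are order-of-magnitude estimates: the source expects the per-page cost to keep declining, so the real total would likely differ.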

The system is not finished. Progress on archaic vocabulary, rare fonts, and complex layouts continues. But the threshold has been crossed.

Fifty thousand books. Ten million pages. Returned to the people they were written for: readable, searchable, and alive.

