OpenHathi Series: An Approach To Build Bilingual LLMs Frugally
DECEMBER 12, 2023
SARVAM AI
The OpenHathi series at Sarvam AI aims to contribute open models and datasets to the ecosystem and encourage innovation in Indian language AI. It is a partnership with our academic partners at AI4Bharat, who have contributed language resources and benchmarks. We encourage people to innovate on top of this release by building fine-tuned models for different use-cases. Sarvam AI will additionally release enterprise-grade models on its full-stack GenAI platform, which will launch soon.

The technology of large language models has seen rapid innovation both in closed models such as GPT and Claude, and in open models such as Llama and Mistral. Open models have democratised access by enabling local execution and specialisation for custom requirements. However, these open models have very limited or no Indic language support.

As a specific example, the above graphic shows the responses of Llama-2-7B, Mistral-7B, and GPT-3.5 on the task of translating English text to Hindi. Llama-2 and Mistral, while capable of outputting Devanagari text, have not been trained on Hindi at any meaningful scale.

One of the reasons for the missing Indic language support is the paucity of high-quality and diverse Indic language content at scale. As an example, the following graphic shows the diversity in 10 million words from comparable corpora from the internet - CommonCrawl for English and the Sangraha corpus (link to be added soon) by AI4Bharat for Hindi. English has more unique words and significantly more mass towards the tail of the distribution. For instance, in 10M words, about 60K Hindi words occur more than 20 times, while over 90K English words do. This is in spite of Hindi being more inflectional than English.

Also, in models such as GPT-3.5 which do support various Indic languages, there is a challenge with ‘tokenisation’: Indic language text is not represented efficiently. As an example, the following graphic shows the number of tokens for a few sentences in English and their corresponding translations in Hindi. The same information, when expressed in Hindi, requires almost 3-4x as many tokens with the GPT and Llama tokenisers.

Another consideration is the rapid progress in the release of open-source models. These improvements come from the use of different datasets, different alignment methods, ideas such as mixture-of-experts, and ongoing research on non-transformer architectures. The process of adding a new language to an existing model should therefore be generic and cost-efficient.

The missing support for Indic languages in open models, the limited availability of high-quality data, and expensive tokenisation together impede innovation and adoption of large language models for Indic languages. We thus pose the following question - what is the right approach to frugally add support for a new language to an existing open model? Here, frugality is measured in terms of the size of the high-quality pre-training corpus, the amount of compute for training, and efficiency at inference. In this note, we describe our approach and demonstrate its effectiveness by adding Hindi skills to the Llama-2 7B model as a first example. We release the initial version of the base model, which can be found here on 🤗HuggingFace: OpenHathi-7B-Hi-v0.1-Base. We also fine-tune the base model internally and discuss its capabilities in the following video and later in this report.

The Tokeniser

The first step in adding Hindi skills to Llama-2 is decreasing the fertility score (the average number of tokens a word is split into) of its tokeniser on Hindi text. This makes both training and inference faster and more efficient. We train a SentencePiece tokeniser with a vocabulary size of 16K on a subsample of 100K documents from the Sangraha corpus, created at AI4Bharat. We then merge this with the Llama-2 tokeniser to create a new tokeniser with a 48K vocabulary (the 32K original vocabulary plus our added 16K).
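For readers who want to replicate this step, a minimal sketch of the train-and-merge flow is below. The file names are placeholders, and the proto-level merge shown is one common way of combining SentencePiece vocabularies, not necessarily our exact pipeline.

```python
# Sketch: train a 16K-vocabulary Hindi SentencePiece model and merge its pieces
# into the Llama-2 tokeniser. Paths and hyperparameters are illustrative only.
import sentencepiece as spm
from sentencepiece import sentencepiece_model_pb2 as sp_pb2
from transformers import LlamaTokenizer

# 1. Train a BPE SentencePiece model on a Hindi subsample (hypothetical file).
spm.SentencePieceTrainer.train(
    input="sangraha_hindi_subsample.txt",   # assumed dump of ~100K documents
    model_prefix="hindi_sp",
    vocab_size=16000,
    model_type="bpe",
    character_coverage=1.0,                 # keep all Devanagari characters
)

# 2. Load both tokenisers as SentencePiece protos.
llama_tok = LlamaTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
llama_proto = sp_pb2.ModelProto()
llama_proto.ParseFromString(llama_tok.sp_model.serialized_model_proto())

hindi_proto = sp_pb2.ModelProto()
hindi_proto.ParseFromString(open("hindi_sp.model", "rb").read())

# 3. Append Hindi pieces that Llama-2 does not already have (32K + ~16K ≈ 48K).
existing = {p.piece for p in llama_proto.pieces}
for piece in hindi_proto.pieces:
    if piece.piece not in existing:
        new_piece = sp_pb2.ModelProto.SentencePiece()
        new_piece.piece = piece.piece
        new_piece.score = 0.0
        llama_proto.pieces.append(new_piece)

with open("merged_tokenizer.model", "wb") as f:
    f.write(llama_proto.SerializeToString())
```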

The following graphic shows the fertility score of this new tokeniser across a few samples from different sources of text. On average, 70% of Hindi words across domains are not broken into smaller tokens. This number goes up to 82% for easy-to-understand news items. More tokens per word are required for formal language (e.g. the Preamble of the Constitution of India) or code-mixed Hindi-English language (e.g. a machine-generated YouTube video transcript).
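The fertility score itself is simple to compute: tokenise each word of a sample, average the number of pieces, and track how many words stay whole. A minimal sketch (the repository ID for the released tokeniser is an assumption):

```python
# Sketch: fertility = average number of tokens per word, plus the share of
# words that are not split at all. The model ID is illustrative.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("sarvamai/OpenHathi-7B-Hi-v0.1-Base")

def fertility(text: str):
    words = text.split()
    pieces_per_word = [len(tokenizer.tokenize(w)) for w in words]
    unsplit = sum(1 for n in pieces_per_word if n == 1)
    return sum(pieces_per_word) / len(words), unsplit / len(words)

avg_tokens, frac_whole = fertility("भारत एक विशाल और विविधतापूर्ण देश है")
print(f"fertility = {avg_tokens:.2f}, unsplit words = {frac_whole:.0%}")
```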

When comparing against tokenisers from the GPT and Llama models, the benefit of a custom tokeniser is substantial, as seen in the following graphic, with improvements in the range of 3-4x. This is critical for bringing down the cost of inference when deploying these models for cost-sensitive use-cases.

A common way Hindi is written, especially on mobile devices, is through an English keyboard using Roman script (e.g., Hathi instead of हाथी), known as Romanised Hindi. The following graphic shows that the Devanagari text is a more efficient representation than the Romanised text on the original Llama tokeniser. But the difference is much smaller than in the previous graphic, suggesting that Romanised use of these models is also worth considering. We discuss this in more detail later.
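To give a concrete feel for this comparison, the sketch below romanises a Devanagari sentence with the indic-transliteration package (one possible rule-based scheme, not necessarily the one we use) and counts tokens for both forms under the original Llama-2 tokeniser.

```python
# Sketch: compare Llama-2 token counts for Devanagari vs Romanised Hindi.
# The ITRANS transliteration scheme is an illustrative choice of romanisation.
from indic_transliteration import sanscript
from indic_transliteration.sanscript import transliterate
from transformers import LlamaTokenizer

tokenizer = LlamaTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

devanagari = "हाथी जंगल में रहता है"
romanised = transliterate(devanagari, sanscript.DEVANAGARI, sanscript.ITRANS)

print(romanised)  # a Roman-script rendering, e.g. something like "hAthI jaMgala meM rahatA hai"
print(len(tokenizer.tokenize(devanagari)), "tokens for the Devanagari form")
print(len(tokenizer.tokenize(romanised)), "tokens for the Romanised form")
```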

Phase 1: Training with translation

With the above tokeniser, the embedding layer of the Llama-2-7B model is expanded from 32K to 48K tokens. The embeddings for these new tokens are randomly initialised.
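In 🤗Transformers terms, this expansion is essentially one call; a sketch, assuming the merged tokeniser file from the earlier sketch:

```python
# Sketch: expand the Llama-2-7B input/output embeddings to the merged 48K
# vocabulary; the new rows are randomly initialised by default.
from transformers import LlamaForCausalLM, LlamaTokenizer

model = LlamaForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
merged_tokenizer = LlamaTokenizer(vocab_file="merged_tokenizer.model")  # file from the merge sketch

model.resize_token_embeddings(len(merged_tokenizer))
print(model.get_input_embeddings().weight.shape)  # roughly (48000, 4096)
```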

At this point, we can do vanilla ‘continual pre-training’ of the base Llama-2 model with English and Hindi text. We instead choose to have a first phase where the model learns to translate paragraphs of text between Hindi and English. We collect 1.25B tokens of source text each in Hindi and English from the Sangraha corpus and translate these with AI4Bharat's IndicTrans2 model. We also romanise 0.5B tokens of Hindi using a rule-based method to simulate colloquially-typed text. With this data, we train the model to predict the original text given its translation thereby not training on ‘translation-ese’. We do this with a low-rank adapter to avoid affecting the model’s existing skills in English.
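A rough sketch of this setup is below: a LoRA adapter is attached with the peft library, and each training example pairs a machine translation (as the prompt) with the original human-written text (as the target), masking the loss on the prompt tokens. Checkpoint paths, LoRA hyperparameters, and the exact formatting are assumptions for illustration, not our training configuration.

```python
# Sketch: attach a LoRA adapter and build a "predict the original given its
# translation" example with the loss masked on the prompt side.
import torch
from peft import LoraConfig, get_peft_model
from transformers import LlamaForCausalLM, LlamaTokenizer

tokenizer = LlamaTokenizer(vocab_file="merged_tokenizer.model")       # from the merge sketch above
model = LlamaForCausalLM.from_pretrained("path/to/expanded_llama2")   # hypothetical checkpoint after resize_token_embeddings

lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

def make_example(machine_translation: str, original_text: str):
    # Prompt = machine translation, target = original human-written text,
    # so the model never learns to produce "translation-ese".
    prompt_ids = tokenizer(machine_translation + "\n", add_special_tokens=True).input_ids
    target_ids = tokenizer(original_text, add_special_tokens=False).input_ids
    input_ids = prompt_ids + target_ids
    labels = [-100] * len(prompt_ids) + target_ids   # no loss on the prompt
    return {"input_ids": torch.tensor(input_ids), "labels": torch.tensor(labels)}
```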

Training the model at this scale modifies the embeddings of the tokens. In the following graphic, we plot the two principal components of the embeddings of 75 tokens which are complete words across four categories - verbs, animals, pronouns, and adjectives. We see that while the initialised embeddings were random, after the phase of training with translation, distinct semantic clusters emerge for the four categories of words.
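A plot of this kind can be reproduced with a simple PCA over rows of the embedding matrix; a sketch is below. The model ID and word lists are placeholders (the plot in this post uses our Phase-1 checkpoint, not the released base model).

```python
# Sketch: project the embeddings of a few whole-word tokens onto their two
# principal components. Model ID and word lists are placeholders.
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("sarvamai/OpenHathi-7B-Hi-v0.1-Base")
model = AutoModelForCausalLM.from_pretrained("sarvamai/OpenHathi-7B-Hi-v0.1-Base")
emb = model.get_input_embeddings().weight.detach()

words = {   # a few placeholder words per category
    "animals":  ["हाथी", "कुत्ता", "बिल्ली"],
    "verbs":    ["खाना", "जाना", "सोना"],
    "pronouns": ["मैं", "तुम", "वह"],
}

ids, labels = [], []
for cat, ws in words.items():
    for w in ws:
        tid = tok.convert_tokens_to_ids("▁" + w)   # whole-word SentencePiece piece
        if tid != tok.unk_token_id:
            ids.append(tid)
            labels.append(cat)

xy = PCA(n_components=2).fit_transform(emb[ids].numpy())
for cat in words:
    pts = xy[[i for i, l in enumerate(labels) if l == cat]]
    plt.scatter(pts[:, 0], pts[:, 1], label=cat)
plt.legend()
plt.savefig("token_embedding_pca.png")
```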

Phase 2: Training on bilingual next token prediction

In the next phase of training, we aim to teach the model ‘world knowledge’ in Hindi. However, as discussed, there is a paucity of content in Hindi. For instance, Wikipedia contains about 150K articles in Hindi in total (as of the time of writing), which is merely around 2% of the number of articles in English (in addition to differences in average article length). Thus, we translate content available in English sources, primarily Wikipedia, using IndicTrans2. However, instead of doing monolingual next-token prediction with the translated Hindi text, we propose bilingual next-token prediction, wherein alternate sentences are in Hindi and English. Given that context in a paragraph spans multiple sentences, bilingual next-token prediction requires the model to attend cross-lingually to information during next-token prediction. For instance, predicting an English token in the second sentence requires attending to Hindi tokens in the preceding sentence. We hypothesise that this increases the alignment between Hindi and English. In addition, this approach naturally balances the Hindi and English tokens seen during training. As in the first phase, we use a low-rank adapter in this phase as well.
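The data construction is simple to sketch: given a sentence-aligned English paragraph and its Hindi translation, alternate which language carries each sentence position. The sentence splitting and the exact interleaving pattern below are illustrative assumptions.

```python
# Sketch: build a bilingual next-token-prediction example by interleaving
# alternate sentences in Hindi and English.
def interleave_bilingual(english_sentences, hindi_sentences):
    """english_sentences[i] and hindi_sentences[i] are translations of each other."""
    mixed = []
    for i, (en, hi) in enumerate(zip(english_sentences, hindi_sentences)):
        # Alternate which language carries each sentence position.
        mixed.append(hi if i % 2 == 0 else en)
    return " ".join(mixed)

english = [
    "The elephant is the largest land animal.",
    "It lives in herds led by the oldest female.",
]
hindi = [
    "हाथी सबसे बड़ा स्थलीय जानवर है।",
    "यह सबसे बुज़ुर्ग मादा के नेतृत्व वाले झुंड में रहता है।",
]
print(interleave_bilingual(english, hindi))
# -> Hindi sentence 1 followed by English sentence 2: predicting the English
#    tokens requires attending to the preceding Hindi sentence.
```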

The value of bilingual pre-training needs to be robustly evaluated. In some early results, we found that when fine-tuned on a common internal dataset, a model pre-trained bilingually performed better than a model pre-trained monolingually on 5x as many tokens, as shown in the following graph.

To illustrate bilingual language modelling after training, we step through the next token prediction for two similar sentences showing the probabilities of the most probable next tokens at each step. The two sequences illustrate how the generation of the English text is adapted based on the content of the previous sentence in Hindi.
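An inspection of this kind can be reproduced by stepping through generation one token at a time and printing the top next-token probabilities; a sketch, with an illustrative model ID and prompt:

```python
# Sketch: step through greedy generation and print the most probable next
# tokens at each step. Model ID and prompt are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("sarvamai/OpenHathi-7B-Hi-v0.1-Base")
model = AutoModelForCausalLM.from_pretrained("sarvamai/OpenHathi-7B-Hi-v0.1-Base")
model.eval()

# A Hindi sentence followed by the start of its English continuation.
ids = tok("हाथी जंगल में रहता है। The elephant", return_tensors="pt").input_ids

for _ in range(5):
    with torch.no_grad():
        logits = model(ids).logits[0, -1]
    probs = torch.softmax(logits, dim=-1)
    top = torch.topk(probs, k=5)
    print([(tok.decode(int(i)), round(float(p), 3)) for i, p in zip(top.indices, top.values)])
    next_id = top.indices[0].view(1, 1)          # greedy: take the argmax
    ids = torch.cat([ids, next_id], dim=-1)
```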



We open-source the initial version of this phase as our base model, and make it available on 🤗HuggingFace for further finetuning.

Phase 3: Supervised Fine-tuning

We fine-tune the base model on internal datasets and benchmark it on a variety of tasks. These tasks include standard benchmarks such as translation, and also several new use-cases such as toxicity classification, text simplification, write-in-English-then-in-Hindi etc., which are motivated by real-world usage of these models.

English performance comparison against Llama-2 Base model

A standard problem in continual pre-training of an existing base model, especially when introducing a new language, is catastrophic forgetting, wherein the model degrades on skills it previously had. In this case, we evaluate our base model for degradation in English performance relative to the Llama-2-7B model on a variety of standard English benchmarks.

We see that there is a small degradation across tasks, with our model also being slightly better on some tasks. One consideration is that our base model is trained with bilingual language modelling, and may therefore exhibit worse performance on some monolingual benchmarks. In sum, the use of low-rank adapters and additional vocabulary does not drastically affect English as Hindi skills are added.

Translation

Translation has been a central part of work on Indic language AI. We fine-tune the base model on the human-verified subset of the BPCC dataset, and compare against both generative models (GPT-3.5/GPT-4) and state-of-the-art translation models (IndicTrans2/Google Translate) by evaluating on the FLoRes-200 benchmark.

  • Devanagari Hindi → English: the model outperforms GPT-3.5 and GPT-4 in terms of BLEU score, while slightly trailing IndicTrans2 and Google Translate, which have been specifically fine-tuned for translation. (All Google Translate scores are taken from the IndicTrans2 report.)
  • English → Devanagari Hindi: our model is on par with IndicTrans2 and Google Translate and outperforms GPT-3.5 and GPT-4 by a fair margin.
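For reference, BLEU scores of the kind quoted above can be computed with sacreBLEU; a minimal sketch with placeholder file paths:

```python
# Sketch: corpus BLEU with sacreBLEU on FLoRes-200-style line-aligned files.
# File paths are placeholders.
import sacrebleu

hypotheses = open("model_outputs.hi", encoding="utf-8").read().splitlines()
references = [open("flores200_devtest.hi", encoding="utf-8").read().splitlines()]

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU = {bleu.score:.1f}")
```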

While most work on translation has focused on Devanagari Hindi, we also explore translation between Romanised Hindi and English. Given the popularity of Romanised input for Hindi, this task presents new opportunities for AI models to make content accessible. Existing translation systems do not support Romanised Hindi, and hence we benchmark only against the GPT models.

  • Romanised Hindi ↔ English: our model is significantly better in both directions when compared to the GPT models.
    (Note: IndicTrans2 and the Google Translate API do not support Romanised Hindi.)

What is remarkable is that, in our model, translation from and to Romanised Hindi is marginally more accurate than that for Devanagari Hindi (which is not the case for the GPT models, which support both). This suggests that the English token embeddings of the base Llama model are effectively shared between English and Romanised Hindi content.

Simplified Translation

One downside of standard Machine Translation (MT) models is that they often output terse, formal language which can be difficult to understand and is not used colloquially. Another common problem is their inability to transfer idioms, which can be culture-specific. To address such problems, we add a simplified translation dataset, created in-house, to the fine-tuning corpus. This makes the model's translations more accessible across categories, as illustrated in the following examples.

We compared the simplified translations from our model against the outputs from GPT-3.5 with the same prompt, and used GPT-4 as an evaluator on a benchmark created by AI4Bharat (link coming soon). We found that for most examples, GPT-4 rates the output of our model to be better. A formal understanding of simplified translation and quantitative benchmarking is a useful direction given the ability of these generative models.

Write-in-English, Then-in-Hindi

Given the large amount of English text the base model has seen, it is more capable and accurate in English and produces more useful answers there. One way of leveraging this is through Chain-of-Thought (CoT) prompting, i.e., asking the model to generate its output in English and then rewrite it in Hindi.
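Concretely, the pattern looks something like the sketch below; the instruction wording is our illustration, not the exact fine-tuning template.

```python
# Sketch: "write in English, then in Hindi" as a chain-of-thought style prompt.
# The instruction wording is illustrative, not the exact fine-tuning template.
question = "सौर ऊर्जा के क्या लाभ हैं?"   # "What are the benefits of solar energy?"

# Direct prompting: answer in Hindi straight away.
direct_prompt = f"निम्न प्रश्न का उत्तर हिंदी में दीजिए।\n\nप्रश्न: {question}\nउत्तर:"

# CoT-style prompting: answer in English first, then rewrite in Hindi,
# letting the model lean on its stronger English capabilities.
cot_prompt = (
    "First answer the question in English, then rewrite the same answer in Hindi.\n\n"
    f"Question: {question}\n"
    "English answer:"
)
```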

We evaluated this approach on MT-Bench, translating the whole evaluation set (except the math and coding subsets) to Hindi using IndicTrans2. When evaluated against GPT-3.5, our model is better only 12% of the time if it does not use CoT. This number improves to 40% when CoT is used.

Social Media Content Moderation

Real-time content moderation is becoming a necessary tool for stopping the spread of hateful, toxic content on social media platforms. We therefore partner with Koo and fine-tune our model to detect posts that touch upon a list of sensitive topics, which can then be sent for manual assessment if required. Some examples are shown below (contains inappropriate content).

We also compared this content moderation with GPT-3.5, using GPT-4 as a judge, and found that GPT-4 prefers our predictions 70% of the time. Again, more careful quantitative benchmarking is required to understand this important application.

Answering questions on a passage

A common application of LLMs is the RAG (retrieval-augmented generation) paradigm, wherein an external context is provided and a conversation is grounded in this context. We fine-tuned the model both to identify correct answers and to decline to answer when the question is not answerable from the context.
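A rough sketch of such a grounded-QA prompt, with an explicit decline instruction, is below; the wording is illustrative rather than our exact template.

```python
# Sketch: a grounded-QA prompt with an explicit "decline" option when the
# answer is not in the context. The wording is illustrative.
def build_rag_prompt(context: str, question: str) -> str:
    return (
        "Answer the question using only the passage below. If the passage "
        "does not contain the answer, reply exactly: "
        "\"इस संदर्भ में उत्तर उपलब्ध नहीं है।\"\n\n"   # "The answer is not available in this context."
        f"Passage: {context}\n"
        f"Question: {question}\n"
        "Answer:"
    )

print(build_rag_prompt("हाथी शाकाहारी होते हैं।", "हाथी क्या खाते हैं?"))
# Passage: "Elephants are herbivores."  Question: "What do elephants eat?"
```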

AI4Bharat created a cross-lingual benchmark (link coming soon) on which we report the results below. We found that our model underperformed, mainly because it declined to answer many questions that were answerable. We believe that improving our fine-tuning dataset will help resolve this. But upon manual inspection, we found the model to perform well on documents outside the benchmark. We will evolve both the dataset and the benchmarking methodology, given the importance of this task for real-world usage.

KissanAI use-case

We also partnered with KissanAI to fine-tune our base model on conversational data that they collected. This data contains conversations from a GPT-powered bot that interacts with farmers in various languages. We then evaluated our model on a held-out test set against GPT-3.5 generations, with GPT-4 as the judge. We found that our model performs worse in English, but is better in Hindi - in both native and Romanised scripts. Pratik Desai, founder of KissanAI, has put out a video sharing his experience with the model: Dhenu 1.0 - Pioneering Agriculture Model.

These use-cases suggest that domain-specific models for Indic languages can be built with custom data that perform better than GPT-3.5 while benefiting from the efficient tokeniser.

Limitations and Conclusion

We discussed an approach to build bilingual LLMs frugally. With the Llama-2 7B model and Hindi, we showed that when fine-tuned on task-specific datasets, such a model can be competitive with and even surpass models such as GPT-3.5, which also incur a higher cost due to inefficient tokenisation. For languages with fewer resources than Hindi, the performance of GPT-3.5 is lower, and thus there is an even bigger opportunity to support those languages natively on top of custom models.

There are several ways in which the presented approach can be scaled up. For instance, the amount of data in the translation phase can be increased - we believe 5B tokens is an achievable target for many Indic languages. In the bilingual language modelling phase, we can increase the dataset several-fold from the current usage of Wikipedia and some related sources, though our experiments showed that content needs to be filtered for quality. The model size and family can also be changed to start from a stronger base model. A third phase of training on high-quality monolingual text could also be considered.

Note: The open-sourced base model has been trained with bilingual language modelling and thus needs fine-tuning on tasks for it to be used as an instruction-following model. It has also not been aligned and thus can occasionally generate inappropriate content seen in its original pretraining.

Acknowledgements and Call to Collaboration

ओमिति एतत् अक्षरं सर्वं, तस्य उपव्याख्यानं भूतं भवद्‌ भविष्यदिति सर्वं ओङ्कार एव।
यच्चान्यत्‌ त्रिकालातीतं तदपि ओङ्कार एव ॥ [ माण्डुक्योपनिषद् / १ ]
("Om, this syllable, is all this. All that is past, present, and future is verily Om; and whatever is beyond the three periods of time, that too is verily Om." - Mandukya Upanishad, 1)

We would like to thank AI4Bharat, IIT Madras and Prof. Mitesh Khapra for supporting this effort closely by sharing language corpora and benchmarks. They also helped with various manual evaluations of the model. We are thankful to the Bhashini program, MeitY, which has sponsored the creation of IndicTrans2. We are also grateful for the support of Nandan Nilekani and EkStep Foundation for setting us on this journey of building AI for Indian languages.

We would like to thank the partners who contributed data - VerSe, Koo, and KissanAI. The partnership with VerSe was particularly significant in recognising simplified translation, paraphrasing, etc., as critical applications. We look forward to working with organisations interested in contributing to and adopting Indic language models.

We would like to thank Meta for discussions on building on top of the Llama model, the Google Cloud team for the partnership on cloud, and Nvidia for support with efficient usage of GPUs including our inferencing setup.

The OpenHathi series is an invitation for collaboration on datasets and models. If you are interested in contributing or need help with adoption, please write to us at openhathi@sarvam.ai.