Summary of Shiksha: a Technical Domain Focused Translation Dataset and Model For Indian Languages, by Advait Joglekar and Srinivasan Umesh


Shiksha: A Technical Domain focused Translation Dataset and Model for Indian Languages

by Advait Joglekar, Srinivasan Umesh

First submitted to arXiv on: 12 Dec 2024

Categories

  • Main: Computation and Language (cs.CL)
  • Secondary: Artificial Intelligence (cs.AI)


GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
Read the original abstract here

Medium Difficulty Summary (written by GrooveSquid.com, original content)
This paper tackles the challenge of building Neural Machine Translation (NMT) models that can effectively translate scientific, technical, and educational content for low-resource Indian languages. Existing NMT models struggle with tasks involving scientific understanding or technical jargon because they see little of these domains during training. To address this, the authors create a multilingual parallel corpus of more than 2.8 million high-quality translation pairs across eight Indian languages, built by bitext mining human-translated transcriptions of NPTEL video lectures. They fine-tune and evaluate NMT models on this corpus, achieving state-of-the-art performance on in-domain tasks and improving the baseline by over 2 BLEU points on average on out-of-domain translation tasks from the Flores+ benchmark. The authors release their model and dataset via a public link.

Low Difficulty Summary (written by GrooveSquid.com, original content)
This research paper is about making it easier to translate important material, such as scientific lectures or technical instructions, from one language to another, especially for languages that have little training data available. Computers currently struggle to translate these kinds of texts because they do not have enough examples to learn from. To fix this, the authors created a large database of more than 2.8 million translated text pairs across eight Indian languages and used it to train computer models that translate scientific and technical content more accurately. The new models performed better than existing ones on scientific and technical material, and they also improved on general texts outside that domain.
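
For readers curious what "bitext mining" means in practice, here is a toy sketch of the general idea: sentences in two languages are turned into vectors, and pairs that are each other's closest match are kept as likely translations. This is not the authors' pipeline. The vectors below are made-up stand-ins for sentence embeddings, and the `mine_bitext` function and its `threshold` value are hypothetical, chosen only for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mine_bitext(src_vecs, tgt_vecs, threshold=0.8):
    """Return (src_index, tgt_index, score) for mutual best matches.

    A pair is kept only if each sentence is the other's nearest
    neighbour and their similarity reaches the (assumed) threshold.
    """
    sims = [[cosine(s, t) for t in tgt_vecs] for s in src_vecs]
    pairs = []
    for i, row in enumerate(sims):
        # Best target sentence for source sentence i.
        j = max(range(len(row)), key=row.__getitem__)
        # Best source sentence for that target sentence.
        best_i = max(range(len(sims)), key=lambda k: sims[k][j])
        if best_i == i and row[j] >= threshold:
            pairs.append((i, j, row[j]))
    return pairs


# Made-up 3-dimensional "embeddings" for three source and three target sentences.
src = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
tgt = [[0.0, 0.9, 0.1], [0.9, 0.1, 0.0], [0.5, 0.5, 0.5]]

pairs = mine_bitext(src, tgt)
for i, j, score in pairs:
    print(f"source {i} <-> target {j} (similarity {score:.2f})")
```

In this toy data, the third source sentence has no confident match, so it is left out. Real mining systems work the same way in spirit but use learned multilingual sentence embeddings and far larger collections of text.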

Keywords

» Artificial intelligence  » BLEU  » Translation