
Summary of Post-OCR Text Correction for Bulgarian Historical Documents, by Angel Beshirov et al.


Post-OCR Text Correction for Bulgarian Historical Documents

by Angel Beshirov, Milena Dobreva, Dimitar Dimitrov, Momchil Hardalov, Ivan Koychev, Preslav Nakov

First submitted to arXiv on: 31 Aug 2024

Categories

  • Main: Computation and Language (cs.CL)
  • Secondary: Digital Libraries (cs.DL); Machine Learning (cs.LG)



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
See the paper's original abstract.

Medium Difficulty Summary (written by GrooveSquid.com, original content)
The paper addresses the problem of converting scanned images of historical Bulgarian documents to text using Optical Character Recognition (OCR) followed by a text correction step. The proposed method uses state-of-the-art Large Language Models (LLMs) and an encoder-decoder framework augmented with diagonal attention loss, copy, and coverage mechanisms. This approach improves the quality of the documents by 25%, which is a 16% increase over the previous state of the art on the ICDAR 2019 Bulgarian dataset.

Low Difficulty Summary (written by GrooveSquid.com, original content)
The paper helps preserve cultural heritage by developing a method for automatically generating synthetic data in two historical Bulgarian orthographies, Drinov and Ivanchev. The method creates this synthetic data from contemporary literary texts and uses it to train LLMs for post-OCR text correction. The research contributes to the preservation of historical documents and makes them easier to search and to mine for information.
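
To make the "improves the quality of the documents by 25%" figure from the medium difficulty summary more concrete, here is a minimal Python sketch of one common way to score post-OCR correction: the relative reduction in character-level edit distance to a ground-truth transcription. This is an assumption for illustration only. The summaries above do not say which metric the paper uses, and the function names `levenshtein` and `relative_improvement` are hypothetical, not taken from the paper.

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance: insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute, or keep on a match
            ))
        prev = curr
    return prev[-1]


def relative_improvement(ocr: str, corrected: str, truth: str) -> float:
    """Percentage of OCR errors removed by correction (assumed metric).

    Positive values mean the corrected text is closer to the ground truth
    than the raw OCR output; negative values mean correction made it worse.
    """
    before = levenshtein(ocr, truth)
    after = levenshtein(corrected, truth)
    if before == 0:
        return 0.0  # nothing to fix, so no improvement is possible
    return 100.0 * (before - after) / before
```

For example, with the made-up strings `truth = "historical text"`, `ocr = "hist0rica1 tcxt"`, and `corrected = "historical tcxt"`, the raw OCR output is three edits away from the truth and the corrected text is one edit away, so `relative_improvement(ocr, corrected, truth)` is about 66.7. Under this assumed metric, a 25% improvement would mean that a quarter of the character-level errors were removed.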

Keywords

» Artificial intelligence  » Attention  » Encoder-decoder  » Synthetic data