Summary of Post-OCR Text Correction for Bulgarian Historical Documents, by Angel Beshirov et al.

Post-OCR Text Correction for Bulgarian Historical Documents

by Angel Beshirov, Milena Dobreva, Dimitar Dimitrov, Momchil Hardalov, Ivan Koychev, Preslav Nakov

First submitted to arXiv on: 31 Aug 2024

GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

Summary difficulty	Written by	Summary
High	Paper authors	High Difficulty Summary Read the original abstract here
Medium	GrooveSquid.com (original content)	Medium Difficulty Summary The paper addresses the problem of converting scanned images of historical Bulgarian documents to text with Optical Character Recognition (OCR) and then correcting the recognized text. The proposed method uses state-of-the-art Large Language Models (LLMs) and an encoder-decoder framework augmented with diagonal attention loss, copy, and coverage mechanisms. On the ICDAR 2019 Bulgarian dataset, this approach improves document quality by 25%, a 16% increase over the previous state of the art.
Low	GrooveSquid.com (original content)	Low Difficulty Summary The paper helps preserve cultural heritage by developing a method for automatically generating synthetic data in historical Bulgarian orthographies, specifically the Drinov and Ivanchev styles. The method uses contemporary literature texts to create this synthetic data, which is then used to train LLMs and improve post-OCR text correction. This research contributes to the preservation of historical documents and makes them more accessible for search and information extraction.
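
The 25% figure in the medium summary is an improvement percentage. One common way to compute such a figure for post-OCR correction, assumed here for illustration and not confirmed by the summary, is the relative reduction in character-level edit distance to the ground truth. The sketch below shows that calculation; the function names `levenshtein` and `relative_improvement` and the sample strings are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def relative_improvement(ocr: str, corrected: str, truth: str) -> float:
    """Percentage reduction in edit distance to the ground truth.

    Assumption: quality is measured as distance to a reference transcription.
    Returns 0.0 when the OCR output already matches the ground truth.
    """
    before = levenshtein(ocr, truth)
    after = levenshtein(corrected, truth)
    if before == 0:
        return 0.0
    return 100.0 * (before - after) / before


if __name__ == "__main__":
    # Hypothetical strings: the OCR output has two typical confusions
    # ("1" for "i", "rn" for "m"); the corrector fixes only the first one.
    truth = "historical documents"
    ocr = "histor1cal docurnents"
    corrected = "historical docurnents"
    print(f"{relative_improvement(ocr, corrected, truth):.1f}% improvement")
```

A negative result would mean the corrector made the text worse than the raw OCR output, which is why this kind of metric is usually reported against an uncorrected baseline.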

Keywords

* Artificial intelligence * Attention * Encoder-decoder * Synthetic data
