Summary of Post-OCR Text Correction for Bulgarian Historical Documents, by Angel Beshirov et al.
Post-OCR Text Correction for Bulgarian Historical Documents
by Angel Beshirov, Milena Dobreva, Dimitar Dimitrov, Momchil Hardalov, Ivan Koychev, Preslav Nakov
First submitted to arXiv on: 31 Aug 2024
Categories
- Main: Computation and Language (cs.CL)
- Secondary: Digital Libraries (cs.DL); Machine Learning (cs.LG)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary |
---|---|---|
High | Paper authors | The paper's original abstract (see the paper's arXiv listing) |
Medium | GrooveSquid.com (original content) | The paper addresses the challenge of converting scanned images of historical Bulgarian documents to text with Optical Character Recognition (OCR) and then correcting the recognized text. The proposed method uses state-of-the-art Large Language Models (LLMs) and an encoder-decoder framework augmented with a diagonal attention loss and with copy and coverage mechanisms. This approach improves the quality of the documents by 25%, an increase of 16% compared to the state of the art on the ICDAR 2019 Bulgarian dataset. |
Low | GrooveSquid.com (original content) | The paper helps preserve cultural heritage by developing a method for automatically generating synthetic data in two historical Bulgarian orthographies, Drinov and Ivanchev. The method builds this synthetic data from contemporary literary texts, and the data is then used to train LLMs and improve post-OCR text correction. This research contributes to the preservation of historical documents and makes them more accessible for search and information extraction. |
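
The summaries above do not define how the 25% quality improvement is measured. One common reading in post-OCR correction is the relative reduction in character-level edit distance to a ground-truth transcription; treating the figure that way is an assumption here, not something the summary states. Under that assumption, a minimal Python sketch looks like this (the function names and the toy strings are hypothetical, chosen only for illustration):

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance, computed with a two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free when equal)
            ))
        prev = curr
    return prev[-1]


def relative_improvement(ocr: str, corrected: str, gold: str) -> float:
    """Percent reduction in edit distance to the gold text after correction."""
    before = levenshtein(ocr, gold)
    after = levenshtein(corrected, gold)
    if before == 0:
        return 0.0  # the OCR output was already perfect; nothing to improve
    return 100.0 * (before - after) / before


# Toy example: the OCR output has two wrong characters, the correction fixes one.
print(relative_improvement("hist0rlcal", "historlcal", "historical"))  # 50.0
```

A negative value means the correction step moved the text further from the ground truth than the raw OCR output was.
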
Keywords
Artificial intelligence » Attention » Encoder-decoder » Synthetic data