Summary of PEaCE: A Chemistry-Oriented Dataset for Optical Character Recognition on Scientific Documents, by Nan Zhang et al.
PEaCE: A Chemistry-Oriented Dataset for Optical Character Recognition on Scientific Documents
by Nan Zhang, Connor Heaton, Sean Timothy Okonsky, Prasenjit Mitra, Hilal Ezgi Toraman
First submitted to arXiv on: 23 Mar 2024
Categories
- Main: Computation and Language (cs.CL)
- Secondary: Artificial Intelligence (cs.AI)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary |
---|---|---|
High | Paper authors | High Difficulty Summary Read the original abstract here |
Medium | GrooveSquid.com (original content) | Medium Difficulty Summary The paper presents the Printed English and Chemical Equations (PEaCE) dataset, built for Optical Character Recognition (OCR) on images of chemistry publications, including tables and chemical equations. The dataset contains both synthetic and real-world records, and the authors train OCR models on it. Because real-world records contain artifacts that synthetic data lacks, the authors introduce transformations that mimic those artifacts. Experimental results show that models that use a small patch size, are trained on multiple domains, and apply the proposed transformations achieve the best performance. |
Low | GrooveSquid.com (original content) | Low Difficulty Summary The paper is about a new way for computers to read text from pictures of science documents, like chemistry notes. This is important because current computer programs can’t always understand scientific equations or tables. The researchers created a special dataset with both made-up and real-world examples to help train better models. They also came up with ways to make the training data more realistic by adding fake “artifacts” that occur in real documents. By testing different approaches, they found that combining small image patches, multiple types of data, and these new transformations led to the best results. |
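The summaries above mention two ideas that are easier to picture with code: transformations that make synthetic images look more like real scans, and cutting an image into small patches before a model reads it. The following is only a minimal sketch under stated assumptions. The summaries do not say which transformations the authors use, so the Gaussian noise and speckle below are illustrative stand-ins, and the function names (`add_scan_artifacts`, `split_into_patches`), parameters, and image sizes are all hypothetical.

```python
import random


def add_scan_artifacts(image, noise_std=10.0, speckle_prob=0.02, seed=0):
    """Return a noisy copy of a grayscale image (list of rows, values 0-255).

    Assumption: Gaussian noise plus random black/white speckles stand in for
    real-world scan artifacts; the paper's own transformations may differ.
    """
    rng = random.Random(seed)
    noisy = []
    for row in image:
        new_row = []
        for pixel in row:
            if rng.random() < speckle_prob:
                # Speckle: force the pixel to pure black or pure white.
                value = rng.choice((0, 255))
            else:
                # Otherwise jitter the intensity with Gaussian noise.
                value = pixel + rng.gauss(0.0, noise_std)
            # Clamp to the valid 0-255 range and store as an integer.
            new_row.append(int(min(255, max(0, round(value)))))
        noisy.append(new_row)
    return noisy


def split_into_patches(image, patch_size):
    """Cut an image into non-overlapping square patches, row by row."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image dimensions must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [row[left:left + patch_size]
                     for row in image[top:top + patch_size]]
            patches.append(patch)
    return patches


# A blank 16x16 "synthetic" page (hypothetical toy input).
synthetic = [[255] * 16 for _ in range(16)]
augmented = add_scan_artifacts(synthetic)

# A smaller patch size yields more, finer-grained patches per image.
small_patches = split_into_patches(augmented, 4)
large_patches = split_into_patches(augmented, 8)
```

In this toy setup, halving the patch size from 8 to 4 quadruples the number of patches, which hints at why a small patch size could help a model resolve fine details such as subscripts in chemical equations, at the cost of longer input sequences.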
Keywords
- Artificial intelligence
- Synthetic data