Summary of PEaCE: A Chemistry-Oriented Dataset for Optical Character Recognition on Scientific Documents, by Nan Zhang et al.

PEaCE: A Chemistry-Oriented Dataset for Optical Character Recognition on Scientific Documents

by Nan Zhang, Connor Heaton, Sean Timothy Okonsky, Prasenjit Mitra, Hilal Ezgi Toraman

First submitted to arXiv on: 23 Mar 2024

GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

Summary difficulty	Written by	Summary
High	Paper authors	High Difficulty Summary Read the original abstract here
Medium	GrooveSquid.com (original content)	Medium Difficulty Summary The paper presents the Printed English and Chemical Equations (PEaCE) dataset, built for training Optical Character Recognition (OCR) models that can identify text in images of chemistry publications, including tables and chemical equations. The dataset contains both synthetic and real-world records. Because real-world records contain artifacts that synthetic data lacks, the authors introduce transformations that mimic those artifacts. Experimental results show that models with small patch sizes, multi-domain training, and these proposed transformations achieve the best performance.
Low	GrooveSquid.com (original content)	Low Difficulty Summary The paper is about a new way for computers to read text from pictures of science documents, like chemistry notes. This is important because current computer programs can’t always understand scientific equations or tables. The researchers created a special dataset with both made-up and real-world examples to help train better models. They also came up with ways to make the training data more realistic by adding fake “artifacts” that occur in real documents. By testing different approaches, they found that combining small image patches, multiple types of data, and these new transformations led to the best results.
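To make two of the ideas in the summaries more concrete, here is a minimal Python sketch. It is an illustration only, not the authors' code: the summaries do not say which transformations the paper uses, so the speckle noise below is an assumed stand-in for a "real-world artifact", and the function names `add_salt_and_pepper` and `split_into_patches` are hypothetical. The second function shows what "patch size" means: the same image cut into smaller tiles yields more, finer-grained pieces for a model to look at.

```python
import random


def add_salt_and_pepper(image, amount=0.05, seed=0):
    """Return a copy of a grayscale image with some pixels set to 0 or 255.

    This imitates scanner speckle. It is an assumed example of an artifact,
    not a transformation taken from the paper.
    """
    rng = random.Random(seed)
    noisy = [row[:] for row in image]  # copy so the clean image is untouched
    for r in range(len(noisy)):
        for c in range(len(noisy[r])):
            if rng.random() < amount:
                noisy[r][c] = rng.choice((0, 255))
    return noisy


def split_into_patches(image, patch_size):
    """Cut a 2D image into non-overlapping square tiles, in row-major order."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image dimensions must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            rows = image[top:top + patch_size]
            patches.append([row[left:left + patch_size] for row in rows])
    return patches


# A blank 8x8 "synthetic" page: every pixel is white (255).
clean = [[255] * 8 for _ in range(8)]
noisy = add_salt_and_pepper(clean, amount=0.1)

small = split_into_patches(noisy, 2)  # 2x2 tiles
large = split_into_patches(noisy, 4)  # 4x4 tiles
print(len(small), len(large))  # 16 4
```

With an 8x8 image, a patch size of 2 gives 16 tiles while a patch size of 4 gives only 4, which is the sense in which a smaller patch size lets a model examine the page in finer detail.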

Keywords

» Artificial intelligence » Synthetic data
