
Summary of PEaCE: A Chemistry-Oriented Dataset for Optical Character Recognition on Scientific Documents, by Nan Zhang et al.


PEaCE: A Chemistry-Oriented Dataset for Optical Character Recognition on Scientific Documents

by Nan Zhang, Connor Heaton, Sean Timothy Okonsky, Prasenjit Mitra, Hilal Ezgi Toraman

First submitted to arXiv on: 23 Mar 2024

Categories

  • Main: Computation and Language (cs.CL)
  • Secondary: Artificial Intelligence (cs.AI)


GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
The high difficulty summary is the paper’s original abstract.

Medium Difficulty Summary (written by GrooveSquid.com, original content)
The paper presents the Printed English and Chemical Equations (PEaCE) dataset, which contains both synthetic and real-world records, for training Optical Character Recognition (OCR) models that can identify text in images of chemistry publications, including tables and chemical equations. Because real-world records contain artifacts that synthetic data lacks, the authors introduce transformations that mimic those artifacts. Experimental results show that models with small patch sizes, multi-domain training, and the proposed transformations achieve the best performance.

Low Difficulty Summary (written by GrooveSquid.com, original content)
The paper is about helping computers read text from pictures of science documents, like chemistry papers. This is important because current computer programs can’t always understand scientific equations or tables. The researchers created a special dataset with both made-up and real-world examples to help train better models. They also came up with ways to make the training data more realistic by adding fake “artifacts” like those that occur in real documents. By testing different approaches, they found that combining small image patches, multiple types of data, and these new transformations led to the best results.
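
To make two ideas from the summaries above more concrete (small image patches and transformations that imitate real-world artifacts), here is a minimal Python sketch. It is not the authors' method: the function names `num_patches` and `add_speckle`, the speckle-noise transformation, and the 32×32 blank page are all assumptions chosen purely for illustration, and the summaries do not say which transformations the paper actually uses.

```python
import random


def num_patches(height, width, patch_size):
    """Count the non-overlapping square patches an image splits into.

    Hypothetical helper: smaller patches mean more, finer-grained pieces.
    """
    if height % patch_size or width % patch_size:
        raise ValueError("image dimensions must be divisible by patch_size")
    return (height // patch_size) * (width // patch_size)


def add_speckle(image, fraction, rng):
    """Return a copy of a grayscale image (list of rows, values 0-255) in
    which a fraction of pixels is set to black or white.

    Illustrative stand-in for scan noise, not a transformation taken
    from the paper.
    """
    noisy = [row[:] for row in image]  # copy so the input stays untouched
    height, width = len(noisy), len(noisy[0])
    n_flip = int(fraction * height * width)
    coords = [(r, c) for r in range(height) for c in range(width)]
    for r, c in rng.sample(coords, n_flip):
        noisy[r][c] = rng.choice((0, 255))
    return noisy


# A blank white 32x32 "synthetic" page, then a noisier copy of it.
clean = [[255] * 32 for _ in range(32)]
noisy = add_speckle(clean, 0.05, random.Random(0))

print(num_patches(32, 32, 16), num_patches(32, 32, 4))
```

On this toy page, 16-pixel patches give 4 pieces while 4-pixel patches give 64, which hints at why a small patch size could help a model resolve fine details such as subscripts in chemical equations.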

Keywords

» Artificial intelligence  » Synthetic data