Summary of Smirk: An Atomically Complete Tokenizer For Molecular Foundation Models, by Alexius Wadell et al.


Smirk: An Atomically Complete Tokenizer for Molecular Foundation Models

by Alexius Wadell, Anoushka Bhutani, Venkatasubramanian Viswanathan

First submitted to arXiv on: 19 Sep 2024

Categories

  • Main: Machine Learning (cs.LG)
  • Secondary: Artificial Intelligence (cs.AI); Chemical Physics (physics.chem-ph); Biomolecules (q-bio.BM)


GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at a different level of difficulty. The medium-difficulty and low-difficulty versions are original summaries written by GrooveSquid.com, while the high-difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

Summary difficulty: High
Written by: Paper authors
Summary: Read the original abstract here

Summary difficulty: Medium
Written by: GrooveSquid.com (original content)
Summary: This paper evaluates thirty tokenizers, including nineteen chemistry-specific ones, for their coverage of the SMILES molecular representation language. The authors reveal significant gaps in existing tokenizers’ ability to capture molecular space. To assess the impact of tokenizer choice, they introduce n-gram language models as a low-cost proxy and validate their effectiveness by training and fine-tuning RoBERTa-style encoders for molecular property prediction. The study also proposes two new tokenizers, Smirk and Smirk-GPE, with full coverage of the OpenSMILES specification. This research highlights the need for open-vocabulary modeling and chemically diverse benchmarks in cheminformatics, facilitating applications in pharmacology, agriculture, biology, and energy storage.

Summary difficulty: Low
Written by: GrooveSquid.com (original content)
Summary: This paper looks at how well different tools can understand molecules’ structure. The current tools are limited because they only use a small part of the information available. The authors tested thirty different tools to see which ones could capture more details about molecules. They found that most tools were missing important information, and some even got stuck on certain types of molecules. To fix this problem, the researchers created two new tools called Smirk and Smirk-GPE that can understand all the details about molecules. This will help scientists make better predictions about how molecules behave and could lead to breakthroughs in fields like medicine, agriculture, and energy.
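
To make the idea of tokenizer "coverage" concrete, here is a minimal Python sketch. It is not the Smirk tokenizer or any of the thirty tokenizers evaluated in the paper. The regular expression is a simplified atom-level SMILES pattern, and `TOY_VOCAB`, `unknown_tokens`, and `coverage` are hypothetical names invented for this illustration. The sketch splits a SMILES string into tokens and reports which of them fall outside a small closed vocabulary.

```python
import re

# Simplified atom-level SMILES pattern (an assumption for illustration only).
# Bracket atoms such as [Fe+2] are kept as single tokens; the trailing "."
# alternative catches any character the earlier alternatives miss.
SMILES_PATTERN = re.compile(
    r"\[[^\]]+\]|Br|Cl|[BCNOSPFI]|[bcnops]|%\d{2}|\d|[=#$:/\\\-+.~*()]|."
)

# Hypothetical closed vocabulary, standing in for a tokenizer with a fixed token list.
TOY_VOCAB = {"C", "N", "O", "c", "n", "o", "(", ")", "=", "1", "2", "[C@@H]", "[nH]"}


def tokenize(smiles: str) -> list[str]:
    """Split a SMILES string into atom-level tokens."""
    return SMILES_PATTERN.findall(smiles)


def unknown_tokens(smiles: str, vocab: set[str]) -> list[str]:
    """Return the tokens of `smiles` that the vocabulary cannot represent."""
    return [tok for tok in tokenize(smiles) if tok not in vocab]


def coverage(corpus: list[str], vocab: set[str]) -> float:
    """Fraction of molecules in `corpus` that tokenize with no unknown tokens."""
    if not corpus:
        return 0.0
    covered = sum(1 for smiles in corpus if not unknown_tokens(smiles, vocab))
    return covered / len(corpus)


if __name__ == "__main__":
    corpus = ["CC(=O)O", "c1ccccc1", "[Fe+2]", "N[C@@H](C)C(=O)O"]
    for smiles in corpus:
        print(smiles, tokenize(smiles), unknown_tokens(smiles, TOY_VOCAB))
    print("coverage:", coverage(corpus, TOY_VOCAB))
```

In this toy corpus, the iron ion `[Fe+2]` is the only molecule the vocabulary cannot represent, because a closed vocabulary that stores bracket atoms whole fails on any bracket atom it has not seen. That is the kind of gap the summaries above describe.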

Keywords

» Artificial intelligence  » Fine-tuning  » N-gram  » Tokenizer