
Summary of Language Models and Retrieval Augmented Generation For Automated Structured Data Extraction From Diagnostic Reports, by Mohamed Sobhi Jabal et al.


Language Models and Retrieval Augmented Generation for Automated Structured Data Extraction from Diagnostic Reports

by Mohamed Sobhi Jabal, Pranav Warman, Jikai Zhang, Kartikeye Gupta, Ayush Jain, Maciej Mazurowski, Walter Wiggins, Kirti Magudia, Evan Calabrese

First submitted to arXiv on: 15 Sep 2024

Categories

  • Main: Computation and Language (cs.CL)
  • Secondary: Information Retrieval (cs.IR); Machine Learning (cs.LG)



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

Summary difficulty: High
Written by: Paper authors
High Difficulty Summary
The paper's original abstract, available on arXiv.

Summary difficulty: Medium
Written by: GrooveSquid.com (original content)
Medium Difficulty Summary
This paper proposes an automated system that extracts structured clinical information from unstructured radiology and pathology reports using open-weights language models (LMs) and retrieval augmented generation (RAG). The authors evaluate various LMs and RAG configurations to measure how model size, quantization, prompting strategy, output formatting, and inference parameters affect extraction performance. They use two datasets: 7,294 radiology reports annotated for Brain Tumor Reporting and Data System (BT-RADS) scores and 2,154 pathology reports annotated for isocitrate dehydrogenase (IDH) mutation status. Larger, newer, and domain fine-tuned models consistently outperform older and smaller ones, with the best-performing models achieving over 98% accuracy on BT-RADS score extraction and over 90% on IDH mutation status extraction. The authors conclude that open LMs show significant potential for automated extraction of structured clinical data from unstructured reports, and that they can be run locally to preserve privacy.

Summary difficulty: Low
Written by: GrooveSquid.com (original content)
Low Difficulty Summary
This paper describes a system that automatically pulls important information out of free-text medical reports. The authors tested different language models and found that bigger, newer models, and models fine-tuned for the medical domain, work better than older and smaller ones. The best models extracted brain tumor scores (BT-RADS) and a gene mutation status (IDH) with high accuracy. This technology could be used in hospitals to help doctors and researchers quickly get the information they need.
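
To make the pipeline in the medium difficulty summary more concrete, here is a minimal Python sketch of its three stages: retrieve the most relevant sentences from a report, ask a model for a structured answer, and score the parsed answers for accuracy. It is not the authors' implementation. The function names (`retrieve`, `build_prompt`, `parse_score`, `accuracy`), the `bt_rads` JSON key, the word-overlap retriever, and the `ALLOWED_SCORES` label set are all assumptions made for illustration, and the call to an actual language model is left out.

```python
import json
import re

# Assumed label set for illustration only; check the BT-RADS definition before reuse.
ALLOWED_SCORES = {"0", "1a", "1b", "2", "3a", "3b", "3c", "4"}


def split_sentences(report: str) -> list[str]:
    """Split a report into sentences on '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", report.strip())
    return [p for p in parts if p]


def retrieve(report: str, query_terms: set[str], k: int = 2) -> list[str]:
    """Return the k sentences sharing the most words with query_terms, in report order.

    A simple word-overlap retriever stands in for the RAG retrieval step.
    """
    sentences = split_sentences(report)
    scored = []
    for idx, sent in enumerate(sentences):
        words = set(re.findall(r"[a-z0-9-]+", sent.lower()))
        scored.append((len(words & query_terms), idx))
    # Highest overlap first; ties go to the earlier sentence.
    top = sorted(scored, key=lambda t: (-t[0], t[1]))[:k]
    return [sentences[idx] for _, idx in sorted(top, key=lambda t: t[1])]


def build_prompt(context: list[str]) -> str:
    """Build a prompt that asks for a JSON answer (hypothetical wording)."""
    joined = "\n".join(context)
    return (
        "Report excerpt:\n" + joined + "\n\n"
        'Reply with JSON only, for example {"bt_rads": "2"}.'
    )


def parse_score(raw: str):
    """Parse the model output; return the score string, or None if it is unusable."""
    try:
        value = json.loads(raw).get("bt_rads")
    except (json.JSONDecodeError, AttributeError):
        return None
    score = str(value).lower()
    return score if score in ALLOWED_SCORES else None


def accuracy(preds: list, labels: list) -> float:
    """Fraction of predictions that exactly match the annotated labels."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)
```

In use, the text returned by `build_prompt(retrieve(report, {"bt-rads", "impression"}))` would be sent to a locally hosted model, and the model's reply passed through `parse_score` before being compared against the annotations with `accuracy`. Returning `None` for malformed output means formatting failures count as errors rather than being silently dropped.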

Keywords

» Artificial intelligence  » Inference  » Prompting  » Quantization  » RAG  » Retrieval augmented generation