Summary of The Promise and Challenges of Using LLMs to Accelerate the Screening Process of Systematic Reviews, by Aleksi Huotala et al.
The Promise and Challenges of Using LLMs to Accelerate the Screening Process of Systematic Reviews
by Aleksi Huotala, Miikka Kuutila, Paul Ralph, Mika Mäntylä
First submitted to arXiv on: 24 Apr 2024
Categories
- Main: Computation and Language (cs.CL)
- Secondary: Artificial Intelligence (cs.AI)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. The summaries below all cover the same AI paper but are written at different levels of difficulty. The medium-difficulty and low-difficulty versions are original summaries written by GrooveSquid.com, while the high-difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary |
---|---|---|
High | Paper authors | Read the original abstract here |
Medium | GrooveSquid.com (original content) | A systematic review (SR) is a common method in software engineering (SE), but conducting an SR takes around 67 weeks on average. Researchers are therefore exploring ways to automate parts of the SR process and reduce the effort required. This study investigates whether Large Language Models (LLMs) can accelerate title-abstract screening, both by simplifying abstracts for human screeners and by automating the screening task itself. In the experiment, humans screened the titles and abstracts of 20 papers, presented with both original and simplified abstracts, while LLMs such as GPT-3.5 and GPT-4 performed the same task. The study also examined how different prompting techniques affect LLM screening performance (a minimal prompting sketch follows this table). Surprisingly, simplifying abstracts did not improve human screeners’ performance, though it did reduce the time they spent on screening; instead, screeners’ scientific literacy skills and researcher status predicted their performance. Some LLM and prompt combinations performed on par with human screeners, GPT-4 outperformed its predecessor GPT-3.5, and few-shot and one-shot prompting were more effective than zero-shot prompting. While automating title-abstract screening with LLMs shows promise, current models are not significantly more accurate than human screeners, and more research is needed before LLMs can be recommended for SR screening. |
Low | GrooveSquid.com (original content) | Large Language Models (LLMs) might help speed up the screening of papers for systematic reviews. Researchers wanted to see if these AI models could simplify abstracts for humans, or even do the screening themselves. They tested how well humans and LLMs did on a set of 20 papers with both original and simplified abstracts. The results showed that simplifying abstracts didn’t make it easier for humans to screen them, but it did save them time. What mattered most was the reviewer’s background in science and their research experience. Some AI models performed as well as humans in the screening tasks. Overall, the study suggests that while AI has potential in this area, more work is needed before we can trust these machines to do our jobs for us. |
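The medium summary mentions zero-shot, one-shot, and few-shot prompting for the screening task. As a rough illustration of what that means in practice, here is a minimal sketch of an include/exclude screening call, assuming the OpenAI Python client; the criteria text, prompt wording, example papers, and the `screen` helper are illustrative assumptions, not the study’s actual materials.

```python
# Minimal sketch of zero-/one-/few-shot title-abstract screening,
# assuming the OpenAI Python client (pip install openai).
# The criteria, prompts, and examples below are hypothetical,
# not the paper's exact prompts.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

CRITERIA = "Include the paper only if it reports an empirical study in software engineering."

# Labeled examples: one-shot uses the first, few-shot uses all of them.
EXAMPLES = [
    ("A Case Study of Agile Adoption", "We report an industrial case study ...", "INCLUDE"),
    ("Opinion: The Future of Testing", "This opinion piece argues that ...", "EXCLUDE"),
]

def screen(title: str, abstract: str, model: str = "gpt-4", shots: int = 0) -> str:
    """Return the model's INCLUDE/EXCLUDE decision for one paper."""
    messages = [{
        "role": "system",
        "content": f"You screen papers for a systematic review. "
                   f"Criteria: {CRITERIA} Answer INCLUDE or EXCLUDE.",
    }]
    # Prepend shots labeled examples as prior user/assistant turns.
    for ex_title, ex_abstract, label in EXAMPLES[:shots]:
        messages.append({"role": "user",
                         "content": f"Title: {ex_title}\nAbstract: {ex_abstract}"})
        messages.append({"role": "assistant", "content": label})
    messages.append({"role": "user",
                     "content": f"Title: {title}\nAbstract: {abstract}"})
    response = client.chat.completions.create(
        model=model, messages=messages, temperature=0)
    return response.choices[0].message.content.strip()

# shots=0 -> zero-shot, shots=1 -> one-shot, shots=2 -> few-shot
print(screen("Some candidate paper", "Its abstract ...", shots=2))
```

Under this framing, the study’s comparison of prompting techniques amounts to varying the number of labeled examples in the conversation (the `shots` parameter above) while keeping the screening question fixed.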
Keywords
» Artificial intelligence » Few-shot » GPT » One-shot » Prompt » Prompting » Zero-shot