Summary of Hypertext Entity Extraction in Webpage, by Yifei Yang et al.


Hypertext Entity Extraction in Webpage

by Yifei Yang, Tianqiao Liu, Bo Shao, Hai Zhao, Linjun Shou, Ming Gong, Daxin Jiang

First submitted to arXiv on: 4 Mar 2024

Categories

  • Main: Computation and Language (cs.CL)
  • Secondary: Artificial Intelligence (cs.AI)


GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper at a different level of difficulty. The medium-difficulty and low-difficulty versions are original summaries written by GrooveSquid.com, while the high-difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by: Paper authors)
Read the original abstract here

Medium Difficulty Summary (written by: GrooveSquid.com, original content)
The paper presents a novel approach to webpage entity extraction by introducing the Hypertext Entity Extraction Dataset (HEED) and the MoE-based Entity Extraction Framework (MoEEF). HEED is a structured dataset that retains textual content and its structure information, as well as rich hypertext features such as font color and size. The authors train models on this dataset and achieve state-of-the-art results with their proposed framework, MoEEF, which integrates multiple features using Mixture of Experts. They also analyze the effectiveness of the hypertext features in HEED and of the individual components of MoEEF.

Low Difficulty Summary (written by: GrooveSquid.com, original content)
Webpage entity extraction is a crucial task that helps computers understand the content of websites. Current models are trained on datasets that include only text, but this paper shows that adding information such as font color and size can improve performance. The authors create a new dataset called HEED and develop a framework called MoEEF to extract entities from webpages. Their approach outperforms other methods and demonstrates the importance of considering hypertext features in webpage entity extraction.
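
As a rough illustration of the Mixture of Experts idea mentioned in the medium-difficulty summary, the sketch below mixes the outputs of several small "experts" using softmax gate weights computed from one token's combined features. It is not the paper's MoEEF architecture: the class name, feature values, dimensions, and random weights are all hypothetical, chosen only to show the general technique in plain Python.

```python
import math
import random


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def linear(weights, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]


def random_matrix(rows, cols, rng):
    """Random weights standing in for parameters a real model would learn."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


class MixtureOfExperts:
    """Hypothetical dense MoE layer: every expert runs, the gate mixes them."""

    def __init__(self, input_dim, output_dim, num_experts, seed=0):
        rng = random.Random(seed)
        self.experts = [
            random_matrix(output_dim, input_dim, rng) for _ in range(num_experts)
        ]
        self.gate = random_matrix(num_experts, input_dim, rng)

    def forward(self, token_features):
        # One gate weight per expert, computed from the token's features.
        gate_weights = softmax(linear(self.gate, token_features))
        expert_outputs = [linear(expert, token_features) for expert in self.experts]
        output_dim = len(expert_outputs[0])
        # Weighted sum of the expert outputs, one output dimension at a time.
        mixed = [
            sum(g * out[i] for g, out in zip(gate_weights, expert_outputs))
            for i in range(output_dim)
        ]
        return mixed, gate_weights


# Assumed, made-up features for a single token on a webpage.
text_embedding = [0.2, -0.1, 0.4]   # stand-in for a text encoder output
structure_features = [1.0, 0.0]     # e.g. inside a heading vs. body text
hypertext_features = [0.8, 0.3]     # e.g. normalized font size, color flag
token_features = text_embedding + structure_features + hypertext_features

moe = MixtureOfExperts(input_dim=len(token_features), output_dim=3, num_experts=4)
label_scores, gate_weights = moe.forward(token_features)
print("gate weights:", gate_weights)
print("label scores:", label_scores)
```

Each gate weight indicates how much one expert contributes for this token. In a trained model the gate and the experts would be learned rather than random.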

Keywords

» Artificial intelligence  » Mixture of experts