KnowCoder: Coding Structured Knowledge into LLMs for Universal Information Extraction
by Zixuan Li, Yutao Zeng, Yuxin Zuo, Weicheng Ren, Wenxuan Liu, Miao Su, Yucan Guo, Yantao Liu, Xiang Li, Zhilei Hu, Long Bai, Wei Li, Yidan Liu, Pan Yang, Xiaolong Jin, Jiafeng Guo, Xueqi Cheng
First submitted to arXiv on: 12 Mar 2024
Categories
- Main: Machine Learning (cs.LG)
- Secondary: Artificial Intelligence (cs.AI)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper at a different level of difficulty. The medium- and low-difficulty versions are original summaries written by GrooveSquid.com, while the high-difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary
---|---|---
High | Paper authors | Read the original abstract here
Medium | GrooveSquid.com (original content) | This paper proposes KnowCoder, a Large Language Model (LLM) that performs Universal Information Extraction (UIE) via code generation. It aims to develop a unified schema representation that LLMs can understand, together with an effective learning framework for accurately extracting structured knowledge. To this end, KnowCoder introduces a code-style schema representation method that transforms different schemas into Python classes, capturing complex schema information such as constraints among the tasks in UIE. The authors also construct a schema library covering over 30,000 types of knowledge, the largest for UIE to date, and develop a two-phase learning framework that enhances schema understanding via code pretraining and schema following via instruction tuning. After code pretraining on around 1.5 billion automatically constructed data, KnowCoder achieves remarkable generalization ability, outperforming LLaMA2 by a relative 49.8% F1 under the few-shot setting. After instruction tuning, KnowCoder exhibits strong generalization on unseen schemas, surpassing state-of-the-art baselines by 12.5% and 21.9% under the zero-shot and low-resource settings, respectively. The unified schema representations also allow various human-annotated datasets to be used to refine KnowCoder, yielding improvements of up to 7.5% under the supervised setting.
Low | GrooveSquid.com (original content) | This paper introduces a new language model called KnowCoder that can understand different types of information and extract structured knowledge. It does this by representing that information as code, which makes it easier for the model to learn and use. The researchers also build a large library of over 30,000 types of knowledge, the largest of its kind. The model is trained on a huge amount of data, generalizes well to new situations, outperforms other models in several settings, and can be improved further with human-annotated datasets. This technology could make it easier for computers to understand and work with different types of information, opening the door to many new applications.
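To make the "code-style schema representation" concrete, here is a minimal, hypothetical sketch of what representing an extraction schema as Python classes might look like. The class and field names below are illustrative inventions, not the paper's actual ontology; the idea is that the LLM emits ordinary constructor calls as its extraction output.

```python
# Hypothetical sketch: an information-extraction schema encoded as Python
# classes, in the spirit of KnowCoder's code-style schema representation.
# Names (Entity, Person, Organization, Employment) are illustrative only.

class Entity:
    """Base class for all entity types in the schema."""
    def __init__(self, name: str):
        self.name = name

class Person(Entity):
    """A person mentioned in the text."""

class Organization(Entity):
    """A company, agency, or institution mentioned in the text."""

class Employment:
    """An 'employed-by' relation; type hints encode argument constraints."""
    def __init__(self, employee: Person, employer: Organization):
        self.employee = employee
        self.employer = employer

# Extraction from "Tim Cook is the CEO of Apple." could then be emitted
# by the model as plain constructor calls:
result = Employment(employee=Person("Tim Cook"),
                    employer=Organization("Apple"))
```

Because the schema is just code, constraints (here, that an `Employment` relation links a `Person` to an `Organization`) ride along in the type hints and class hierarchy, which is what lets a code-pretrained LLM pick them up naturally.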
Keywords
- Artificial intelligence
- Few shot
- Generalization
- Instruction tuning
- Language model
- Large language model
- Pretraining
- Supervised
- Zero shot