Advances in Information Extraction: From PLM to LLM Paradigms at Alibaba DAMO Academy
This article reviews Alibaba DAMO Academy's research on information extraction, covering background concepts, PLM-era extraction paradigms, few‑shot extraction techniques, and the emerging LLM‑era approaches, while also sharing practical insights, benchmark results, and future directions.
Background
Information extraction (IE) is a classic NLP task that includes sub‑tasks such as entity extraction, fine‑grained entity classification, entity linking, relation extraction, and event extraction. It is widely applied in consumer‑facing (C‑end), business (B‑end), and government (G‑end) scenarios, ranging from smart courier address forms to medical text processing.
PLM Era Information Extraction Paradigm
The PLM era focuses on improving model performance through advanced algorithms and retrieval‑augmented techniques. Major innovations include implicit enhancement, retrieval enhancement for short texts, and multimodal extensions. A typical pipeline models IE as a sequence labeling task using a Transformer‑CRF architecture, with experiments reporting consistent gains across standard benchmarks.
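To make the sequence-labeling pipeline concrete, here is a minimal NumPy sketch of CRF Viterbi decoding over per-token emission scores (the kind a Transformer encoder would produce) and label-transition scores. The label set, scores, and transition constraints below are made up for illustration, not taken from the talk.

```python
import numpy as np

def viterbi_decode(emissions, transitions):
    """Find the highest-scoring label sequence.

    emissions:   (T, K) per-token label scores from the encoder
    transitions: (K, K) CRF transition scores, transitions[i, j] = score of i -> j
    """
    T, K = emissions.shape
    score = emissions[0].copy()            # best score ending in each label at step 0
    backptr = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        # candidate[i, j] = best score ending at label i, then moving to label j
        candidate = score[:, None] + transitions + emissions[t][None, :]
        backptr[t] = candidate.argmax(axis=0)
        score = candidate.max(axis=0)
    # follow back-pointers from the best final label
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(backptr[t, path[-1]]))
    return path[::-1]

# Toy example with labels O=0, B-ENT=1, I-ENT=2: the transition matrix
# heavily penalizes O -> I-ENT, so an entity must open with B-ENT.
emissions = np.array([[2.0, 1.5, 0.0],
                      [0.0, 1.0, 1.2],
                      [0.5, 0.0, 2.0]])
transitions = np.array([[0.0, 0.0, -10.0],
                        [0.0, 0.0, 0.5],
                        [0.0, 0.0, 0.5]])
print(viterbi_decode(emissions, transitions))  # [1, 2, 2] -> B-ENT I-ENT I-ENT
```

The CRF layer is what lets the model enforce label-scheme constraints (such as "no I without a preceding B") that per-token classification alone cannot.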
Embedding selection (e.g., BERT vs. FLAIR) influences task performance, leading to the ACE (Automatic Concatenation of Embeddings) paradigm that automatically chooses suitable embeddings via a controller‑task model framework.
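The core idea of ACE can be sketched as follows: a controller proposes a binary mask over candidate embedders, and the task model consumes the concatenation of the selected embeddings; masks whose downstream accuracy beats a running baseline are reinforced. The embedder names and random features below are stand-ins, not the actual ACE implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical candidate embedders: each maps a sentence of T tokens to (T, d_i).
# In ACE these would be BERT, FLAIR, ELMo, etc.; random matrices stand in here.
embedders = {
    "bert":  lambda T: rng.normal(size=(T, 8)),
    "flair": lambda T: rng.normal(size=(T, 4)),
    "elmo":  lambda T: rng.normal(size=(T, 6)),
}

def concatenate_selected(T, mask):
    """Concatenate the embeddings whose mask bit is 1 (the controller's choice)."""
    chosen = [fn(T) for name, fn in embedders.items() if mask[name]]
    return np.concatenate(chosen, axis=-1)

# The controller samples masks, trains the task model on each concatenation,
# and keeps the selections that improve dev-set performance.
mask = {"bert": 1, "flair": 0, "elmo": 1}
feats = concatenate_selected(5, mask)
print(feats.shape)  # (5, 14): 8 (bert) + 6 (elmo)
```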
Few‑Shot Information Extraction Research
To reduce costly annotation, the team proposes graph propagation for label transfer, a Partial‑CRF method for learning from incomplete label distributions, and a "memory" mechanism that stores source‑model entity representations for optimal‑transport‑based retrieval, achieving state‑of‑the‑art few‑shot results (published at ACL 2023).
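The memory-retrieval idea can be sketched with entropy-regularized optimal transport (Sinkhorn iterations) matching target-entity queries against stored source representations. The vectors, cosine cost, and hyperparameters below are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def sinkhorn(cost, reg=0.1, n_iters=200):
    """Entropy-regularized OT plan between uniform marginals (Sinkhorn iterations)."""
    n, m = cost.shape
    K = np.exp(-cost / reg)
    a, b = np.ones(n) / n, np.ones(m) / m   # uniform source/target marginals
    u, v = a.copy(), b.copy()
    for _ in range(n_iters):
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]      # transport plan

# Toy memory of 3 stored source-entity vectors and 2 target-entity queries.
memory = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
queries = np.array([[0.9, 0.1], [0.1, 0.9]])
cost = 1.0 - queries @ memory.T / (
    np.linalg.norm(queries, axis=1, keepdims=True) * np.linalg.norm(memory, axis=1))
plan = sinkhorn(cost)
# Each query's transport mass concentrates on its nearest stored representation.
print(plan.argmax(axis=1))  # [0 1]
```

Unlike independent nearest-neighbor lookups, the transport plan's marginal constraints force queries to spread over the memory, which discourages many queries collapsing onto one stored entity.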
LLM Era Information Extraction Paradigm
With large‑scale models (e.g., GPT‑3/4), two directions are explored: (1) prompt engineering and multi‑turn dialogue pipelines (ChatIE) to decompose IE tasks, and (2) training task‑specific LLMs on millions of annotated examples, unifying various IE subtasks and achieving superior performance.
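A minimal sketch of the multi-turn decomposition idea behind ChatIE: turn 1 asks the model which relation types from the schema occur, turn 2 asks for argument tuples per detected type. The schema, templates, and stub `llm` function are illustrative stand-ins, not the published ChatIE prompts or a real model call.

```python
# Hypothetical relation schema for the toy example.
SCHEMA = ["founded_by", "located_in", "works_for"]

def stage1_prompt(text):
    return (
        f'Given the sentence: "{text}"\n'
        f"Which of the following relation types appear? {SCHEMA}\n"
        "Answer with a comma-separated list."
    )

def stage2_prompt(text, relation):
    return (
        f'Given the sentence: "{text}"\n'
        f"List all (head entity, tail entity) pairs for the relation "
        f"'{relation}', one pair per line."
    )

def chat_ie(text, llm):
    """Two-turn pipeline: detect relation types, then extract arguments per type."""
    detected = [r.strip() for r in llm(stage1_prompt(text)).split(",") if r.strip()]
    return {r: llm(stage2_prompt(text, r)) for r in detected if r in SCHEMA}

# Stub LLM for demonstration; a real system would call a chat-model API here.
def fake_llm(prompt):
    if "Which of the following" in prompt:
        return "founded_by"
    return "(Alibaba, Jack Ma)"

print(chat_ie("Jack Ma founded Alibaba in Hangzhou.", fake_llm))
```

Decomposing the task this way keeps each turn's instruction simple, at the cost of one model call per detected relation type.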
Q&A
Q1: How to filter noise in multimodal image‑text IE? – Use multi‑view learning and KL‑divergence soft‑label alignment.
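The soft-label alignment in Q1 can be sketched as a KL divergence between the two modalities' softened predictions, with one view acting as teacher; the logits, temperature, and KL direction below are illustrative assumptions, not the team's exact training objective.

```python
import numpy as np

def softmax(x, temp=1.0):
    z = np.exp((x - x.max(axis=-1, keepdims=True)) / temp)
    return z / z.sum(axis=-1, keepdims=True)

def kl_alignment_loss(text_logits, image_logits, temp=2.0):
    """KL(text-view soft labels || image-view predictions), averaged over examples.

    Pushing the image view toward the text view's softened distribution
    dampens label noise that only one modality sees.
    """
    p = softmax(text_logits, temp)   # teacher soft labels
    q = softmax(image_logits, temp)  # student predictions
    return float(np.mean(np.sum(p * (np.log(p) - np.log(q)), axis=-1)))

# Two views' logits over 3 entity types for 2 examples (made-up numbers).
text_logits  = np.array([[2.0, 0.5, 0.1], [0.2, 1.8, 0.0]])
image_logits = np.array([[1.9, 0.6, 0.2], [0.1, 1.7, 0.1]])
loss = kl_alignment_loss(text_logits, image_logits)
print(loss)  # small, since the two views nearly agree
```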
Q2: How to handle overly long retrieval contexts? – Encode each retrieved document into vectors and apply cross‑attention between BERT tokens and retrieval vectors.
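The mechanism in Q2 can be sketched as follows: each retrieved document is encoded to one pooled vector, and the BERT token states cross-attend over those vectors instead of over the full retrieved text, keeping the sequence length bounded. Shapes, random weights, and dimensions below are illustrative.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max(axis=-1, keepdims=True))
    return z / z.sum(axis=-1, keepdims=True)

def cross_attention(token_states, retrieval_vecs, Wq, Wk, Wv):
    """Each token attends over the pooled retrieval vectors.

    token_states:   (T, d) contextual token representations
    retrieval_vecs: (R, d) one pooled vector per retrieved document
    """
    Q = token_states @ Wq                            # (T, d) queries from tokens
    K = retrieval_vecs @ Wk                          # (R, d) keys from documents
    V = retrieval_vecs @ Wv                          # (R, d) values from documents
    attn = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))   # (T, R) attention weights
    return attn @ V                                  # (T, d) retrieval-aware features

rng = np.random.default_rng(0)
d, T, R = 16, 6, 4                                   # hidden size, tokens, documents
tokens = rng.normal(size=(T, d))
docs = rng.normal(size=(R, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
out = cross_attention(tokens, docs, Wq, Wk, Wv)
print(out.shape)  # (6, 16): one retrieval-aware vector per token
```

Attention cost here scales with the number of retrieved documents R rather than their total token length, which is the point of pooling each document first.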
Q3: Can IE and structuring boost general pre‑training? – Yes, by converting text to knowledge graphs or using retrieval‑augmented pre‑training, albeit with higher compute cost.
Conclusion and Outlook
The talk summarizes three themes: (1) PLM‑era algorithmic advances with retrieval‑enhancement, (2) few‑shot IE via data augmentation and model knowledge reuse, and (3) LLM‑era efficient prompting and task‑specific model construction. The speaker emphasizes that IE will remain valuable for end‑to‑end tasks requiring speed, interpretability, and controllability, even as large models evolve.
Resources: https://github.com/Alibaba-NLP/SeqGPT and ModelScope model https://www.modelscope.cn/models/damo/nlp_seqgpt-560m/ .
DataFunSummit
Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.