
Ant Group Contributions to ACL 2024: Summaries of 14 Accepted Papers Across NLP and AI

From August 11 to 16, 2024, the ACL conference in Bangkok featured 14 Ant Group papers covering large-scale information extraction, decomposed LLMs for semantic search, multimodal hallucination detection, long-context attention mechanisms, concept-reasoning datasets, knowledge-graph alignment, and more, highlighting the group's breadth in natural language processing and AI research.


The 62nd Annual Meeting of the Association for Computational Linguistics (ACL 2024) took place in Bangkok from August 11 to 16, gathering the world's leading research in natural language processing (NLP) and artificial intelligence (AI). Ant Group had 14 papers accepted, five of which were selected for the main conference track.

IEPile: Unearthing Large-Scale Schema-Based Information Extraction Corpus
Link: https://arxiv.org/pdf/2402.14710
Source: Ant Group Joint Lab
Fields: LLM, Information Extraction, NLP
Abstract: Large language models (LLMs) excel at many tasks but lag in information extraction (IE). Existing IE datasets are small and fragmented and lack a unified schema. IEPile builds a bilingual (Chinese-English) IE instruction corpus of roughly 0.32B tokens by aggregating 33 public IE datasets and generating schema-guided instructions. Experiments on LLaMA and Baichuan show significant gains, especially in zero-shot IE, and the corpus and pretrained models are released to the community.
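To make "schema-guided instructions" concrete, here is a minimal sketch of how a sentence plus an extraction schema can be turned into an instruction prompt. The field names, template wording, and example sentence are illustrative assumptions, not IEPile's actual templates.

```python
# Hypothetical sketch of schema-conditioned instruction construction,
# in the spirit of IEPile's schema-based prompts.

def build_ie_instruction(text, schema):
    """Turn a sentence and an extraction schema into an instruction prompt."""
    types = ", ".join(schema)
    return (
        "You are an information extraction assistant.\n"
        f"Extract all entities of the following types: {types}.\n"
        f"Text: {text}\n"
        "Answer as a list of (entity, type) pairs."
    )

prompt = build_ie_instruction(
    "Ant Group presented 14 papers at ACL 2024 in Bangkok.",
    ["organization", "event", "location"],
)
```

Conditioning the instruction on an explicit schema is what lets one corpus unify heterogeneous IE datasets: the same template covers any dataset once its label set is expressed as a type list.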

D2LLM: Decomposed and Distilled Large Language Models for Semantic Search
Link: http://arxiv.org/abs/2406.17262
Source: Ant Group research intern
Fields: Large Language Models, Semantic Search, Knowledge Distillation
Abstract: Semantic search requires both efficiency and fine-grained semantic matching. D2LLM decomposes a cross-encoder into an efficient dual-encoder enhanced with multi-head attention pooling and introduces an interaction-simulation module to capture nuanced semantics. Knowledge distillation from a full LLM further improves performance, surpassing five strong baselines on three benchmarks, with a notable 6.45% improvement on NLI tasks.
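The core trade-off here is that a cross-encoder scores each (query, document) pair jointly and accurately but slowly, while a dual-encoder scores pairs with a cheap dot product of precomputed embeddings. A toy sketch of the distillation idea, with random vectors standing in for real encoders and invented teacher logits, assuming a KL objective over the candidate set:

```python
import numpy as np

# Toy cross-encoder -> dual-encoder distillation sketch (not D2LLM itself).
# The student scores pairs by a dot product of pooled embeddings and is
# trained to match the teacher's soft pairwise scores.

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Student: a dual encoder produces one vector per text; random features
# stand in for a real encoder here.
query_emb = rng.normal(size=(4,))        # one query
doc_embs = rng.normal(size=(3, 4))       # three candidate documents
student_scores = doc_embs @ query_emb    # efficient dot-product scoring

# Teacher: a cross-encoder would score each (query, doc) pair jointly;
# here we simply posit its output logits.
teacher_scores = np.array([2.0, 0.5, -1.0])

# Distillation objective: KL(teacher || student) over the candidate set.
p, q = softmax(teacher_scores), softmax(student_scores)
kl = float(np.sum(p * np.log(p / q)))
```

Minimizing this KL term pushes the cheap student scorer toward the teacher's ranking while keeping document embeddings precomputable, which is what makes the decomposed model practical for search.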

Unified Hallucination Detection for Multimodal Large Language Models (UNIHD)
Link: https://arxiv.org/abs/2402.03190
Source: Ant Group Joint Lab
Fields: Multimodal LLMs, Hallucination Detection, Evaluation
Abstract: Multimodal LLMs (MLLMs) achieve impressive capabilities but frequently generate hallucinations, outputs that contradict the input data or world knowledge. UNIHD proposes a tool-enhanced detection framework that leverages auxiliary verification tools and introduces the MHaluBench benchmark for systematic evaluation. Extensive experiments demonstrate UNIHD's effectiveness across diverse hallucination categories.

CoCA: Fusing Position Embedding with Collinear Constrained Attention in Transformers for Long Context Window Extending
Link: https://arxiv.org/abs/2309.08646
Source: Ant Group research intern
Fields: Large Models, Position Embedding, Attention Mechanism
Abstract: Existing rotary position embeddings (RoPE) and self-attention interact poorly for long-context extrapolation. CoCA introduces a collinear-constrained attention that aligns Q and K vectors with RoPE, adding negligible overhead while extending effective context windows up to 32K tokens. Experiments on GPT-style and LLaMA models confirm substantial gains without fine-tuning.
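For readers unfamiliar with RoPE, the property CoCA builds on is that rotating Q and K by position-dependent angles makes their inner product depend only on the relative offset between positions, not on absolute positions. A minimal NumPy sketch of that property (this is plain RoPE, not CoCA's collinear constraint; dimensions and base are illustrative):

```python
import numpy as np

# Minimal rotary position embedding (RoPE) sketch: after rotation, the
# Q.K inner product depends only on the relative position n - m.

def rope(x, pos, base=10000.0):
    """Apply a rotary embedding to vector x at position pos (even dim)."""
    d = x.shape[-1]
    half = d // 2
    freqs = base ** (-np.arange(half) / half)   # per-pair rotation frequencies
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:half], x[half:]
    # Rotate each (x1[i], x2[i]) pair by its angle.
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos])

rng = np.random.default_rng(0)
q, k = rng.normal(size=4), rng.normal(size=4)

s1 = rope(q, 3) @ rope(k, 7)     # positions 3 and 7: relative offset 4
s2 = rope(q, 10) @ rope(k, 14)   # positions 10 and 14: same offset 4
# s1 and s2 are equal up to floating-point error.
```

CoCA's contribution, per the abstract, is constraining Q and K so this rotation interacts cleanly with attention at positions far beyond the training length.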

Generative Pretrained Structured Transformers (GPST)
Link: https://arxiv.org/abs/2403.08293
Source: CCF-Ant Research Fund
Fields: AI, NLP, Generative Language Models
Abstract: GPST presents an unsupervised syntactic language model that reconstructs hierarchical linguistic structures (character → word → phrase → sentence) while retaining Transformer expressiveness. By integrating a log N complexity compositional model (R2D2) and a VAE-based synonym generator, GPST scales to 10B tokens and outperforms GPT-2 on downstream understanding, summarization, and syntactic generalization tasks.

CR-LLM: A Dataset and Optimization for Concept Reasoning of Large Language Models
Link: https://github.com/Nianqi-Li/Concept-Reasoning-for-LLMs
Source: Ant Group research intern
Fields: Knowledge Reasoning
Abstract: Concept reasoning requires models to infer new entities from context, yet existing datasets suffer from knowledge and context leakage. CR-LLM introduces a leakage-free dataset covering eight concept types and a hybrid reasoning framework combining inductive, deductive, and controller modules, achieving a 7% accuracy improvement over chain-of-thought baselines.

Efficient Knowledge Infusion via KG-LLM Alignment
Link: https://arxiv.org/abs/2406.03746
Source: Ant Group (independent)
Fields: Knowledge-Enhanced LLMs, Retrieval-Augmented Generation, Knowledge Graphs
Abstract: To address domain-specific knowledge gaps in LLMs, the authors construct a domain-specific knowledge graph from limited annotations and large corpora, then propose a three-stage KG-LLM alignment strategy. Experiments on biomedical QA datasets show superior performance over existing baselines in few-shot settings.
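The general pattern behind knowledge-graph-grounded QA is to retrieve triples mentioning the entities in a question and inline them as context before generation. A toy sketch of that retrieval-and-prompt step, with an invented mini-KG and template (not the paper's actual alignment pipeline, which involves three training stages):

```python
# Hypothetical KG-grounded prompting sketch: look up triples whose subject
# or object appears in the question, then inline them as facts.

kg = [
    ("aspirin", "treats", "headache"),
    ("aspirin", "interacts_with", "warfarin"),
    ("ibuprofen", "treats", "inflammation"),
]

def retrieve_triples(question, graph):
    """Return triples whose subject or object is mentioned in the question."""
    q = question.lower()
    return [t for t in graph if t[0] in q or t[2] in q]

def build_prompt(question, graph):
    facts = "\n".join(
        f"- {s} {r} {o}" for s, r, o in retrieve_triples(question, graph)
    )
    return (
        f"Known facts:\n{facts}\n"
        f"Question: {question}\n"
        "Answer using only the facts above."
    )

prompt = build_prompt("Does aspirin interact with warfarin?", kg)
```

Grounding answers in explicitly retrieved triples is what lets a small amount of curated domain knowledge compensate for gaps in the LLM's parametric knowledge.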

HOTVCOM: Generating Buzzworthy Comments for Videos
Link: N/A (internal dataset)
Source: Ant Group research intern
Fields: Multimodal LLM, Video Comment Generation
Abstract: HOTVCOM releases the largest Chinese video comment dataset (94k videos, 1.37B comments). The proposed ComHeat framework fuses visual, auditory, and textual cues to generate influential comments, achieving state-of-the-art results on both the new dataset and existing benchmarks.

Context-Aware Tracking and Dynamic Introduction for Incomplete Utterance Rewriting
Link: https://openreview.net/pdf?id=jrIqqu3Wbu
Source: Ant Group (independent)
Fields: Large Models, NLP, Incomplete Utterance Rewriting
Abstract: The CAT method tackles incomplete utterance rewriting in long multi-turn dialogues. A tracker distilled from GPT-4-turbo dynamically updates a key-phrase list, and a context-introduction module filters out irrelevant history, enabling efficient rewriting with T5-base models and achieving state-of-the-art results on three datasets.

Are U a Joke Master? Pun Generation via Multi-Stage Curriculum Learning
Link: https://github.com/cubenlp/PGCL/blob/main/PunGeneration.pdf
Source: Ant Group directed collaboration
Fields: Preference Alignment, Humor Generation
Abstract: A multi-stage curriculum learning framework (PGCL) equips LLMs with humor generation abilities by aligning structural and humor preferences through a triplet-based DPO loss, substantially improving pun generation quality on ChinesePun and SemEval benchmarks.

VAEGPT-Sim: Improving Sentence Representation with Limited Corpus Using Gradually-Denoising VAE
Link: https://openreview.net/pdf?id=6MWQHxWNMS
Source: Ant Group (independent)
Fields: Text Embedding, Retrieval, Few-Shot Training
Abstract: The Generate-CSE framework augments unsupervised sentence representation learning with a VAE-based synonym generator (VAEGPT-Sim). By randomly applying perturbations (shuffle, delete, repeat, synonym generation) to create diverse positive pairs, the model achieves superior performance in low-resource domains compared with existing methods.

CharPoet: A Chinese Classical Poetry Generation System Based on Token-free LLM
Link: https://arxiv.org/abs/2401.03512
Source: Ant Group (independent)
Fields: Large Models, AIGC
Abstract: CharPoet introduces a token-free LLM that generates Chinese classical poetry character by character, enabling precise control over length and format. The system outperforms token-based baselines (e.g., Jiuge-GPT-2, GPT-4) with over 96% format accuracy and superior relevance in content quality.
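Why does character-by-character generation help with format? Because each output step emits exactly one Chinese character, constraints like "four lines of five characters" can be validated (or enforced during decoding) exactly, whereas subword tokens may span character boundaries. A small sketch of such a format check, assuming a wujue-style quatrain; this is illustrative, not CharPoet's actual decoder:

```python
# Toy format validator for a five-character quatrain (wujue): character-level
# output makes this kind of constraint exact.

def check_quatrain(poem, chars_per_line=5, lines=4):
    """Validate `lines` lines of exactly `chars_per_line` characters each."""
    rows = poem.strip().split("\n")
    return len(rows) == lines and all(len(r) == chars_per_line for r in rows)

# Li Bai's "Quiet Night Thoughts": 4 lines x 5 characters.
poem = "床前明月光\n疑是地上霜\n举头望明月\n低头思故乡"
```

During decoding, the same length counters can be used to force a line break after every fifth character, which is the kind of hard format control the paper's >96% format accuracy reflects.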

Tags: Large Language Models, Multimodal, NLP, Semantic Search, Knowledge Graph, Information Extraction, ACL 2024
Written by

AntTech

Technology is the core driver of Ant's future creation.
